Re: [CentOS] XML parsing in shell script

2021-03-19 Thread H
On 03/19/2021 12:40 PM, Fabian Arrotin wrote:
> On 18/03/2021 22:08, H wrote:
>> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>>> On Thu, 18 Mar 2021, H wrote:
>>>
>>>> I have a challenge I am interested in getting feedback on.
>>>>
>>>> I will on a regular basis download a series of data files from the web
>>>> where the data is in XML-format. The format is known in advance but is
>>>> different between the various data files. I then plan to extract the
>>>> various data items ("elements?") from each data file, do some light
>>>> formatting and then save desired parts of each original data file as a
>>>> formatted CSV-file for later importing into a database.
>>>>
>>>> As the plan is to use a bash shell script using curl to get the files, I
>>>> have begun looking at external XML parsers that I can call from my script,
>>>> perhaps specify which elements I want, get the data back in some kind of
>>>> bash data structure and finally format and save as CSV-files.
>>>>
>>>> There seems to be a number of XML parsers available but perhaps someone on
>>>> the list has a recommendation for which one might suit my needs best? I
>>>> should add that I am running CentOS 7.
>>> Will you be using an XSLT stylesheet to do the work? There's a somewhat 
>>> steep learning curve, but in my experience it's the most reliable method 
>>> for parsing XML except in the very simplest of cases.
>>>
>>> In that case, the libxslt stuff may be what you want:
>>>
>>>   http://xmlsoft.org/libxslt/
>>>
>>> The command-line tool is xsltproc.
>>>
>>> Again, it's not easy to use, but once you've built a toolchain, it will be 
>>> reliable and fairly easy to modify if the source XML schema changes.
>>>
>> I just checked and I cannot see that the organization publishing these data 
>> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
>> that the publisher of the data would be one with said stylesheet. (Although 
>> perhaps that is something an end-user could put together as well??)
>>
>> Although the data format of each data series is unique, it is simple and 
>> could conceivably be parsed using grep but I am looking for a more 
>> "forward-looking" solution for other applications in the future.
>>
>> If XSLT stylesheets are not available - would you suggest another tool? Or, 
>> would you suggest I design sheets, presumably one for each data series?
>>
> I used in the past xmlstarlet (available in epel) for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to python ? (ymmv)
>
I wanted to do this in bash, so I decided to call xsltproc and to invest in 
writing an XSLT stylesheet for each data file format.
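
The "curl plus one stylesheet per format" plan could be wired together roughly
as follows. This is only a sketch: the function name fetch_and_convert, the
example URL and the file names are made up for illustration and do not come
from the thread.

```shell
#!/usr/bin/env bash
# Sketch: download one XML file and convert it with the stylesheet that
# matches its format. Function name, URL and file names are hypothetical.

# Usage: fetch_and_convert URL STYLESHEET OUTPUT_CSV
fetch_and_convert() {
  local url stylesheet out tmp
  url=$1
  stylesheet=$2
  out=$3
  tmp=$(mktemp) || return 1

  # curl -f: treat HTTP errors as failures; -sS: quiet, but still show errors.
  # xsltproc takes the stylesheet first, then the XML document.
  if curl -fsS "$url" -o "$tmp" && xsltproc "$stylesheet" "$tmp" > "$out"; then
    rm -f "$tmp"
  else
    # Do not leave a partial CSV behind if either step failed.
    rm -f "$tmp" "$out"
    return 1
  fi
}

# Example call (placeholder URL and file names):
# fetch_and_convert "https://example.org/data/series-a.xml" series-a.xsl series-a.csv
```

Downloading to a temporary file first, rather than piping curl straight into
xsltproc, means a failed download never produces a half-written CSV file.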

___
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] XML parsing in shell script

2021-03-19 Thread H
On 03/19/2021 03:25 PM, Leon Fauster via CentOS wrote:
> On 19.03.21 at 17:40, Fabian Arrotin wrote:
>> On 18/03/2021 22:08, H wrote:
>>> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>>>> On Thu, 18 Mar 2021, H wrote:
>>>>
>>>>> I have a challenge I am interested in getting feedback on.
>>>>>
>>>>> I will on a regular basis download a series of data files from the web
>>>>> where the data is in XML-format. The format is known in advance but is
>>>>> different between the various data files. I then plan to extract the
>>>>> various data items ("elements?") from each data file, do some light
>>>>> formatting and then save desired parts of each original data file as a
>>>>> formatted CSV-file for later importing into a database.
>>>>>
>>>>> As the plan is to use a bash shell script using curl to get the files, I
>>>>> have begun looking at external XML parsers that I can call from my
>>>>> script, perhaps specify which elements I want, get the data back in some
>>>>> kind of bash data structure and finally format and save as CSV-files.
>>>>>
>>>>> There seems to be a number of XML parsers available but perhaps someone
>>>>> on the list has a recommendation for which one might suit my needs best?
>>>>> I should add that I am running CentOS 7.
>>>>
>>>> Will you be using an XSLT stylesheet to do the work? There's a somewhat
>>>> steep learning curve, but in my experience it's the most reliable method
>>>> for parsing XML except in the very simplest of cases.
>>>>
>>>> In that case, the libxslt stuff may be what you want:
>>>>
>>>>   http://xmlsoft.org/libxslt/
>>>>
>>>> The command-line tool is xsltproc.
>>>>
>>>> Again, it's not easy to use, but once you've built a toolchain, it will be
>>>> reliable and fairly easy to modify if the source XML schema changes.

>>> I just checked and I cannot see that the organization publishing these data 
>>> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
>>> that the publisher of the data would be one with said stylesheet. (Although 
>>> perhaps that is something an end-user could put together as well??)
>>>
>>> Although the data format of each data series is unique, it is simple and 
>>> could conceivably be parsed using grep but I am looking for a more 
>>> "forward-looking" solution for other applications in the future.
>>>
>>> If XSLT stylesheets are not available - would you suggest another tool? Or, 
>>> would you suggest I design sheets, presumably one for each data series?
>>>
>>
>> I used in the past xmlstarlet (available in epel) for quick parsing from
>> within bash scripts.
>> For something more robust, maybe switch to python ? (ymmv)
>>
>
>
>
> To just grab a single value, use xmllint (it's in the libxml2 package):
>
> Example:
>
> XML input:
>
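> <?xml ... ?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>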
>
>
> bash var:
>
> STATUS=$(echo "${RESPONSE}" | xmllint --format --xpath \
> "//methodResponse/params/param/value/string/text()" - 2>/dev/null)
>
>
> -- 
> Leon
>

Thank you. I decided to put together an XSLT stylesheet for each data file 
format; I think this is the best approach for the future.



Re: [CentOS] XML parsing in shell script

2021-03-19 Thread H
On 03/18/2021 08:18 PM, H wrote:
> On 03/18/2021 05:53 PM, Paul Heinlein wrote:
>> On Thu, 18 Mar 2021, H wrote:
>>
>>> I just checked and I cannot see that the organization publishing these data 
>>> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
>>> that the publisher of the data would be one with said stylesheet. (Although 
>>> perhaps that is something an end-user could put together as well??)
>> Some high-profile XML schemata (e.g., DocBook) have published stylesheets, 
>> but mostly I've written my own. I have a very trivial example in a blog post 
>> from several years ago:
>>
>>   https://www.madboa.com/blog/2014/09/10/strip-rss/
>>
>> (My site is completely non-commercial. I gain nothing by you visiting it -- 
>> or ignoring it.)
>>
> I looked at your link above and the one in your previous e-mail - looks 
> very promising!
>
> I will take a look at creating an XSLT stylesheet over the weekend and try 
> creating a CSV-file in the desired format.
>
> Thank you!
>
I created an XSLT stylesheet for the data file I tried this on and it worked 
beautifully. I think the extra time spent designing a stylesheet is time well 
spent for any future changes to the data format.

Thank you!



Re: [CentOS] XML parsing in shell script

2021-03-19 Thread Leon Fauster via CentOS

On 19.03.21 at 17:40, Fabian Arrotin wrote:
> On 18/03/2021 22:08, H wrote:
>> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>>> On Thu, 18 Mar 2021, H wrote:
>>>
>>>> I have a challenge I am interested in getting feedback on.
>>>>
>>>> I will on a regular basis download a series of data files from the web
>>>> where the data is in XML-format. The format is known in advance but is
>>>> different between the various data files. I then plan to extract the
>>>> various data items ("elements?") from each data file, do some light
>>>> formatting and then save desired parts of each original data file as a
>>>> formatted CSV-file for later importing into a database.
>>>>
>>>> As the plan is to use a bash shell script using curl to get the files, I
>>>> have begun looking at external XML parsers that I can call from my script,
>>>> perhaps specify which elements I want, get the data back in some kind of
>>>> bash data structure and finally format and save as CSV-files.
>>>>
>>>> There seems to be a number of XML parsers available but perhaps someone on
>>>> the list has a recommendation for which one might suit my needs best? I
>>>> should add that I am running CentOS 7.
>>>
>>> Will you be using an XSLT stylesheet to do the work? There's a somewhat steep
>>> learning curve, but in my experience it's the most reliable method for parsing
>>> XML except in the very simplest of cases.
>>>
>>> In that case, the libxslt stuff may be what you want:
>>>
>>>   http://xmlsoft.org/libxslt/
>>>
>>> The command-line tool is xsltproc.
>>>
>>> Again, it's not easy to use, but once you've built a toolchain, it will be
>>> reliable and fairly easy to modify if the source XML schema changes.
>>>
>> I just checked and I cannot see that the organization publishing these data
>> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that
>> the publisher of the data would be one with said stylesheet. (Although perhaps
>> that is something an end-user could put together as well??)
>>
>> Although the data format of each data series is unique, it is simple and could
>> conceivably be parsed using grep but I am looking for a more "forward-looking"
>> solution for other applications in the future.
>>
>> If XSLT stylesheets are not available - would you suggest another tool? Or,
>> would you suggest I design sheets, presumably one for each data series?
>>
>
> I used in the past xmlstarlet (available in epel) for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to python ? (ymmv)

To just grab a single value, use xmllint (it's in the libxml2 package):

Example:

XML input:

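<?xml ... ?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>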



bash var:

STATUS=$(echo "${RESPONSE}" | xmllint --format --xpath \
"//methodResponse/params/param/value/string/text()" - 2>/dev/null)

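For anyone who wants to try the xmllint approach end to end, here is a
self-contained sketch. The RESPONSE content is an invented sample, shaped only
to match the XPath expression above; a real response would come from whatever
service is being queried.

```shell
#!/usr/bin/env bash
# Sketch: pull one value out of an XML-RPC style response with xmllint.
# The RESPONSE below is a made-up sample that matches the XPath expression.

RESPONSE='<?xml version="1.0"?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>'

# Quote the variable so the shell does not word-split or glob its content.
# The trailing "-" tells xmllint to read the document from standard input.
STATUS=$(printf '%s' "$RESPONSE" \
  | xmllint --xpath '//methodResponse/params/param/value/string/text()' - 2>/dev/null)

echo "STATUS=${STATUS}"
```

Because stderr is discarded, STATUS simply ends up empty when the document is
malformed or the XPath matches nothing, so a script should check for that case.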


--
Leon



Re: [CentOS] XML parsing in shell script

2021-03-19 Thread Fabian Arrotin
On 18/03/2021 22:08, H wrote:
> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>> On Thu, 18 Mar 2021, H wrote:
>>
>>> I have a challenge I am interested in getting feedback on.
>>>
>>> I will on a regular basis download a series of data files from the web 
>>> where the data is in XML-format. The format is known in advance but is 
>>> different between the various data files. I then plan to extract the 
>>> various data items ("elements?") from each data file, do some light 
>>> formatting and then save desired parts of each original data file as a 
>>> formatted CSV-file for later importing into a database.
>>>
>>> As the plan is to use a bash shell script using curl to get the files, I 
>>> have begun looking at external XML parsers that I can call from my script, 
>>> perhaps specify which elements I want, get the data back in some kind of 
>>> bash data structure and finally format and save as CSV-files.
>>>
>>> There seems to be a number of XML parsers available but perhaps someone on 
>>> the list has a recommendation for which one might suit my needs best? I 
>>> should add that I am running CentOS 7.
>>
>> Will you be using an XSLT stylesheet to do the work? There's a somewhat 
>> steep learning curve, but in my experience it's the most reliable method for 
>> parsing XML except in the very simplest of cases.
>>
>> In that case, the libxslt stuff may be what you want:
>>
>>   http://xmlsoft.org/libxslt/
>>
>> The command-line tool is xsltproc.
>>
>> Again, it's not easy to use, but once you've built a toolchain, it will be 
>> reliable and fairly easy to modify if the source XML schema changes.
>>
> I just checked and I cannot see that the organization publishing these data 
> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
> that the publisher of the data would be one with said stylesheet. (Although 
> perhaps that is something an end-user could put together as well??)
> 
> Although the data format of each data series is unique, it is simple and 
> could conceivably be parsed using grep but I am looking for a more 
> "forward-looking" solution for other applications in the future.
> 
> If XSLT stylesheets are not available - would you suggest another tool? Or, 
> would you suggest I design sheets, presumably one for each data series?
> 

In the past I have used xmlstarlet (available in EPEL) for quick parsing from
within bash scripts.
For something more robust, maybe switch to Python? (YMMV)
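
A quick xmlstarlet extraction from a bash script might look like the sketch
below. The sample document and its element names (observations, obs, date,
value) are invented for illustration; the real data files will differ.

```shell
#!/usr/bin/env bash
# Sketch: one-off extraction with xmlstarlet, no stylesheet file needed.
# The sample document and its element names are hypothetical.

sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<observations>
  <obs><date>2021-03-18</date><value>1.5</value></obs>
  <obs><date>2021-03-19</date><value>1.7</value></obs>
</observations>
EOF

# sel: query mode; -T: plain-text output; -t: start a template;
# -m: loop over matching nodes; -v: print a value; -o: print a literal;
# -n: print a newline.
rows=$(xmlstarlet sel -T -t -m '/observations/obs' \
         -v 'date' -o ',' -v 'value' -n "$sample")
printf '%s\n' "$rows"
rm -f "$sample"
```

With this sample the result is one comma-separated line per obs element.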

-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab


Re: [CentOS] XML parsing in shell script

2021-03-18 Thread H
On 03/18/2021 05:53 PM, Paul Heinlein wrote:
> On Thu, 18 Mar 2021, H wrote:
>
>> I just checked and I cannot see that the organization publishing these data 
>> files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming 
>> that the publisher of the data would be one with said stylesheet. (Although 
>> perhaps that is something an end-user could put together as well??)
>
> Some high-profile XML schemata (e.g., DocBook) have published stylesheets, 
> but mostly I've written my own. I have a very trivial example in a blog post 
> from several years ago:
>
>   https://www.madboa.com/blog/2014/09/10/strip-rss/
>
> (My site is completely non-commercial. I gain nothing by you visiting it -- 
> or ignoring it.)
>
I looked at your link above and the one in your previous e-mail - looks 
very promising!

I will take a look at creating an XSLT stylesheet over the weekend and try 
creating a CSV-file in the desired format.

Thank you!



Re: [CentOS] XML parsing in shell script

2021-03-18 Thread Paul Heinlein

On Thu, 18 Mar 2021, H wrote:

> I just checked and I cannot see that the organization publishing
> these data files offers any XSLT stylesheet. IOW, I am, perhaps
> incorrectly, assuming that the publisher of the data would be one
> with said stylesheet. (Although perhaps that is something an
> end-user could put together as well??)


Some high-profile XML schemata (e.g., DocBook) have published 
stylesheets, but mostly I've written my own. I have a very trivial 
example in a blog post from several years ago:


  https://www.madboa.com/blog/2014/09/10/strip-rss/

(My site is completely non-commercial. I gain nothing by you visiting 
it -- or ignoring it.)


--
Paul Heinlein
heinl...@madboa.com
45.38° N, 122.59° W


Re: [CentOS] XML parsing in shell script

2021-03-18 Thread H
On 03/18/2021 04:30 PM, Paul Heinlein wrote:
> On Thu, 18 Mar 2021, H wrote:
>
>> I have a challenge I am interested in getting feedback on.
>>
>> I will on a regular basis download a series of data files from the web where 
>> the data is in XML-format. The format is known in advance but is different 
>> between the various data files. I then plan to extract the various data 
>> items ("elements?") from each data file, do some light formatting and then 
>> save desired parts of each original data file as a formatted CSV-file for 
>> later importing into a database.
>>
>> As the plan is to use a bash shell script using curl to get the files, I 
>> have begun looking at external XML parsers that I can call from my script, 
>> perhaps specify which elements I want, get the data back in some kind of 
>> bash data structure and finally format and save as CSV-files.
>>
>> There seems to be a number of XML parsers available but perhaps someone on 
>> the list has a recommendation for which one might suit my needs best? I 
>> should add that I am running CentOS 7.
>
> Will you be using an XSLT stylesheet to do the work? There's a somewhat steep 
> learning curve, but in my experience it's the most reliable method for 
> parsing XML except in the very simplest of cases.
>
> In that case, the libxslt stuff may be what you want:
>
>   http://xmlsoft.org/libxslt/
>
> The command-line tool is xsltproc.
>
> Again, it's not easy to use, but once you've built a toolchain, it will be 
> reliable and fairly easy to modify if the source XML schema changes.
>
I just checked and I cannot see that the organization publishing these data 
files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that 
the publisher of the data would be one with said stylesheet. (Although perhaps 
that is something an end-user could put together as well??)

Although the data format of each data series is unique, it is simple and could 
conceivably be parsed using grep but I am looking for a more "forward-looking" 
solution for other applications in the future.

If XSLT stylesheets are not available - would you suggest another tool? Or, 
would you suggest I design sheets, presumably one for each data series?



Re: [CentOS] XML parsing in shell script

2021-03-18 Thread Paul Heinlein

On Thu, 18 Mar 2021, H wrote:


> I have a challenge I am interested in getting feedback on.
>
> I will on a regular basis download a series of data files from the
> web where the data is in XML-format. The format is known in advance
> but is different between the various data files. I then plan to
> extract the various data items ("elements?") from each data file, do
> some light formatting and then save desired parts of each original
> data file as a formatted CSV-file for later importing into a
> database.
>
> As the plan is to use a bash shell script using curl to get the
> files, I have begun looking at external XML parsers that I can call
> from my script, perhaps specify which elements I want, get the data
> back in some kind of bash data structure and finally format and save
> as CSV-files.
>
> There seems to be a number of XML parsers available but perhaps
> someone on the list has a recommendation for which one might suit my
> needs best? I should add that I am running CentOS 7.


Will you be using an XSLT stylesheet to do the work? There's a 
somewhat steep learning curve, but in my experience it's the most 
reliable method for parsing XML except in the very simplest of cases.


In that case, the libxslt stuff may be what you want:

  http://xmlsoft.org/libxslt/

The command-line tool is xsltproc.

Again, it's not easy to use, but once you've built a toolchain, it 
will be reliable and fairly easy to modify if the source XML schema 
changes.
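
To give a feel for the xsltproc route, here is a minimal sketch that turns a
small XML file into CSV. The sample document, its element names (observations,
obs, date, value) and the file names are assumptions made up for the example,
not the format of any real data feed.

```shell
#!/usr/bin/env bash
# Sketch: convert a small XML file to CSV with xsltproc.
# The element names and file names below are hypothetical.

workdir=$(mktemp -d)

# A made-up sample standing in for one downloaded data file.
cat > "$workdir/sample.xml" <<'EOF'
<?xml version="1.0"?>
<observations>
  <obs><date>2021-03-18</date><value>1.5</value></obs>
  <obs><date>2021-03-19</date><value>1.7</value></obs>
</observations>
EOF

# Stylesheet: plain-text output, one header row, one CSV row per <obs>.
cat > "$workdir/to-csv.xsl" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:text>date,value&#10;</xsl:text>
    <xsl:for-each select="observations/obs">
      <xsl:value-of select="date"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# xsltproc takes the stylesheet first, then the XML document.
csv=$(xsltproc "$workdir/to-csv.xsl" "$workdir/sample.xml")
printf '%s\n' "$csv"
rm -rf "$workdir"
```

This sketch does no CSV quoting, so it only suits fields that cannot contain
commas, double quotes or newlines. If the source schema changes, only the
select expressions in the stylesheet should need updating.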


--
Paul Heinlein
heinl...@madboa.com
45.38° N, 122.59° W