On 19.03.21 at 17:40, Fabian Arrotin wrote:
On 18/03/2021 22:08, H wrote:
On 03/18/2021 04:30 PM, Paul Heinlein wrote:
On Thu, 18 Mar 2021, H wrote:

I have a challenge I am interested in getting feedback on.

On a regular basis I will download a series of data files from the web where the data is
in XML format. The format is known in advance but differs between the various data
files. I then plan to extract the various data items ("elements"?) from each
data file, do some light formatting, and then save the desired parts of each original data
file as a formatted CSV file for later import into a database.

As the plan is to use a bash shell script using curl to get the files, I have 
begun looking at external XML parsers that I can call from my script, perhaps 
specify which elements I want, get the data back in some kind of bash data 
structure and finally format and save as CSV-files.

There seem to be a number of XML parsers available but perhaps someone on the 
list has a recommendation for which one might suit my needs best? I should add 
that I am running CentOS 7.

Will you be using an XSLT stylesheet to do the work? There's a somewhat steep 
learning curve, but in my experience it's the most reliable method for parsing 
XML except in the very simplest of cases.

In that case, the libxslt stuff may be what you want:

   http://xmlsoft.org/libxslt/

The command-line tool is xsltproc.

Again, it's not easy to use, but once you've built a toolchain, it will be 
reliable and fairly easy to modify if the source XML schema changes.

I just checked and I cannot see that the organization publishing these data 
files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that 
the publisher of the data would be the one to supply said stylesheet. (Although 
perhaps that is something an end user could put together as well?)

Although the data format of each data series is unique, it is simple and could 
conceivably be parsed using grep but I am looking for a more "forward-looking" 
solution for other applications in the future.

If XSLT stylesheets are not available - would you suggest another tool? Or, 
would you suggest I design stylesheets myself, presumably one for each data series?


In the past I have used xmlstarlet (available in EPEL) for quick parsing from
within bash scripts.
For something more robust, maybe switch to Python? (YMMV)
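
As a rough illustration of the Python route, here is a minimal sketch using only the standard library (xml.etree.ElementTree and csv). The sample document, the element names "record", "date" and "value", and the helper name xml_to_csv are all hypothetical, since the thread does not show the real data format; they would need to be adapted per data series.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the real feeds' element names are not given
# in the thread, so "records", "record", "date" and "value" are made up.
SAMPLE_XML = """<records>
  <record><date>2021-03-18</date><value>1.25</value></record>
  <record><date>2021-03-19</date><value>1.30</value></record>
</records>"""


def xml_to_csv(xml_text, row_path, fields):
    """Return CSV text with a header line and one row per element
    matching row_path (a path relative to the root element)."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for row in root.findall(row_path):
        # A missing child element becomes an empty CSV cell.
        writer.writerow([(row.findtext(f) or "").strip() for f in fields])
    return out.getvalue()


if __name__ == "__main__":
    print(xml_to_csv(SAMPLE_XML, "record", ["date", "value"]), end="")
```

A bash script could still do the download with curl and hand each file to a small script like this, with one row_path and field list per data series.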




To just grab a single value, use xmllint (it's in the libxml2 package):

Example:

XML input:

<?xml version="1.0" encoding="utf-8" ?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>


bash var:

STATUS=$(echo "${RESPONSE}" | xmllint --format --xpath "//methodResponse/params/param/value/string/text()" - 2>/dev/null)
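
For comparison, the same single-value lookup can be sketched in Python with the standard library, in case the script later moves away from bash. This is only an illustration: the function name extract_status is hypothetical, and the path is written relative to the root element because ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

# The XML input from the example above, as bytes.
RESPONSE = (
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b"<methodResponse><params><param><value>"
    b"<string>OK</string>"
    b"</value></param></params></methodResponse>"
)


def extract_status(xml_bytes):
    """Return the text of params/param/value/string, or None if absent."""
    root = ET.fromstring(xml_bytes)
    # root is the <methodResponse> element, so the path starts below it.
    return root.findtext("params/param/value/string")


if __name__ == "__main__":
    print(extract_status(RESPONSE))
```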


--
Leon

_______________________________________________
CentOS mailing list
CentOS@centos.org
https://lists.centos.org/mailman/listinfo/centos
