Devon, 24.10.2010 01:40:
I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
> [...]
I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file.  Any help would be greatly appreciated.
Mostly I am looking for a pointer in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing.

That chapter is mostly out of date, and BeautifulSoup is certainly not the right tool for dealing with XML, for both performance and compliance reasons. If you need performance, as you stated above, look at cElementTree in the stdlib.
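Parsing one of those files takes only a few lines. A minimal sketch (the <song>/<track> element and attribute names are made up for illustration; your feature-extraction tool defines the real ones — and note that in today's Python, plain xml.etree.ElementTree already uses the fast C implementation that cElementTree provided back then):

```python
import xml.etree.ElementTree as ET  # cElementTree in Python 2.x for speed

# Stand-in for the contents of one song's XML file (names assumed).
sample = """<song>
  <track duration="29.12331" tempo="120.0"/>
  <key pitch="C" mode="major"/>
</song>"""

root = ET.fromstring(sample)        # for a file on disk: ET.parse(path).getroot()
track = root.find("track")
print(track.get("duration"))        # attribute values come back as strings
```

For thousands of files you would just loop over the filenames and call ET.parse() on each; the parser is fast enough that this is rarely the bottleneck.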


And I am also concerned about how to use the tags in the XML files to
build feature names so I do not have to hard-code them. For example,
the first feature given by the above code would be "track duration"
with a value of 29.12331.

If the rules are as simple as that (i.e. tag name + attribute name), it'll be easy going with ElementTree. Don't put too much effort into separating the data from the XML format, though. XML parsing is fast, and it has the clear advantage over CSV files that the data is safely stored in a well-defined, expressive format, including character encoding and named data fields.
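That rule translates almost directly into code: walk every element, join tag and attribute names into a column header, and let csv.DictWriter lay out the table. A sketch under the same assumed file layout as above (real element names will differ):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for one song's XML file (element/attribute names assumed).
sample = """<song>
  <track duration="29.12331" tempo="120.0"/>
  <key pitch="C"/>
</song>"""

def features(xml_text):
    """Map 'tag attribute' -> value, e.g. 'track duration' -> '29.12331'."""
    root = ET.fromstring(xml_text)
    return {"%s %s" % (el.tag, attr): value
            for el in root.iter()
            for attr, value in el.attrib.items()}

# One dict per song; in practice, loop over your thousands of files here.
rows = [features(sample)]

out = io.StringIO()                 # or open("features.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=sorted(rows[0]))
writer.writeheader()                # feature names become the CSV header row
writer.writerows(rows)              # one row per song
print(out.getvalue())
```

Because the feature names are derived from the markup itself, nothing is hard-coded: a new attribute in the XML simply becomes a new column.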

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
