Devon, 24.10.2010 01:40:
I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
> [...]
I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file.  Any help would be greatly appreciated.
Mostly I am looking for a pointer in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing.

That chapter is mostly out of date, and BeautifulSoup is certainly not the right tool for dealing with XML, for both performance and compliance reasons. If you need performance, as you stated above, look at cElementTree in the stdlib.
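Parsing one of those files takes only a few lines. A minimal sketch (the <song>/<track> element and attribute names are made up for illustration; your feature-extraction tool defines the real ones — and note that in today's Python, plain xml.etree.ElementTree already uses the fast C implementation that cElementTree provided back then):

```python
import xml.etree.ElementTree as ET  # cElementTree in Python 2.x for speed

# Stand-in for the contents of one song's XML file (names assumed).
sample = """<song>
  <track duration="29.12331" tempo="120.0"/>
  <key pitch="C" mode="major"/>
</song>"""

root = ET.fromstring(sample)        # for a file on disk: ET.parse(path).getroot()
track = root.find("track")
print(track.get("duration"))        # attribute values come back as strings
```

For thousands of files you would just loop over the filenames and call ET.parse() on each; the parser is fast enough that this is rarely the bottleneck.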


And I am also concerned about how to use the tags in the XML files to
build feature names so I do not have to hard-code them. For example,
the first feature given by the above code would be "track duration"
with a value of 29.12331.

If the rules are as simple as that (i.e. tag name + attribute name), it'll be easy going with ElementTree. Don't put too much effort into separating the data from the XML format, though. XML parsing is fast, and it has the clear advantage over CSV files that the data is safely stored in a well-defined, expressive format, including character encoding and named data fields.
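That rule translates almost directly into code: walk every element, join tag and attribute names into a column header, and let csv.DictWriter lay out the table. A sketch under the same assumed file layout as above (real element names will differ):

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for one song's XML file (element/attribute names assumed).
sample = """<song>
  <track duration="29.12331" tempo="120.0"/>
  <key pitch="C"/>
</song>"""

def features(xml_text):
    """Map 'tag attribute' -> value, e.g. 'track duration' -> '29.12331'."""
    root = ET.fromstring(xml_text)
    return {"%s %s" % (el.tag, attr): value
            for el in root.iter()
            for attr, value in el.attrib.items()}

# One dict per song; in practice, loop over your thousands of files here.
rows = [features(sample)]

out = io.StringIO()                 # or open("features.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=sorted(rows[0]))
writer.writeheader()                # feature names become the CSV header row
writer.writerows(rows)              # one row per song
print(out.getvalue())
```

Because the feature names are derived from the markup itself, nothing is hard-coded: a new attribute in the XML simply becomes a new column.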

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
