Brandon McGinty wrote:
> Hi All,
>
> My goal is to be able to read the www.gutenberg.org
> <http://www.gutenberg.org/> rdf catalog, parse it into a python
> structure, and pull out data for each record.
>
> The catalog is a Dublin core RDF/XML catalog, divided into sections for
> each book and details for that book.
>
> I have done a very large amount of research on this problem.
>
> I’ve tried tools such as pyrple, sax/dom/minidom, and some others both
> standard and nonstandard to a python installation.
>
> None of the tools has been able to read this file successfully, and
> those that can even see the data can take up to half an hour to load
> with 2 gb of ram.
>
> So you all know what I’m talking about, the file is located at:
>
> http://www.gutenberg.org/feeds/catalog.rdf.bz2
>
> Does anyone have suggestions for a parser or converter, so I’d be able
> to view this file, and extract data?
>
> Any help is appreciated.
>
Well, have you tried xml.etree.cElementTree, a part of the standard library since 2.5? Well worth a go, as it seems to outperform many XML libraries.

The iterparse function is your best bet, allowing you to iterate over the events as you parse the source, thus avoiding the need to build a huge in-memory data structure just to get the parsing done. By default, only the end-element events are reported.

The following program took about four minutes to run on my not-terribly up-to-date Windows laptop with 1.5 GB of memory, using the pure Python version of ElementTree:

    import xml.etree.ElementTree as ET

    # Stream parse events instead of building the whole tree first.
    events = ET.iterparse(open("catalog.rdf"))
    count = 0
    for e in events:
        count += 1
        # Report progress every 100,000 events.
        if count % 100000 == 0:
            print count
    print count, "total events"

Here's an example output after I changed to using the extension module. I think you'll be impressed by the timing. The only change was to the import statement, which now reads

    import xml.etree.cElementTree as ET

[EMAIL PROTECTED] ~/Projects/Python
$ time python test19.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
1300000
1400000
1469971 total events

real    0m11.145s
user    0m10.124s
sys     0m0.580s

Good luck!

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
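
For actually pulling data out of each record, rather than just counting events, something along these lines might serve as a starting point. It is only a sketch, written for current Python 3, where xml.etree.ElementTree uses the C accelerator automatically when it is available. The pgterms namespace URI, the etext record element, and the choice of title and creator fields are assumptions about the catalog's layout, and iter_records is a hypothetical helper name, so check all of them against the real file. A tiny inline sample stands in for the catalog here.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespaces and record element; verify against the real catalog.
PGTERMS = "http://www.gutenberg.org/rdfterms/"
DC = "http://purl.org/dc/elements/1.1/"
RECORD_TAG = "{%s}etext" % PGTERMS


def iter_records(source, record_tag=RECORD_TAG):
    """Yield one dict per record element, emptying each element after use."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        yield {
            "title": elem.findtext("{%s}title" % DC),
            "creator": elem.findtext("{%s}creator" % DC),
        }
        # Discard the record's children and text once it has been read.
        # The emptied element itself stays attached to the root.
        elem.clear()


# Made-up sample data, shaped the way the catalog is assumed to look.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1">
    <dc:title>First Sample Title</dc:title>
    <dc:creator>Author One</dc:creator>
  </pgterms:etext>
  <pgterms:etext rdf:ID="etext2">
    <dc:title>Second Sample Title</dc:title>
  </pgterms:etext>
</rdf:RDF>
"""

records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Since iterparse accepts any file object opened for reading, it should also be possible to pass it bz2.open("catalog.rdf.bz2", "rb") and skip the separate decompression step, though I have not timed that against the real catalog.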