<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> I wrote a program that takes an XML file into memory using Minidom. I
> found out that the XML document is 10gb.
>
> I clearly need SAX or something else?
>

You clearly need something instead of XML. This sounds like a case where
a prototype that worked for the developer's simple test data set blows
up in the face of real user/production data.

XML adds a lot of overhead for nested structures, when the actual meat
of the data can be relatively small. Note also that this overhead grows
directly with the verbosity of the XML designer's tag names, and with
whether the designer was predisposed to use XML elements rather than
attributes.

Imagine a record structure for a 3D coordinate point (described here in
no particular coding language):

struct ThreeDimPoint:
    xValue : integer,
    yValue : integer,
    zValue : integer

Directly translated to XML, this gives:

<ThreeDimPoint>
    <xValue>4</xValue>
    <yValue>5</yValue>
    <zValue>6</zValue>
</ThreeDimPoint>

This expands 3 integers to a whopping 101 characters (counting
indentation and line breaks). Throw in namespaces for good measure, and
you inflate the data even more.

Many Java folks treat XML attributes as anathema, but look how using
them cuts down the data inflation:

<ThreeDimPoint xValue="4" yValue="5" zValue="6"/>

This is only 50 characters, or *only* about 4 times the size of the
contained data (assuming 4-byte integers).

Try zipping your 10 GB file and see what kind of compression you get.
I'll bet it's close to 30:1. If so, convert the data to a real data
storage medium. Even a SQLite database table should do better, and you
can ship it around just like a file (you just can't open it up like a
text file).

-- Paul

--
http://mail.python.org/mailman/listinfo/python-list
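
For what it's worth, here is a minimal sketch of the conversion suggested above: stream the XML one record at a time instead of loading it all with Minidom, and write each point into a SQLite table. It uses `xml.etree.ElementTree.iterparse` from the standard library rather than SAX proper. The `<points>` root element, the `points` table name, and the `load_points` helper are assumptions made up for illustration; the real file's layout is not known from this thread.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_points(xml_source, conn):
    """Stream <ThreeDimPoint> elements from xml_source into SQLite.

    xml_source may be a filename or a binary file object.
    Returns the number of points inserted.
    """
    # Hypothetical table layout: one integer column per coordinate.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)"
    )
    count = 0
    # iterparse hands back elements as they are parsed, so the whole
    # document never has to sit in memory at once.
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "ThreeDimPoint":
            row = (
                int(elem.findtext("xValue")),
                int(elem.findtext("yValue")),
                int(elem.findtext("zValue")),
            )
            conn.execute("INSERT INTO points VALUES (?, ?, ?)", row)
            count += 1
            # Discard records already processed so memory use stays flat.
            root.clear()
    conn.commit()
    return count


# Tiny stand-in for the real file; the <points> wrapper is an assumption.
SAMPLE = b"""<points>
<ThreeDimPoint><xValue>4</xValue><yValue>5</yValue><zValue>6</zValue></ThreeDimPoint>
<ThreeDimPoint><xValue>7</xValue><yValue>8</yValue><zValue>9</zValue></ThreeDimPoint>
</points>"""

conn = sqlite3.connect(":memory:")
n = load_points(io.BytesIO(SAMPLE), conn)
rows = conn.execute("SELECT x, y, z FROM points ORDER BY x").fetchall()
# rows == [(4, 5, 6), (7, 8, 9)]
```

For a real file, pass its path instead of the `BytesIO` object and connect to an on-disk database file instead of `":memory:"`. For a 10 GB input, batching the inserts (for example with `executemany`) would likely be faster than the row-at-a-time inserts shown here.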