On Mon, Sep 26, 2011 at 12:24 PM, Richard Quadling <rquadl...@gmail.com>wrote:

> Hi.
> I've got a project which will be needing to iterate some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
> The XML files have a root node and then a collection of products. In
> total, in all the files, there are going to be several million product
> details. Each XML feed will have a different structure as it relates
> to a different source of data.
> I plan to have an abstract reader class with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and has the ability to return a standardised view of
> the data for importing into mysql and eventually MongoDB.
> I want to use an XML iterator so that I can say something along the lines
> of ...
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML getting back one node at a time without keeping
> all the nodes in memory.
> e.g.
> <?php
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach($o_XML as $o_Product) {
>  // Process product.
> }
> Add to this that some of the xml feeds come .gz, I want to be able to
> stream the XML out of the .gz file without having to extract the
> entire file first.
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
> In this instance, the memory limitation is important. The current code
> is string based and whilst it works, you can imagine the complexity of
> it.
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.
> Thanks.
> Richard.
> --
> Richard Quadling
> Twitter : EE : Zend : PHPDoc
> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
I believe the XMLReader allows you to pull node by node, and it's really
easy to work with:

In terms of dealing with various forms of compression, I believe you con use
the compression streams to handle this:


Nephtali:  A simple, flexible, fast, and security-focused PHP framework

Reply via email to