> -----Original Message-----
> I need to process XML files with the following basic
> structure that in the worst case will be a couple of GB in
> size and contain on the order of ten million "Thing" elements
> (each of which has a few child elements and attributes a
> couple of levels deep that I want to handle using JiBX):
>
> <Things>
>   <Thing ... />
>   <Thing ... />
>   ...
> </Things>
>
> Obviously, this whole structure won't fit in memory. I
> suppose I could do my own parsing to consume the top-level
> element and then programmatically invoke JiBX's unmarshalling
> for each Thing, as suggested here:
> http://www.mail-archive.com/jibx-users@lists.sourceforge.net/msg00535.html.
> However, that would mean invoking JiBX several million times,
> and I'm not sure what that would mean in terms of overhead.
> Does it make any sense to do it like that? Or is there some
> other way I could approach this? Performance is not a big
> issue, by the way. I'm mostly interested in doing some
> transformations that JiBX's mappings should be well suited
> for, and to avoid having to chop up the files manually.

Apart from the fact that a database would be more suitable than XML files for data of this size, JiBX offers two ways to read large element collections efficiently:

1) Use the collection approach in combination with a collection factory and a post-set hook method, including user-defined store-method and iter-method hook functions as well as a single-element factory hook. This is actually the most elegant way to do what you want, because you can, for example, simulate a ring-buffered container that never holds more than a predefined number of elements (this number could even be 1).

2) Don't use a collection, but use a normal structure element in combination with allow-repeats, plus a factory and a post-set hook method for the collected element type. This approach is slightly shorter to write but somewhat more intrusive with regard to the requirements on your data structures. It is equally efficient, because JiBX will create only a single instance of your element type to read each and every one of your million elements, and the post-set method that is called gives you the opportunity to do whatever you like with each freshly read element.
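
To make the ring-buffer idea from option 1 more concrete, here is a minimal Java sketch of such a container. It is plain Java with no JiBX dependency: the class names RingItemContainer and RingDemo, the field types of ItemData, and the method signatures are assumptions for illustration only. The signatures JiBX actually requires for store-method and iter-method hooks should be checked against the JiBX binding documentation before wiring anything like this into a binding.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// One <item>; the field types are assumed, the binding only fixes the names.
class ItemData {
    int id;
    int x;
    int y;
    double intensity;
}

// Hypothetical container that keeps only the most recent `capacity` items.
class RingItemContainer {
    private final int capacity;
    private final Deque<ItemData> buffer = new ArrayDeque<>();
    private long seen; // total number of items ever added

    RingItemContainer(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
    }

    // Role of the store hook: do per-item work here, then evict the oldest
    // item so memory use stays bounded however many items arrive.
    void addItem(ItemData item) {
        seen++;
        if (buffer.size() == capacity) {
            buffer.removeFirst();
        }
        buffer.addLast(item);
    }

    // Role of the iterator hook: walks the retained items, oldest first.
    Iterator<ItemData> getIterator() {
        return buffer.iterator();
    }

    int size() { return buffer.size(); }

    long seen() { return seen; }
}

// Stand-in for the unmarshaller: feeds five items into a buffer of two.
class RingDemo {
    public static void main(String[] args) {
        RingItemContainer items = new RingItemContainer(2);
        for (int id = 1; id <= 5; id++) {
            ItemData d = new ItemData();
            d.id = id;
            items.addItem(d);
        }
        System.out.println("seen=" + items.seen() + " kept=" + items.size());
    }
}
```

With a capacity of 1 the container holds only the item currently being processed, which corresponds to the remark in option 1 that the number could be 1.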

Here are prototypes of both approaches.

The prototype input XML:

<whatever>
  <myitems>
    <item id="1" x="13" y="-111" intens="1234.12"/>
    <item id="2" x="111" y="0" intens="42.01"/>
    <item id="3" x="666"/>
    <item id="4" x="1038" y="2"/>
    <item id="5" y="87" intens="3.1415"/>
  </myitems>
</whatever>

(1)

<binding name="testcoll">
  <mapping name="whatever" class="ItemContainer"
           factory="ItemReaderColl.containerFactory" post-set="postSet">
    <collection name="myitems" item-type="ItemData"
                store-method="addItem" iter-method="getIterator">
      <structure name="item" value-style="attribute" type="ItemData"
                 factory="ItemContainer.newItem" usage="optional">
        <value name="id" field="id"/>
        <value name="x" field="x" usage="optional"/>
        <value name="y" field="y" usage="optional"/>
        <value name="intens" field="intensity" usage="optional"/>
      </structure>
    </collection>
  </mapping>
</binding>

(2)

<binding name="testrepeat">
  <mapping name="whatever" class="ItemWrapper"
           factory="ItemReader.newItemWrapper" post-set="postSet">
    <structure name="myitems" ordered="false" allow-repeats="true">
      <structure name="item" field="singleItem"
                 factory="ItemWrapper.newItem" post-set="postSet"
                 usage="optional">
        <structure map-as="ItemData"/>
      </structure>
    </structure>
  </mapping>
  <mapping class="ItemData" abstract="true" value-style="attribute">
    <value name="id" field="id"/>
    <value name="x" field="x" usage="optional"/>
    <value name="y" field="y" usage="optional"/>
    <value name="intens" field="intensity" usage="optional"/>
  </mapping>
</binding>

Note that the map-as approach is not necessary; I only carried it over from an existing use case.

Greetings from Bremen,

Daniel Krügler

_______________________________________________
jibx-users mailing list
jibx-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/jibx-users