Re: [jibx-users] Unmarshalling of large xml documents

Lennart Krakelin Thu, 28 Sep 2006 06:58:29 -0700

From: Krügler Daniel <[EMAIL PROTECTED]>
Reply-To: JiBX users <jibx-users@lists.sourceforge.net>
To: "JiBX users" <jibx-users@lists.sourceforge.net>
Subject: Re: [jibx-users] Unmarshalling of large xml documents
Date: Thu, 28 Sep 2006 08:19:12 +0200


> -----Original Message-----
> I need to process xml files with the following basic
> structure that in the worst case will be a couple of GB in
> size and contain on the order of ten million "Thing" elements
> (each of which has a few child elements and attributes a
> couple of levels deep that I want to handle using JiBX):
>
> <Things>
>   <Thing ... />
>   <Thing ... />
>   ...
> </Things>
>
> Obviously, this whole structure won't fit in memory. I
> suppose I could do my own parsing to consume the top-level
> element and then programmatically invoke JiBX's unmarshalling
> for each Thing, as suggested here:
> http://www.mail-archive.com/jibx-users@lists.sourceforge.net/msg00535.html.
> However, that would mean invoking JiBX several million times,
> and I'm not sure what that would mean in terms of overhead.
> Does it make any sense to do it like that? Or is there some
> other way I could approach this? Performance is not a big
> issue, by the way. I'm mostly interested in doing some
> transformations that JiBX's mappings should be well suited
> for, and to avoid having to chop up the files manually.

Apart from the fact that a database would be more reasonable than
XML files for such data sizes, JiBX offers two ways to read large
element collections efficiently:

1) Use the collection approach in combination with a collection
factory and post-set hook method, including user-defined store-method
and iter-method hook functions as well as a single-element factory
hook. This is actually the most elegant way to do what you
want, because you can, e.g., simulate a ring-buffered container that
never holds more than a predefined number of elements (this number
could even be 1).

2) Don't use a collection, but use a normal structure element in
combination with allow-repeats, a factory, and a post-set hook method
for the collected element type. This approach is slightly shorter to
write but somewhat more intrusive with regard to the requirements on
your data structures. It is just as efficient, because JiBX
will create only a single instance of your element type to read
each of your million elements, and the post-set method that is called
gives you the opportunity to do whatever you need with each freshly
read element.

Here are prototypes of both approaches:

The prototype input XML:

<whatever>
        <myitems>
          <item id="1" x="13" y="-111" intens="1234.12"/>
          <item id="2" x="111" y="0" intens="42.01"/>
          <item id="3" x="666"/>
          <item id="4" x="1038"  y="2"/>
          <item id="5" y="87" intens="3.1415"/>
        </myitems>
</whatever>

(1)

<binding name="testcoll">
        <mapping name="whatever" class="ItemContainer"
          factory="ItemReaderColl.containerFactory" post-set="postSet">
            <collection name="myitems" item-type="ItemData"
              store-method="addItem" iter-method="getIterator">
                        <structure name="item" value-style="attribute" type="ItemData"
                            factory="ItemContainer.newItem" usage="optional">
                          <value name="id" field="id"/>
                          <value name="x" field="x" usage="optional"/>
                          <value name="y" field="y" usage="optional"/>
                          <value name="intens" field="intensity" usage="optional"/>
                        </structure>
            </collection>
        </mapping>
</binding>
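
The Java side of binding (1) is not shown in this thread. What follows is only a minimal sketch of how the hook methods named in the binding (containerFactory, newItem, addItem, getIterator, postSet) might look for a ring-buffered container. The class bodies, the capacity of 1, the process() helper, and the demo class are all assumptions made for illustration. In particular, the two-argument (index, item) signature of addItem reflects my understanding of what store-method expects; check it against the binding reference for your JiBX version. The sketch runs without JiBX by calling the hooks by hand in the order the unmarshaller would presumably call them.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical item type; field names follow the binding above.
class ItemData {
    int id;
    int x;
    int y;
    double intensity;
}

// Hypothetical container that never holds more than `capacity` items.
class ItemContainer {
    private final int capacity;
    private final Deque<ItemData> window = new ArrayDeque<>();
    private long processed;

    ItemContainer(int capacity) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be >= 1");
        this.capacity = capacity;
    }

    // Named by factory="ItemContainer.newItem" in the binding.
    static ItemData newItem() { return new ItemData(); }

    // Named by store-method="addItem"; the (index, item) signature is an assumption.
    public void addItem(int index, ItemData item) {
        if (window.size() == capacity) {
            process(window.removeFirst()); // evict the oldest item before storing
        }
        window.addLast(item);
    }

    // Named by iter-method="getIterator"; only relevant when marshalling.
    public Iterator<ItemData> getIterator() { return window.iterator(); }

    // Named by post-set="postSet"; flushes whatever is still buffered.
    public void postSet() {
        while (!window.isEmpty()) process(window.removeFirst());
    }

    // Placeholder for the real per-item work (transform, write out, ...).
    private void process(ItemData item) { processed++; }

    long processedCount() { return processed; }
    int buffered() { return window.size(); }
}

class ItemReaderColl {
    // Named by factory="ItemReaderColl.containerFactory" in the binding.
    static ItemContainer containerFactory() { return new ItemContainer(1); }
}

class RingBufferDemo {
    public static void main(String[] args) {
        // Simulates the calls made while reading five <item> elements.
        ItemContainer c = ItemReaderColl.containerFactory();
        for (int i = 0; i < 5; i++) {
            ItemData d = ItemContainer.newItem();
            d.id = i + 1;
            c.addItem(i, d);
        }
        c.postSet();
        System.out.println("processed=" + c.processedCount() + " buffered=" + c.buffered());
    }
}
```

With a capacity of 1, each stored item pushes its predecessor into process(), so memory use stays flat regardless of how many items the document contains.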

(2)

<binding name="testrepeat">
        <mapping name="whatever" class="ItemWrapper"
          factory="ItemReader.newItemWrapper" post-set="postSet">
            <structure name="myitems" ordered="false" allow-repeats="true">
                <structure name="item" field="singleItem" factory="ItemWrapper.newItem"
                  post-set="postSet" usage="optional">
                    <structure map-as="ItemData"/>
                </structure>
            </structure>
        </mapping>
        <mapping class="ItemData" abstract="true" value-style="attribute">
                <value name="id" field="id"/>
                <value name="x" field="x" usage="optional"/>
                <value name="y" field="y" usage="optional"/>
                <value name="intens" field="intensity" usage="optional"/>
        </mapping>
</binding>

Note that the map-as approach is not necessary; I only kept it because I took this from an existing use case.
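
The classes behind binding (2) are likewise not shown here, so the following is a hedged sketch rather than the original code. It assumes that the inner post-set="postSet" is invoked on the ItemData instance once its attributes have been read, that ItemWrapper.newItem may hand back one shared instance, and that a static sink callback is an acceptable way to get each item out. The sink field, the field reset, and the demo class are hypothetical names introduced for illustration. As before, the hooks are called by hand so the sketch runs without JiBX.

```java
import java.util.function.Consumer;

// Hypothetical item type; field names follow the abstract mapping above.
class ItemData {
    int id;
    int x;
    int y;
    double intensity;

    // Assumed target of the inner post-set hook: one <item> has just been read.
    public void postSet() {
        ItemWrapper.sink.accept(this);
        // Clear optional fields so stale values do not leak into the next item.
        x = 0;
        y = 0;
        intensity = 0.0;
    }
}

class ItemWrapper {
    // Hypothetical callback that receives each freshly read item.
    static Consumer<ItemData> sink = item -> { };

    // The single instance that is reused for every <item> element.
    private static final ItemData SHARED = new ItemData();

    // Named by field="singleItem" in the binding.
    ItemData singleItem;

    // Named by factory="ItemWrapper.newItem": always returns the shared instance.
    static ItemData newItem() { return SHARED; }

    // Named by post-set="postSet" on the mapping: the whole document has been read.
    public void postSet() { singleItem = null; }
}

class ItemReader {
    // Named by factory="ItemReader.newItemWrapper" in the binding.
    static ItemWrapper newItemWrapper() { return new ItemWrapper(); }
}

class RepeatDemo {
    public static void main(String[] args) {
        ItemWrapper.sink = item -> System.out.println("item " + item.id + " x=" + item.x);
        ItemWrapper w = ItemReader.newItemWrapper();
        int[] xs = {13, 111, 666};
        for (int i = 0; i < xs.length; i++) {
            ItemData d = ItemWrapper.newItem(); // same object every time
            d.id = i + 1;
            d.x = xs[i];
            w.singleItem = d;
            d.postSet();
        }
        w.postSet();
    }
}
```

Because the same object is overwritten for every element, the sink must copy anything it wants to keep beyond the callback.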


Greetings from Bremen,

Daniel Krügler

Thanks a lot! The collection approach worked like a charm. As for the suitability of XML for this purpose, the data is delivered in this format, so I don't really have a choice in the matter.


/Lennart


