Mridul Muralidharan
Wed, 02 Jul 2008 13:40:23 -0700
Kayla Jay wrote:
I have two types of XML documents. 1) 1 big huge file that has many XML documents contained within
If there is some way to identify a document boundary (for example, a document occupies only a single line, or the start tag is unique), then you can load based on that. Something like:
in bind -> keep searching down the stream until you hit a start tag.
in getNext -> if past the end of the file split, return null; otherwise create the document using sax and return its string representation.
In one of the UDFs I had to write, each document was on a single line, so I was just doing line-based reading.
One thing to keep in mind in bind() is that you might already be at the start of the next tuple/document, so check for that before going down the stream to find the end of the document. [IIRC this is a bug that PigStorage() currently suffers from when the file split is at a record boundary.]
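The bind/getNext idea above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Pig's loader API: the class name `TagSplitReader`, the helper `read_split`, and the in-memory `bytes` input are all assumptions made for the example, and it slices the raw text between a literal start tag and end tag instead of running a sax parser.

```python
class TagSplitReader:
    """Hypothetical reader: returns the documents that START inside one file split."""

    def __init__(self, start_tag: bytes, end_tag: bytes):
        # Assumes both tags are unique literal byte strings, e.g. b"<doc>" / b"</doc>".
        self.start_tag = start_tag
        self.end_tag = end_tag

    def bind(self, data: bytes, offset: int, end: int) -> None:
        self.data = data
        self.end = end
        # find() also matches a start tag sitting exactly at `offset`, so a split
        # that begins on a document boundary does not skip its first document.
        self.pos = data.find(self.start_tag, offset)

    def get_next(self):
        # No further start tag, or the next document belongs to a later split.
        if self.pos < 0 or self.pos >= self.end:
            return None
        close = self.data.find(self.end_tag, self.pos)
        if close < 0:
            return None  # truncated document: no end tag found
        stop = close + len(self.end_tag)
        doc = self.data[self.pos:stop]  # may run past `end`; this split owns it
        self.pos = self.data.find(self.start_tag, stop)
        return doc.decode("utf-8")


def read_split(data, offset, end, start_tag=b"<doc>", end_tag=b"</doc>"):
    """Collect every document owned by the split [offset, end)."""
    reader = TagSplitReader(start_tag, end_tag)
    reader.bind(data, offset, end)
    docs = []
    while True:
        doc = reader.get_next()
        if doc is None:
            return docs
        docs.append(doc)
```

A document belongs to the split in which its start tag begins, so a document that straddles a split boundary is read to completion by the earlier split and skipped by the later one.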
2) Logs of mid-size individual XML documents in a directory.
In this case, you just need a simple udf (a variant of the above) where you read the document in its entirety from the stream and return its string form. If a file has only a single document, then you might also want to always return null if start offset != 0, so that only the split at the start of the file loads it.
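A minimal sketch of that whole-file variant, again in plain Python and not against Pig's actual loader interface. The function name `load_whole_document` and the `start_offset` parameter are hypothetical stand-ins for whatever the loader is handed when it is bound to a split.

```python
import io


def load_whole_document(stream, start_offset):
    """Return the entire file as one document, or None for non-initial splits."""
    # Only the split that begins at offset 0 loads the file; any other split of
    # the same file returns nothing, so the document is never emitted twice.
    if start_offset != 0:
        return None
    return stream.read().decode("utf-8")


# Usage: the first split yields the document, a later split yields nothing.
first = load_whole_document(io.BytesIO(b"<doc>hello</doc>"), 0)
later = load_whole_document(io.BytesIO(b"<doc>hello</doc>"), 4096)
```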
Based on number of documents, hadoop will do a distributed load of them in parallel using your loaders.
The difference between (1) and (2) is that in (2) each file is already a logical split of documents, so the load is parallel across files - while in (1) you have to identify the split points yourself.
I'm not exactly sure how to use Pig to find associations once I can parse an XML document.
Unfortunately, pig does not support complex types (even if Readable/Writable) or even custom aggregations of simple types - the only aggregations supported are through its own primitives: tuple, bag, datamap, etc. [Simple types are coming in the next version - the current version only deals with strings, though there are some idioms which work on long/float/string, like comparisons for example.]
In your case, if you can't pull the required info out as part of the load and return it (which works only if you know all the xpaths that are going to be applied to all documents), the alternative is to return the string representation and serialize/deserialize the string/dom each time you want to apply xpaths to it or modify it in your custom filterfunc/evalfunc.
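The two options can be sketched like this. It is only an illustration in Python: `extract_fields` and `xpath_values` are hypothetical names, the sample document is made up, and `xml.etree.ElementTree` supports only a limited subset of XPath, so a real Pig filterfunc/evalfunc would use a Java XPath library.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_text, paths):
    """Option 1: apply a known, fixed list of paths once, at load time."""
    root = ET.fromstring(xml_text)
    row = []
    for path in paths:
        node = root.find(path)  # first match only, or None if absent
        row.append(node.text if node is not None else None)
    return tuple(row)


def xpath_values(xml_text, path):
    """Option 2: carry the document as a string and re-parse it on every call."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


# Made-up sample document for illustration.
doc = "<order><id>7</id><item><sku>a1</sku></item><item><sku>b2</sku></item></order>"
row = extract_fields(doc, ["id", "item/sku"])
skus = xpath_values(doc, "item/sku")
```

The first option parses each document once but fixes the set of paths up front; the second keeps the paths flexible at the cost of re-parsing the string in every function that touches it.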
Others might have better ideas of working with these though ... the above is only based on my experience of using pig :-)
Regards, Mridul
----- Original Message ----
From: Mridul Muralidharan <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org
Sent: Tuesday, July 1, 2008 6:32:28 PM
Subject: Re: Pig + xml ?

Hi,
What does the input look like ?
- A single file with multiple xml documents ?
- A directory containing a lot of individual xml files ?

If the former, then it might be slightly tricky since you will need to identify when an xml document "ends" and another "begins". In simple cases, it could be an entire document in a single line - in which case, a simple line based loader will handle things fine. If you have a special root node in your schema, you can use that for finding the start of a document/end of document and parse that out as a document/sax/etc.

Regards,
Mridul

Kayla Jay wrote:
Hi
Can you use Pig with XML data files? If so, does anyone have any examples? I want to do something that would equate to an XPath query against the XML.
Thanks.