Re: Regarding loading a big XML file to HDFS

Steve Loughran Tue, 22 Nov 2011 03:20:37 -0800

On 22/11/11 07:33, Bejoy Ks wrote:

Such processing would hardly make sense for complex XML, as XML is based
entirely on parent-child relationships. (It would work well for simple XML
with just one level of hierarchy.)


That is, provided nobody is doing XML namespace declarations:

<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2">
 <car > ... </car>
</m1:vehicle>

In such a world the vehicle element name is the tuple ("uri:model1","vehicle") but that of the nested element is ("uri:model2","car").

The way XML namespace handling is done implies the entire parent tree needs to be parsed before you can be confident of the namespace which an XML element and its attributes belong to.
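A minimal sketch of that point, using Python's standard xml.etree.ElementTree. It assumes a document shaped like the example above, with the m1 prefix bound to uri:model1 and uri:model2 as the default namespace; the helper name qualified_names is made up for illustration. It parses the whole document, then the nested element on its own, as a mapper would see it with the parent cut off.

```python
import xml.etree.ElementTree as ET

# Assumed document: prefix m1 bound to uri:model1, default namespace uri:model2.
FULL_DOC = (
    '<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2">'
    "<car>...</car>"
    "</m1:vehicle>"
)

# The same <car> element with the parent (and its declarations) missing.
FRAGMENT = "<car>...</car>"


def qualified_names(xml_text):
    """Return every element tag in ElementTree's {namespace}local form."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


print(qualified_names(FULL_DOC))
print(qualified_names(FRAGMENT))
```

With the parent present the names come out as {uri:model1}vehicle and {uri:model2}car; parsed alone, the fragment is just car with no namespace at all, so the two parses disagree about which element it is.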

Say, for example, consider mock XML like the below:

<Vehicle>
     <Car>
         <BMW>
             <Sedan>
                 <3-Series>
                     <min-torque></min-torque>
-----------------------------------------------------------------------------------------------------------------------------------
                     <max-torque></max-torque>
                 </3-Series>
             </Sedan>
             <SUV>
             </SUV>
         </BMW>
     </Car>
     <Truck>
     </Truck>
     <Bus>
     </Bus>
</Vehicle>

Even if we split it in between (even if the split happens at a line boundary)
it would be hard to process as the opening tags come in one block under one
mapper's boundary and the closing tags come in another block under another
mapper's boundary. So if we are mining some data from them it hardly makes
sense.

Most record scans pull in a bit of trailing data from the next block; it's generally not very much and not worth worrying about. Collect some data on average record length and assume that as your usual over-read.
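A rough sketch of that over-read convention, assuming flat (non-nested) records delimited by fixed byte strings and an in-memory buffer standing in for the file. The function name records_in_split and the <rec> tags are hypothetical, and the plain byte matching carries the same encoding and namespace caveats as any other pattern-based scan. Each split owns the records that start inside it and reads past its end to finish the last one.

```python
def records_in_split(data, start, end, start_tag, end_tag):
    """Yield each record whose start tag begins inside [start, end).

    The final record may run past `end`; that tail is the over-read
    into the next block.
    """
    pos = start
    while True:
        begin = data.find(start_tag, pos)
        if begin == -1 or begin >= end:
            return  # the next record, if any, belongs to a later split
        close = data.find(end_tag, begin)
        if close == -1:
            return  # truncated input: no complete record left
        stop = close + len(end_tag)
        yield data[begin:stop]
        pos = stop


DATA = b"<root><rec>a</rec><rec>bb</rec><rec>ccc</rec></root>"

# Two splits with the boundary at byte 20, in the middle of the second record.
first = list(records_in_split(DATA, 0, 20, b"<rec>", b"</rec>"))
second = list(records_in_split(DATA, 20, len(DATA), b"<rec>", b"</rec>"))
print(first)
print(second)
```

The first split returns the first two records, reading 11 bytes beyond its boundary to finish the second; the second split skips that partial record and returns only the third, so nothing is processed twice.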

We need to incorporate logic here, in terms of regex or so, to
identify the closing tags from the second block,

regexps which invariably contain assumptions about the encoding of content within the XML document, break if the encoding is UTF-16 or something else, and are still namespace-brittle.
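A small illustration of both failure modes, assuming a byte-level pattern for a closing tag of the kind a split reader might use. The pattern, the sample strings and the uri:bmw namespace are invented for the example.

```python
import re

# Byte pattern for a closing tag; it silently assumes one byte per character.
CLOSING_TAG = re.compile(rb"</Sedan>")


def finds_closing_tag(text, encoding):
    """Encode the text as it would sit on disk, then search the raw bytes."""
    return CLOSING_TAG.search(text.encode(encoding)) is not None


SAMPLE = "<BMW><Sedan></Sedan></BMW>"
# Same structure, but the elements carry a namespace prefix.
PREFIXED = '<b:BMW xmlns:b="uri:bmw"><b:Sedan></b:Sedan></b:BMW>'

print(finds_closing_tag(SAMPLE, "utf-8"))
print(finds_closing_tag(SAMPLE, "utf-16"))
print(finds_closing_tag(PREFIXED, "utf-8"))
```

Only the first call finds the tag. In UTF-16 every ASCII character is two bytes wide, so the four-byte sequence the pattern looks for never appears, and in the prefixed document the closing tag is spelled </b:Sedan>.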

Maybe one query remains: why use MapReduce for XML if we can't exploit
parallel processing?

Why use XML for your persistent format if you can only parse it through a (stateful) recursive process, so limiting you to the bandwidth of your parser accessing a single file?

- We can process multiple small XML files in parallel, one in each mapper,
without splitting, to mine and extract some information for processing. But
we lose a good deal of data locality here.


no, you aggregate lots of small XML records into a HAR
