On 22/11/11 07:33, Bejoy Ks wrote:
Such processing would hardly make sense for complex XML documents,
as XML is based entirely on parent-child relationships. (It would
work well for simple XML files with just one level of hierarchy.)
That is, provided nobody is using XML namespace declarations:
<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2">
<car > ... </car>
</m1:vehicle>
In such a world the vehicle element name is the tuple ("uri:model1",
"vehicle") but that of the nested element is ("uri:model2", "car").
The way XML namespace handling is done implies that the entire chain of
parent elements needs to be parsed before you can be confident of the
namespace to which an XML element and its attributes belong.
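
As a rough illustration of the point, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The helper name qualified_name is made up for this example. It shows the (namespace, local name) tuples for the document above, and that the same <car/> fragment read without its parent loses its namespace.

```python
import xml.etree.ElementTree as ET

# Same shape as the example above: prefix m1 and a default namespace.
DOC = '<m1:vehicle xmlns:m1="uri:model1" xmlns="uri:model2"><car/></m1:vehicle>'


def qualified_name(tag):
    """Split ElementTree's '{uri}local' tag form into a (uri, local) tuple.

    Hypothetical helper, not part of ElementTree itself.
    """
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return (uri, local)
    return (None, tag)


# Parsed with its parent, <car/> inherits the default namespace.
root = ET.fromstring(DOC)
names = [qualified_name(el.tag) for el in root.iter()]

# Parsed in isolation, the same fragment has no namespace at all.
orphan = qualified_name(ET.fromstring("<car/>").tag)

print(names)
print(orphan)
```

A mapper that starts reading at <car/> is in the position of the second parse: it cannot know which namespace applies without having seen the ancestors.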
For example, consider the mock XML below:
<Vehicle>
<Car>
<BMW>
<Sedan>
<3-Series>
<min-torque></min-torque>
-----------------------------------------------------------------------------------------------------------------------------------
<max-torque></max-torque>
</3-Series>
</Sedan>
<SUV>
</SUV>
</BMW>
</Car>
<Truck>
</Truck>
<Bus>
</Bus>
</Vehicle>
Even if we split it in between (even if the split happens at a line
boundary), it would be hard to process, as the opening tags fall in one
block within one mapper's boundary and the closing tags fall in another
block within another mapper's boundary. So if we are mining data from
them, it hardly makes sense.
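
A small sketch of that problem in Python, assuming a cut-down version of the mock document (the element names are just for illustration) and a split that falls exactly on a line boundary. Neither half is well-formed on its own, so a strict parser rejects both.

```python
import xml.etree.ElementTree as ET

# Cut-down variant of the mock document; content is illustrative only.
DOC = """<Vehicle>
<Car>
<Sedan>
<min-torque></min-torque>
<max-torque></max-torque>
</Sedan>
</Car>
</Vehicle>
"""


def is_well_formed(text):
    """Return True if the text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Split at a line boundary, as a block boundary might fall.
lines = DOC.splitlines(keepends=True)
first_half = "".join(lines[:4])   # opening tags, never closed
second_half = "".join(lines[4:])  # closing tags with no matching open

whole_ok = is_well_formed(DOC)
first_ok = is_well_formed(first_half)
second_ok = is_well_formed(second_half)
print(whole_ok, first_ok, second_ok)
```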
Most record scans pull in a bit of trailing data from the next block;
it's generally not very much and not worth worrying about. Collect some
data on average record length and assume that as your usual over-read.
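
A back-of-the-envelope version of that estimate, as a Python sketch. The record lengths and the 64 MiB block size below are hypothetical placeholders, not measurements; substitute figures sampled from your own data and cluster configuration.

```python
def expected_overread(record_lengths, block_size):
    """Estimate over-read per block as the average record length.

    Returns (average record length in bytes, that length as a
    fraction of the block size).
    """
    avg = sum(record_lengths) / len(record_lengths)
    return avg, avg / block_size


# Hypothetical sample of record sizes in bytes.
sample_lengths = [1800, 2200, 2000]
# Assumed block size: 64 MiB.
block_size = 64 * 1024 * 1024

avg_len, fraction = expected_overread(sample_lengths, block_size)
print(f"average record: {avg_len:.0f} bytes, over-read fraction: {fraction:.6%}")
```

With records measured in kilobytes and blocks in tens of megabytes, the fraction is tiny, which is why it is usually not worth worrying about.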
We need to incorporate logic here, in terms of regex or similar, to
identify the closing tags from the second block,
regexps which invariably contain assumptions about the encoding of
content within the XML document, break if the encoding is UTF-16 or
something else, and are still namespace-brittle.
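
To make both failure modes concrete, here is a minimal Python sketch. The pattern and the sample documents are made up for illustration. A byte-level regex for a closing tag finds it in UTF-8, misses it in UTF-16, and misses it again once the element carries a namespace prefix.

```python
import re

# A naive byte-level pattern that silently assumes an ASCII-compatible
# encoding and an unprefixed element name.
CLOSE_TAG = re.compile(rb"</Sedan>")

doc = "<Car><Sedan></Sedan></Car>"
utf8_bytes = doc.encode("utf-8")
utf16_bytes = doc.encode("utf-16")  # two bytes per character, plus a BOM

# The same element, but written with a namespace prefix.
prefixed_bytes = '<v:Sedan xmlns:v="uri:model1"></v:Sedan>'.encode("utf-8")

found_utf8 = CLOSE_TAG.search(utf8_bytes) is not None
found_utf16 = CLOSE_TAG.search(utf16_bytes) is not None
found_prefixed = CLOSE_TAG.search(prefixed_bytes) is not None

print(found_utf8, found_utf16, found_prefixed)
```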
Maybe one question remains: why use MapReduce for XML if we can't exploit
parallel processing?
Why use XML for your persistent format if you can only parse it through
a (stateful) recursive process, so limiting you to the bandwidth of your
parser accessing a single file?
- We can process multiple small XML files in parallel, one in each mapper,
without splitting, to mine and extract information for processing. But
we lose a good deal of data locality here.
No, you aggregate lots of small XML records into a HAR (Hadoop archive).