Review Request: Improve Scalability of the XMLLoader for large datasets such as wikipedia

Vivek Padmanabhan Sun, 20 Feb 2011 22:44:28 -0800

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/438/
-----------------------------------------------------------


Review request for pig.


Summary
-------

The current XMLLoader for Pig does not work well for large datasets such as 
the Wikipedia dataset. Each mapper reads in the entire XML file, resulting in 
extremely slow run times.

The patch addresses the following issues:
a) Marking the loader as splittable, except for gz formats
b) Changing XMLLoader to read only its split rather than the entire file
c) Handling cases where record boundaries do not align with split boundaries
d) Using CBZip2InputStream to handle bzip2 files
e) Improving the logic of collectTag (i.e., skipping unnecessary reads to find 
the end tag if no start tag is found)
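
To illustrate the split/record boundary handling and the collectTag 
shortcut, here is a minimal sketch. It is not the code from the patch. It 
assumes one common convention: a split owns every record whose start tag 
begins inside the split, and the reader may run past the split end to finish 
that record. The class name SplitRecordSketch and the method recordsForSplit 
are hypothetical, the input is an in-memory string rather than a stream, and 
tags are assumed to have no attributes and no nesting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the XMLLoader implementation.
final class SplitRecordSketch {

    // Returns the records owned by the split [start, end) of data.
    // Assumption: a record belongs to the split in which its start tag begins.
    static List<String> recordsForSplit(String data, String tag, int start, int end) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> records = new ArrayList<>();
        int pos = start;
        while (true) {
            // A streaming reader would stop scanning at the split end;
            // here indexOf searches the in-memory string and we check the bound after.
            int begin = data.indexOf(open, pos);
            if (begin < 0 || begin >= end) {
                // No start tag begins in this split, so do not look for an end tag.
                break;
            }
            // The end tag may lie beyond the split end; read past it to finish the record.
            int finish = data.indexOf(close, begin + open.length());
            if (finish < 0) {
                break; // unterminated record, nothing more to emit
            }
            finish += close.length();
            records.add(data.substring(begin, finish));
            pos = finish;
        }
        return records;
    }
}
```

With this rule, a record that straddles two splits is emitted by the split 
containing its start tag only, so adjacent splits neither drop nor duplicate 
it.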


This addresses bug PIG-1842.
    https://issues.apache.org/jira/browse/PIG-1842


Diffs
-----


Diff: https://reviews.apache.org/r/438/diff


Testing
-------


Thanks,

Vivek
