In Jira Pig-1077, the Zebra team plans to use Hadoop TFile's support for
splitting by record sequence number to provide record (row)-based input
splits in Zebra.
Along the way, we also plan to resolve a dependency issue: Zebra's
record-based split needs Hadoop TFile split support to work. Because of
this dependency, Zebra has to maintain its own copy of the Hadoop jar in
svn in order to build.
Furthermore, Zebra currently sits inside Pig in svn, and Pig itself
maintains its own copy of the Hadoop jar in its lib directory, which
makes things even messier. Finally, Zebra is new, is changing rapidly,
and needs new revisions quickly, while Hadoop and Pig are more mature,
move more slowly, and thus cannot make new releases every time Zebra
needs one.
After carefully thinking this through, we plan to fork the TFile part
off Hadoop and port it into Zebra's own code base. This will greatly
simplify Zebra's build process and also enable it to make
Finally, note that this is a short-term solution for Zebra. We plan to:
1) port all changes to Zebra TFile back into Hadoop TFile;
2) in the long run, have a single unified solution for this.
For more information, please see
We welcome your feedback on this.