Hi,

I am interested in implementing record-aware file splitting for Hadoop. I
am looking for someone who knows the Hadoop internals well and is willing
to discuss some details of how to accomplish this.

By "record-aware file splitting", I mean that I want to be able to put
files into hadoop with a custom InputFormat implementation, and hadoop
will split the files into blocks such that no record is split between
blocks.
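
To make the idea concrete, here is a rough sketch of what the InputFormat
side might look like, assuming (purely for illustration) that each record
is prefixed with a 4-byte length. Names like RecordAlignedInputFormat and
the record.aligned.split.size setting are just my placeholders, and this
only aligns the logical splits handed to map tasks; getting HDFS itself to
place physical block boundaries at the same offsets is the part I'd like
to discuss.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class RecordAlignedInputFormat
    extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Placeholder config key; target split size, default 64 MB.
    long targetSplitSize = job.getConfiguration()
        .getLong("record.aligned.split.size", 64L * 1024 * 1024);
    List<InputSplit> splits = new ArrayList<InputSplit>();

    for (FileStatus file : listStatus(job)) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      FSDataInputStream in = fs.open(path);
      try {
        long splitStart = 0;  // start of the split being built
        long pos = 0;         // current record boundary
        while (pos < file.getLen()) {
          int recordLength = in.readInt();   // assumed 4-byte length prefix
          long nextBoundary = pos + 4 + recordLength;
          // Close the current split once it reaches the target size,
          // but only ever at a record boundary.
          if (nextBoundary - splitStart >= targetSplitSize) {
            // A real implementation would pass block locations here
            // instead of an empty host list.
            splits.add(new FileSplit(path, splitStart,
                nextBoundary - splitStart, new String[0]));
            splitStart = nextBoundary;
          }
          pos = nextBoundary;
          in.seek(pos);   // skip over the record body
        }
        if (pos > splitStart) {  // whatever is left becomes the last split
          splits.add(new FileSplit(path, splitStart, pos - splitStart,
              new String[0]));
        }
      } finally {
        in.close();
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // A matching RecordReader would read whole length-prefixed records
    // from its split; omitted here to keep the sketch short.
    throw new UnsupportedOperationException("sketch only");
  }
}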

I believe that record-aware file splitting could offer considerable
speedup when dealing with large records (say, tens or hundreds of
megabytes per record), since it eliminates the need to stream part of a
record from one datanode to another when that record straddles a block
boundary.

(The motivation here is that large records occur commonly when dealing
with scientific datasets. Imagine, for example, a set of climate
simulation data, where each "record" consists of climate data over the
entire globe at a given time step. This is a huge amount of data per
record. Essentially, I want to modify Hadoop to work faster with large
scientific datasets.)

If you are interested in discussing this with me, I would love to talk
more with you.

Thanks!
Daren Hasenkamp
Computer Science/Applied Mathematics, UC Berkeley
Student Assistant, Lawrence Berkeley National Lab
