Hi, I am interested in implementing record-aware file splitting for Hadoop. I am looking for someone who knows the Hadoop internals well and is willing to discuss some of the details of how to accomplish this.
By "record-aware file splitting", I mean that I want to be able to put files into hadoop with a custom InputFormat implementation, and hadoop will split the files into blocks such that no record is split between blocks. I believe that record-aware file splitting could offer considerable speedup when dealing with large records--say, 10s or 100s of megabytes per record--since it eliminates the need to stream part of a record from one datanode to another when said record is split between block boundaries. (The motivation here is that large records occur commonly when dealing with scientific datasets. Imagine, for example, a set of climate simulation data, where each "record" consists of climate data over the entire globe at a given time step. This is a huge amount of data per record. Essentially, I want to modify Hadoop to work faster with large scientific datasets.) If you are interested in discussing this with me, I would love to talk more with you. Thanks! Daren Hasenkamp Computer Science/Applied Mathematics, UC Berkeley Student Assistant, Lawrence Berkeley National Lab