Hi all,

I just started using the Hadoop DFS last night, and it has already solved a big performance problem we were having with throughput from our shared NFS storage. Thanks to everyone who has contributed to that project.

I wrote my own MapReduce implementation, because I needed two features that Hadoop didn't have: Grid Engine integration and easy record I/O (described below). I'm writing this message to see if you're interested in these ideas for Hadoop, and to see what ideas I might learn from you.

Grid Engine: All the machines available to me run Sun's Grid Engine for job submission. Grid Engine is important for us, because it makes sure that all of the users of a cluster get their fair share of resources--as far as I can tell, the JobTracker assumes that one user owns the machines. Is this shared scenario one you're interested in supporting? Would you consider supporting job submission systems like Grid Engine or Condor?

Record I/O: My implementation is something like the org.apache.hadoop.record implementation, but with a couple of twists. In my implementation, you give the system a simple Java class, like this:

public class WordCount {
        public String word;
        public long count;
}

and my TypeBuilder class generates code for all possible orderings of this class (order by word, order by count, order by word then count, order by count then word). Each ordering has its own hash function and comparator.
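
To give a flavor of what two of these orderings amount to, here is a hand-written sketch. It is not the actual TypeBuilder output: the WordCountOrders class, its method names, and the particular hash combination are made up for illustration.

```java
import java.util.Comparator;

// Same shape as the WordCount class above (package-private for the sketch).
class WordCount {
    public String word;
    public long count;
}

// Hypothetical stand-in for generated code; all names here are illustrative.
class WordCountOrders {
    // Convenience factory so the sketch is easy to try out.
    static WordCount of(String word, long count) {
        WordCount wc = new WordCount();
        wc.word = word;
        wc.count = count;
        return wc;
    }

    // "Order by word": compare and hash on the word field only.
    static final Comparator<WordCount> BY_WORD =
        (a, b) -> a.word.compareTo(b.word);

    static int hashByWord(WordCount wc) {
        return wc.word.hashCode();
    }

    // "Order by count then word": count decides first, word breaks ties.
    static final Comparator<WordCount> BY_COUNT_THEN_WORD = (a, b) -> {
        int byCount = Long.compare(a.count, b.count);
        return byCount != 0 ? byCount : a.word.compareTo(b.word);
    };

    static int hashByCountThenWord(WordCount wc) {
        return 31 * Long.hashCode(wc.count) + wc.word.hashCode();
    }
}
```

The point of pairing each comparator with its own hash is that two records that compare equal under an ordering also hash the same under it, so they can be routed to the same place.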

In addition, each ordering has its own serialization/deserialization code. For example, if we order by count, the serialization code stores only differences between adjacent counts to help with compression.
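
As a rough illustration of the difference idea, the sketch below stores each count as the gap from the previous one. The DeltaCounts class is hypothetical, and the variable-length byte encoding of each gap is my assumption for the example; the only part taken from the description above is that adjacent differences are stored instead of full values.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of delta-encoding counts that are already sorted.
class DeltaCounts {
    // Encodes non-decreasing counts as gaps, each gap as a base-128 varint
    // (7 payload bits per byte, high bit set when more bytes follow).
    static byte[] encode(long[] sortedCounts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long count : sortedCounts) {
            long delta = count - previous; // non-negative for sorted input
            while ((delta & ~0x7FL) != 0) {
                out.write((int) ((delta & 0x7F) | 0x80));
                delta >>>= 7;
            }
            out.write((int) delta);
            previous = count;
        }
        return out.toByteArray();
    }

    // Reverses encode(); n is the number of counts that were written.
    static long[] decode(byte[] data, int n) {
        long[] counts = new long[n];
        int pos = 0;
        long previous = 0;
        for (int i = 0; i < n; i++) {
            long delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            counts[i] = previous;
        }
        return counts;
    }
}
```

With sorted input the gaps are small, so most of them fit in a single byte where a raw long would take eight.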

All this code is grouped into an Order object, which is accessed like this:
        String[] fields = { "word" };
        Order<WordCount> order = (new WordCountType()).getOrder( fields );
This order object contains a hash function, a comparator, and serialization logic for ordering WordCount objects by word.
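
For a sense of how such an object might be used in a MapReduce setting, here is a small sketch. The Order interface shown is a guess at a plausible shape, not my real one (it leaves out serialization entirely), and OrderSketch, its getOrder stand-in, and partition are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical shape of the Order abstraction; serialization is omitted.
interface Order<T> {
    Comparator<T> comparator();
    int hash(T record);
}

class OrderSketch {
    static class WordCount {
        public String word;
        public long count;
    }

    // Hand-written stand-in for getOrder(fields); supports only { "word" }.
    static Order<WordCount> getOrder(String[] fields) {
        if (!Arrays.equals(fields, new String[] { "word" })) {
            throw new IllegalArgumentException("sketch only orders by word");
        }
        return new Order<WordCount>() {
            public Comparator<WordCount> comparator() {
                return (a, b) -> a.word.compareTo(b.word);
            }
            public int hash(WordCount record) {
                return record.word.hashCode();
            }
        };
    }

    // Picks a reducer for a record: equal keys always land on the same one.
    static <T> int partition(Order<T> order, T record, int reducers) {
        return Math.floorMod(order.hash(record), reducers);
    }

    // Sorts records in place using the ordering's comparator.
    static <T> void sort(Order<T> order, T[] records) {
        Arrays.sort(records, order.comparator());
    }
}
```

The hash decides which reducer sees a record and the comparator decides the order in which that reducer sees it, which is why it is convenient to hand both around as one object.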

Is this code you'd be interested in?

Thanks,

Trevor

(by the way, Doug, you may remember me from a panel at the OSIR workshop this year on open source search)
