No, even with user-defined Splits we don't need to run user code in the JobTracker if we make Split a Writable class that has the hosts array.
Split will write the hosts first, so in the JobTracker, when you get the byte array representing the Split, any fields from the subclass will follow the serialized bytes of Split. The JobTracker can skip the type in the bytes representing the serialized Split and then deserialize just a Split (ignoring the rest). You can make this process robust by putting a fingerprint at the beginning and end of the serialized part of Split, so that you can detect user-defined Splits that change the serialization order.

(This is another example of why Writable is cooler than Serializable. It would be really hard to deserialize just a superclass from a serialized subclass using Java serialization.)

You would ship the full byte array to the task trackers so that the InputFormats running in the Child processes can deserialize the full type.

ben

Owen O'Malley wrote:
>
> On Sep 29, 2006, at 12:20 AM, Benjamin Reed wrote:
>
>> Please correct me if I'm reading the code incorrectly, but it seems
>> like submitJob puts the submitted job on the jobInitQueue which is
>> immediately dequeued by the JobInitThread and then initTasks() will get
>> the file splits and create Tasks. Thus, it doesn't seem like there is
>> any difference in memory footprint.
>
> Agreed, it won't cost more memory. In fact, it will be less because we
> won't have the init task thread running and creating InputFormats and
> running user code. Of course, once we allow user-defined InputSplits
> we will be back in exactly the same boat of running user-code on the
> JobTracker, unless we also ship over the preferred hosts for each
> InputFormat too.
>
> -- Owen
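
P.S. A minimal sketch of the scheme described in the reply above, in case it helps to see it concretely. It is not the Hadoop API: it uses only java.io, and every name in it (SplitPrefixSketch, UserFileSplit, readHostsOnly, readFull, the two fingerprint constants, and the choice of writing the type name with writeUTF) is a hypothetical stand-in. The base Split writes its hosts first, bracketed by fingerprints; the JobTracker side skips the type name and reads only that prefix; the task side instantiates the named type and reads everything.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class SplitPrefixSketch {
    // Hypothetical fingerprints bracketing the base-class fields.
    static final int FINGERPRINT_START = 0x53504C54;
    static final int FINGERPRINT_END = 0x544C5053;

    /** Base split: serializes its hosts first, between two fingerprints. */
    static class Split {
        String[] hosts = new String[0];

        void write(DataOutput out) throws IOException {
            out.writeInt(FINGERPRINT_START);
            out.writeInt(hosts.length);
            for (String host : hosts) {
                out.writeUTF(host);
            }
            out.writeInt(FINGERPRINT_END);
        }

        void readFields(DataInput in) throws IOException {
            if (in.readInt() != FINGERPRINT_START) {
                throw new IOException("Split prefix missing: subclass changed the order?");
            }
            hosts = new String[in.readInt()];
            for (int i = 0; i < hosts.length; i++) {
                hosts[i] = in.readUTF();
            }
            if (in.readInt() != FINGERPRINT_END) {
                throw new IOException("Split prefix corrupt: end fingerprint not found");
            }
        }
    }

    /** Hypothetical user-defined split: its own fields follow the base bytes. */
    static class UserFileSplit extends Split {
        String path = "";
        long start;
        long length;

        @Override
        void write(DataOutput out) throws IOException {
            super.write(out); // hosts first
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
        }

        @Override
        void readFields(DataInput in) throws IOException {
            super.readFields(in);
            path = in.readUTF();
            start = in.readLong();
            length = in.readLong();
        }
    }

    /** Client side: type name, then the split's own bytes. */
    static byte[] serialize(Split split) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(split.getClass().getName());
        split.write(out);
        out.flush();
        return bytes.toByteArray();
    }

    /** JobTracker side: skip the type name, read only the base Split. */
    static String[] readHostsOnly(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readUTF(); // type name is skipped, so no user class is loaded
        Split base = new Split();
        base.readFields(in); // trailing subclass bytes are ignored
        return base.hosts;
    }

    /** Task side: instantiate the named type and read every field. */
    static Split readFull(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        String typeName = in.readUTF();
        try {
            Split split = Class.forName(typeName).asSubclass(Split.class)
                    .getDeclaredConstructor().newInstance();
            split.readFields(in);
            return split;
        } catch (ReflectiveOperationException e) {
            throw new IOException("cannot instantiate " + typeName, e);
        }
    }

    public static void main(String[] args) throws IOException {
        UserFileSplit split = new UserFileSplit();
        split.hosts = new String[] {"node1", "node2"};
        split.path = "/data/part-0";
        split.length = 1024L;
        byte[] data = serialize(split);
        System.out.println(String.join(",", readHostsOnly(data)));
    }
}
```

The JobTracker would call only readHostsOnly, so it never touches user classes, while the same byte array goes unchanged to the task trackers for readFull. A subclass that writes its own fields before calling super.write would most likely fail the start-fingerprint check rather than hand back garbage hosts.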