> > For example, in your last MapReduce (MAPREDUCE-980) patch you
> > added avro and paranamer as dependencies.
>
> If I'm not mistaken, that only adds a dependency to the JobTracker. We
> don't create different classpaths for daemons than for user code, but we
> probably should, so that things that only the daemon uses are not also
> placed on the user's classpath.

+1 to separate classpaths for daemons as an eventual goal. As a user, I've definitely lost an afternoon to commons-lang version mismatches. If we can add fewer things to the Task classpath, that's fewer potential future lost afternoons.
I'm not a PMC member, but I suspect I spend more time doing user-level grunt work than many PMC members, so from that perspective:

On internal-to-Hadoop serialization: I'm going to spend 99% of my time not caring about these formats, and the other 1% needing to know what's going on with them *immediately*. Right now I know nothing about protobuf. Learning new things is always great, but "while my production job is broken and I'm trying to debug it" isn't really the best time and place for it. JSON, on the other hand, is human-readable and never going to change. I feel a lot safer with JSON than with any binary format, especially considering that in a couple of years we could all be using NewHawtUnforeseenLibrary or IncompatibleWithPreviousReleaseLibrary for our binary serialization.

On packaging serialization library dependencies: Again, additional versioned dependencies on the Task classpath scare me, and that goes double for serialization. I can see a couple of ways around it that fall prey to the inner-framework antipattern, and for what it's worth, I'd be willing to accept that additional kludginess if it meant I wasn't strictly dependent on avro x.x or thrift y.y. What if I'm reading a file that was encoded with an incompatible version? This gets way out of scope for the immediate issue, but if I could ship my own serialization library in an assembly jar, and maybe override an additional method or supply a MapOutputEncoder or something, I'd take that tradeoff over being bound to a particular version until the next Hadoop release. If there were sensible defaults in place, it might not even mean more complexity for the average job.
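To make the debuggability point concrete, here's a toy sketch (not Hadoop code; the record fields and the `job_...` id are made up for illustration) contrasting a JSON-encoded record with the same fields in a hand-rolled binary layout. The JSON can be inspected with cat/less when something breaks; the binary bytes mean nothing without the exact schema and library version that wrote them:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class SerializationDebuggability {
    // Human-readable encoding: a broken record can be eyeballed directly.
    static String toJson(String jobId, int retries) {
        return String.format("{\"jobId\":\"%s\",\"retries\":%d}", jobId, retries);
    }

    // Opaque encoding: the same data as raw bytes (2-byte length prefix from
    // writeUTF, then the string, then a 4-byte int) -- unreadable without
    // knowing the exact layout that produced it.
    static byte[] toBinary(String jobId, int retries) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(jobId);
        out.writeInt(retries);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("JSON:   " + toJson("job_200910_0042", 3));
        System.out.println("Binary: " + Arrays.toString(toBinary("job_200910_0042", 3)));
    }
}
```

Both encodings carry identical information; the difference only shows up at 3 a.m. when one of them is sitting in a log and you need to know what it says.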
