Andrzej Bialecki wrote:
If you consider users that collected terabytes of data using 0.6.1, there must be a way for them to upgrade this data to whatever release comes next. My thinking was that if we have a release that contains both UTF8 and Text, we could write a converter, to be included in application packages e.g. in Nutch in this specific release only.

UTF8 is still there, it's just deprecated and no longer the default.

Let's say I have data in SequenceFile-s and MapFile-s using 0.5.x formats. How would I go about converting them from UTF8 to Text? Would the current code read the data produced by 0.5.x code?

SequenceFiles and MapFiles with UTF8 data can still be written and read just fine, since SequenceFile names the classes of its keys and values in the file header. SequenceFile's format has changed, but the change is back-compatible, i.e., 0.6 can read sequence files written by 0.5, but not vice-versa.

Converting the data will require code changes, and code changes may also be required to get things to run correctly, since TextInputFormat now returns Text instances rather than UTF8 instances. Code changes are also required for anything that relied on the default value types (mostly things that also used TextInputFormat).
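To give a feel for what a per-record converter would have to do, here is a pure-JDK sketch (not Hadoop code). It assumes the old UTF8 records use the same layout as DataOutputStream.writeUTF, i.e., a two-byte length followed by Java's modified UTF-8, which is effectively what the deprecated UTF8 class wrote; the result is the standard UTF-8 byte array that a Text instance would hold. The class and method names here are illustrative, not Hadoop's.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative per-record conversion: decode an old writeUTF-style
// record and re-encode its contents as standard UTF-8 bytes.
public class RecordConverter {

    // Serialize a String the way DataOutputStream.writeUTF does
    // (two-byte length prefix plus modified UTF-8).
    static byte[] writeOldRecord(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode an old record and return the standard UTF-8 bytes,
    // i.e., the representation a Text value would store.
    static byte[] toStandardUtf8(byte[] oldRecord) {
        try {
            String s = new DataInputStream(
                new ByteArrayInputStream(oldRecord)).readUTF();
            return s.getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] old = writeOldRecord("caf\u00e9");
        byte[] utf8 = toStandardUtf8(old);
        // 'c', 'a', 'f' are one byte each; e-acute takes two.
        System.out.println(utf8.length); // 5
    }
}
```

A real converter would wrap this logic in a loop over a SequenceFile.Reader, writing each converted key and value through a SequenceFile.Writer declared with Text as the key and value classes.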

We could find no way to seamlessly upgrade UTF8 so that it was no longer limited to less than 64k bytes, so we decided it was better to make a clear break. This also permitted us to change to using real UTF-8, rather than Java's modified UTF-8 encoding. (http://issues.apache.org/jira/browse/HADOOP-302)

So should we revert these in 0.6?

This looks like a lot of work ... perhaps we should just burn bridges, and make a 0.7.0 at this point, because it's definitely not API compatible with 0.6.1.

But 0.6.0 has not yet been a stable release that folks could use.

As for Nutch ... it could be upgraded to 0.6.1. On the other hand, Nutch is not compatible with 0.6.1 either, so perhaps it should be upgraded to 0.7.0 (plus a suitable converter for existing data).

I think we should release 0.6.2 with this patch, and update Nutch to use that release. In general, we should probably not update Nutch until Hadoop releases are stable, which sometimes takes a week. This is isomorphic to calling it 0.7.0, but is more consistent with our monthly releases with bugfix point releases in the first week. With a monthly release schedule we cannot afford to do a lot of testing (alphas, betas, etc.) before releases are made. If an incompatible change is half-completed, then I think it's reasonable to complete it as a bugfix rather than force a new major version number.

Doug
