Doug Cutting wrote:
Andrzej Bialecki wrote:
Shouldn't such changes be reserved for major releases, i.e. for 0.7?
Nutch relies heavily on UTF8 being the default; this change will make
it more difficult to upgrade it to 0.6.2.
Good question. I think the intent was to switch as much as possible
from UTF8 to Text in 0.6. Lots of things were switched, but these
defaults were missed. So I was considering 0.6 the major release that
contains the change from UTF8 to Text in public APIs.
Hmm. Without having at least one official release where we have both
UTF8 and Text, and the API is compatible, there will be no easy way to
upgrade existing data. The latest release to offer this is 0.6.1, if I'm
not mistaken - or perhaps 0.5.x, if we consider changes to SequenceFile
format...?
If you consider users that collected terabytes of data using 0.6.1,
there must be a way for them to upgrade this data to whatever release
comes next. My thinking was that if we have a release that contains both
UTF8 and Text, we could write a converter, to be included in application
packages e.g. in Nutch in this specific release only.
Let's say I have data in SequenceFile-s and MapFile-s using 0.5.x
formats. How would I go about converting them from UTF8 to Text? Would
the current code read the data produced by 0.5.x code?
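
A minimal sketch of the kind of one-off converter described above, using only the JDK so it runs without Hadoop on the classpath. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the "legacy" record is modelled with DataOutput.writeUTF (a 2-byte length followed by Java's modified UTF-8), and the "new" record is modelled as a 4-byte length followed by standard UTF-8. It is not the actual UTF8 or Text wire format and it does not read SequenceFile or MapFile containers; a real converter would need a release that ships both classes so that records can be read with one and rewritten with the other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: re-encode one string record from a legacy layout
// to a new layout. Neither layout is claimed to match Hadoop's on-disk format.
public class Utf8ToTextSketch {

    // Assumed legacy layout: 2-byte length + modified UTF-8 (DataOutput.writeUTF).
    static byte[] writeLegacy(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decode the legacy record, then write it in the assumed new layout:
    // 4-byte length + standard UTF-8 bytes.
    static byte[] convert(byte[] legacy) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(legacy));
             DataOutputStream out = new DataOutputStream(buf)) {
            byte[] utf8 = in.readUTF().getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Read a converted record back, to check that the round trip is lossless.
    static String readConverted(byte[] converted) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(converted))) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        byte[] converted = convert(writeLegacy(original));
        System.out.println(original.equals(readConverted(converted)));
    }
}
```

The point of the sketch is only that conversion is a decode-then-re-encode pass per record, which is why both the old reader and the new writer have to be available in the same release.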
Right now, in 0.6, the default input format is not consistent
(TextInputFormat now returns Text, not UTF8). In our current monthly
release strategy, the .0 releases are effectively alphas, candidates
that sometimes are good enough to become the final release, and
sometimes require point releases.
A consistent alternative might be to revert other places where UTF8
was changed to Text.
http://issues.apache.org/jira/browse/HADOOP-450 (TextInputFormat)
http://issues.apache.org/jira/browse/HADOOP-499 (contrib/streaming)
http://issues.apache.org/jira/browse/HADOOP-460 (smallJobsBenchmark)
So should we revert these in 0.6?
This looks like a lot of work ... perhaps we should just burn bridges,
and make a 0.7.0 at this point, because it's definitely not API
compatible with 0.6.1.
As for Nutch ... it could be upgraded to 0.6.1. On the other hand, Nutch
is not compatible with 0.6.1 either, so perhaps it should be upgraded to
0.7.0 (plus a suitable converter for existing data).
I hate incompatible changes, but didn't see a way to make this change
compatibly, yet it seems like a good change. What do you think?
I propose to skip 0.6.2, and go directly to 0.7.0. And I would
appreciate any insights into the above questions about converting old
data ...
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com