Johannes Herr created HADOOP-10921:
--------------------------------------
Summary: MapFile.fix fails silently when file is block compressed
Key: HADOOP-10921
URL: https://issues.apache.org/jira/browse/HADOOP-10921
Project: Hadoop Common
Issue Type: Bug
Affects Versions: 0.20.2
Reporter: Johannes Herr
MapFile provides a method 'fix' to reconstruct missing 'index' files. If the
'data' file is block compressed, the method computes offsets that are too
large, which leads to keys not being found in the MapFile. (See the attached
test case.)
Tested against 0.20.2, but the trunk version appears to have the same problem.
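A reproduction along the lines of the attached test case looks roughly like
this (the path, entry count and key format here are illustrative, not taken
from the attachment):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MapFileFixRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path map = new Path("/tmp/testmap");

    // Write a block-compressed MapFile with enough entries to span several blocks.
    MapFile.Writer writer = new MapFile.Writer(conf, fs, map.toString(),
        Text.class, Text.class, SequenceFile.CompressionType.BLOCK);
    for (int i = 0; i < 100000; i++) {
      writer.append(new Text(String.format("key%06d", i)), new Text("value" + i));
    }
    writer.close();

    // Delete the index and let MapFile.fix() rebuild it.
    fs.delete(new Path(map, MapFile.INDEX_FILE_NAME), false);
    MapFile.fix(fs, map, Text.class, Text.class, false, conf);

    // Lookups now miss keys that are present in the 'data' file, because the
    // rebuilt index entries point past the blocks containing the indexed keys.
    MapFile.Reader reader = new MapFile.Reader(fs, map.toString(), conf);
    System.out.println(reader.get(new Text("key050000"), new Text())); // often null
    reader.close();
  }
}
{code}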
The cause of the problem is that 'dataReader.getPosition()' is used to
determine the offset to write for the next entry to be indexed. When the file
is block compressed, however, 'dataReader.getPosition()' seems to return the
position of the next compressed block, not of the block that contains the last
entry. This position will therefore be too large in most cases, and a seek
with this offset will incorrectly report the key as not present.
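The behaviour can be seen directly by scanning the block-compressed 'data'
file and printing getPosition() for each entry (reusing fs, map and conf from
the snippet above): the same position is reported for every entry served from
a buffered block, and it already points at the following block.
{code:java}
SequenceFile.Reader dataReader = new SequenceFile.Reader(fs,
    new Path(map, MapFile.DATA_FILE_NAME), conf);
Text key = new Text();
Text value = new Text();
while (dataReader.next(key, value)) {
  // For all entries of a buffered block the same position is printed,
  // and it is the offset *after* that block, i.e. of the next one.
  System.out.println(key + "\t" + dataReader.getPosition());
}
dataReader.close();
{code}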
I think it's not obvious how to fix this, since the SequenceFile reader does
not expose the offset of the currently buffered entries. I've experimented
with watching the offset change, and that mostly works, but it is quite ugly
and not exact in edge cases.
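The experiment was roughly along these lines (only an illustration, reusing
the reader variables from the previous snippet; as said, it is not exact in
edge cases):
{code:java}
// When getPosition() jumps, the entry just returned came from a freshly
// loaded block, and that block starts roughly where the stream stood before
// the jump.
long blockStart = dataReader.getPosition(); // right after the header
long lastPos = blockStart;
while (dataReader.next(key, value)) {
  long pos = dataReader.getPosition();
  if (pos != lastPos) {
    blockStart = lastPos; // a new block was pulled in to serve this entry
    lastPos = pos;
  }
  // 'blockStart' is the offset one would want to index for 'key'
}
{code}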
The method should probably throw an exception when the 'data' file is block
compressed instead of silently creating invalid files. A workaround for
block-compressed files is to read the sequence file, write the entries to a
new MapFile and then replace the old file; this also avoids the problems
mentioned below.
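Sketched (paths illustrative, Text keys and values assumed):
{code:java}
// Rebuild the whole MapFile instead of only the index.
Path broken = new Path("/tmp/testmap");
Path rebuilt = new Path("/tmp/testmap.rebuilt");

SequenceFile.Reader in = new SequenceFile.Reader(fs,
    new Path(broken, MapFile.DATA_FILE_NAME), conf);
MapFile.Writer out = new MapFile.Writer(conf, fs, rebuilt.toString(),
    Text.class, Text.class, SequenceFile.CompressionType.BLOCK);
Text k = new Text();
Text v = new Text();
while (in.next(k, v)) {
  out.append(k, v); // entries in the 'data' file are already sorted by key
}
out.close();
in.close();

// Swap the rebuilt MapFile in for the broken one.
fs.delete(broken, true);
fs.rename(rebuilt, broken);
{code}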
A few side notes:
1. The 'index' files created by the fix method are not block compressed
(whereas the 'index' files created by the MapFile Writer always are, since the
'index' file is read completely anyway).
2. The fix method does not index the first entry, whereas the Writer does.
3. The header offset is not used.