[
https://issues.apache.org/jira/browse/HADOOP-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Johannes Herr updated HADOOP-10921:
-----------------------------------
Attachment: FixMapFileTest.java
> MapFile.fix fails silently when file is block compressed
> --------------------------------------------------------
>
> Key: HADOOP-10921
> URL: https://issues.apache.org/jira/browse/HADOOP-10921
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: Johannes Herr
> Attachments: FixMapFileTest.java
>
>
> MapFile provides a method 'fix' to reconstruct missing 'index' files. If the
> 'data' file is block compressed, the method will compute offsets that are too
> large, which will lead to keys not being found in the mapfile. (See the
> attached test case.)
> Tested against 0.20.2 but the trunk version looks like it has the same
> problem.
> The cause of the problem is that 'dataReader.getPosition()' is used to find
> the offset to write for the next entry that should be indexed. When the file
> is block compressed, however, 'dataReader.getPosition()' seems to return the
> position of the next compressed block, not of the block that contains the
> last entry. This position will thus be too large in most cases, and a seek
> operation with this offset will incorrectly report the key as not present.
> I think it's not obvious how to fix it, since the SequenceFile-Reader does not
> provide the offset of the currently buffered entries. I've experimented with
> watching the offset change and that seems to work mostly, but is quite ugly
> and not exact in edge cases.
> The method should probably throw an exception when the 'data' file is block
> compressed instead of silently creating invalid files. A workaround for block
> compressed files is to read the sequence file and write the entries to a new
> mapfile and then replace the old file. This also avoids the problems
> mentioned below.
> A few side notes:
> 1. The 'index' files created by the fix-method are not block compressed
> (which the 'index' files created by MapFile Writer always are, since the
> 'index' file is read completely anyway).
> 2. The fix method does not index the first entry, the Writer does.
> 3. The header offset is not used.
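
The offset problem described above can be illustrated with a small toy model.
This is only a sketch under stated assumptions, not Hadoop code: the class
BlockOffsetSketch, its methods, the block layout and the byte offsets are all
invented for illustration. It assumes that reading the first entry of a block
moves the reported stream position to the start of the next block, which is
what the report suggests 'dataReader.getPosition()' does for block compressed
files.

```java
import java.util.Map;
import java.util.TreeMap;

public class BlockOffsetSketch {
    // Toy "data file": three compressed blocks at invented byte offsets,
    // each holding a few sorted keys. None of this is Hadoop code.
    static final long[] BLOCK_START = {100, 400, 700};
    static final long FILE_END = 1000;
    static final String[][] BLOCK_KEYS = {
        {"a", "b", "c"}, {"d", "e", "f"}, {"g", "h", "i"}
    };

    /**
     * Indexes every key. With useBlockStart=false the offset is the stream
     * position observed just before the entry is read (the behaviour the
     * report attributes to 'fix'); with true it is the start of the block
     * that actually contains the entry.
     */
    public static TreeMap<String, Long> buildIndex(boolean useBlockStart) {
        TreeMap<String, Long> index = new TreeMap<>();
        long streamPos = BLOCK_START[0]; // just past the header, nothing buffered
        for (int b = 0; b < BLOCK_KEYS.length; b++) {
            long nextBlock = (b + 1 < BLOCK_START.length) ? BLOCK_START[b + 1] : FILE_END;
            for (String key : BLOCK_KEYS[b]) {
                long posBeforeRead = streamPos;
                // Assumption: the first read of a block buffers the whole block,
                // so the stream position jumps to the next block.
                streamPos = nextBlock;
                index.put(key, useBlockStart ? BLOCK_START[b] : posBeforeRead);
            }
        }
        return index;
    }

    /** Seeks to the indexed offset and scans forward, as a map lookup would. */
    public static boolean lookup(TreeMap<String, Long> index, String key) {
        Map.Entry<String, Long> entry = index.floorEntry(key);
        long from = (entry == null) ? BLOCK_START[0] : entry.getValue();
        for (int b = 0; b < BLOCK_START.length; b++) {
            if (BLOCK_START[b] < from) {
                continue; // blocks before the seek target are never read
            }
            for (String k : BLOCK_KEYS[b]) {
                int cmp = k.compareTo(key);
                if (cmp == 0) return true;
                if (cmp > 0) return false; // passed the place where the key would be
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TreeMap<String, Long> naive = buildIndex(false);
        TreeMap<String, Long> exact = buildIndex(true);
        for (String key : new String[] {"a", "b", "d", "e"}) {
            System.out.println(key
                + ": naive offset=" + naive.get(key) + " found=" + lookup(naive, key)
                + ", block-start offset=" + exact.get(key) + " found=" + lookup(exact, key));
        }
    }
}
```

In this model only the first key of each block gets a usable offset from the
naive approach; every other key is indexed at the following block and is
reported as missing. Recording the block start avoids this, but as noted in
the report, the SequenceFile reader does not provide that offset directly.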