[ 
https://issues.apache.org/jira/browse/HADOOP-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Herr updated HADOOP-10921:
-----------------------------------

    Attachment: FixMapFileTest.java

> MapFile.fix fails silently when file is block compressed
> --------------------------------------------------------
>
>                 Key: HADOOP-10921
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10921
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Johannes Herr
>         Attachments: FixMapFileTest.java
>
>
> MapFile provides a method 'fix' to reconstruct missing 'index' files. If the 
> 'data' file is block compressed, the method computes offsets that are too 
> large, so keys that are present are not found in the mapfile. (See the 
> attached test case.)
> Tested against 0.20.2 but the trunk version looks like it has the same 
> problem.
> The cause of the problem is that 'dataReader.getPosition()' is used to find the 
> offset to write for the next entry that should be indexed. When the file is 
> block compressed, however, 'dataReader.getPosition()' seems to return the 
> position of the next compressed block, not of the block that contains the last 
> entry. This position will thus be too large in most cases, and a seek operation 
> with this offset will incorrectly report the key as not present.
> I think it's not obvious how to fix this, since SequenceFile.Reader does not 
> provide the offset of the currently buffered entries. I've experimented with 
> watching the offset change, and that seems to work mostly, but it is quite ugly 
> and not exact in edge cases.
> The method should probably throw an exception when the 'data' file is block 
> compressed instead of silently creating invalid files. A workaround for block 
> compressed files is to read the sequence file and write the entries to a new 
> mapfile and then replace the old file. This also avoids the problems 
> mentioned below.
> A few side notes: 
> 1. The 'index' files created by the fix method are not block compressed, 
> whereas the 'index' files created by the MapFile Writer always are (since the 
> 'index' file is read completely anyway).
> 2. The fix method does not index the first entry, whereas the Writer does.
> 3. The header offset is not used.
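
To make the offset problem described above concrete, here is a small toy model in plain Java. It does not use Hadoop at all: `Block`, `ToyReader`, `buildIndex` and `lookup` are hypothetical names invented for this sketch, and the indexing loop is only an assumption about the general shape of such a loop, not a copy of MapFile.fix. The one behaviour it models is the one reported above: once a block is buffered, the reader's position already points at the next block. The `watchOffset` flag switches to the "watch the offset change" idea mentioned in the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: none of these classes are Hadoop APIs. */
public class BlockOffsetModel {

    /** Toy block: byte range [start, end) plus the sorted keys stored in it. */
    static final class Block {
        final long start, end;
        final List<String> keys;
        Block(long start, long end, String... keys) {
            this.start = start; this.end = end; this.keys = Arrays.asList(keys);
        }
    }

    /** Toy reader: its position jumps to the block end as soon as a block is buffered. */
    static final class ToyReader {
        private final List<Block> blocks;
        private int blockIdx;   // block being read, or the next one to load
        private int keyIdx;     // next key inside that block
        private long position;  // stands in for getPosition()

        ToyReader(List<Block> blocks, long offset) {
            this.blocks = blocks;
            this.position = offset;
            // Seeking lands on the first block that starts at or after the offset.
            while (blockIdx < blocks.size() && blocks.get(blockIdx).start < offset) blockIdx++;
        }

        String next() {
            if (blockIdx >= blocks.size()) return null;
            Block b = blocks.get(blockIdx);
            if (keyIdx == 0) position = b.end;  // the whole block is consumed at once
            String key = b.keys.get(keyIdx++);
            if (keyIdx == b.keys.size()) { blockIdx++; keyIdx = 0; }
            return key;
        }

        long getPosition() { return position; }
    }

    /** Indexes every interval-th key; watchOffset selects the offset-watching variant. */
    static TreeMap<String, Long> buildIndex(List<Block> blocks, int interval, boolean watchOffset) {
        TreeMap<String, Long> index = new TreeMap<>();
        ToyReader reader = new ToyReader(blocks, 0);
        long pos = 0;         // position seen before reading the current entry
        long blockStart = 0;  // last position the reader moved away from
        int count = 0;
        String key;
        while ((key = reader.next()) != null) {
            if (reader.getPosition() != pos) blockStart = pos;  // a new block was loaded
            count++;
            if (count % interval == 0) index.put(key, watchOffset ? blockStart : pos);
            pos = reader.getPosition();
        }
        return index;
    }

    /** Seeks to the indexed offset at or before the target key, then scans forward. */
    static boolean lookup(List<Block> blocks, TreeMap<String, Long> index, String target) {
        Map.Entry<String, Long> entry = index.floorEntry(target);
        long offset = (entry == null) ? 0 : entry.getValue();
        ToyReader reader = new ToyReader(blocks, offset);
        String key;
        while ((key = reader.next()) != null) {
            int cmp = key.compareTo(target);
            if (cmp == 0) return true;
            if (cmp > 0) return false;  // scanned past where the key would be
        }
        return false;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(0, 100, "a", "b", "c"),
            new Block(100, 200, "d", "e", "f"));
        TreeMap<String, Long> naive = buildIndex(blocks, 2, false);
        TreeMap<String, Long> watched = buildIndex(blocks, 2, true);
        for (String k : new String[] {"a", "b", "c", "d", "e", "f"}) {
            System.out.println(k + ": naive=" + lookup(blocks, naive, k)
                + " watched=" + lookup(blocks, watched, k));
        }
    }
}
```

In this model the naive index maps 'b' to offset 100 although 'b' lives in the block starting at 0, so lookups for 'b', 'c' and 'f' fail even though the keys exist. The offset-watching variant finds all six keys here, but the model is far simpler than a real block-compressed SequenceFile, so it says nothing about the edge cases mentioned in the description.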



--
This message was sent by Atlassian JIRA
(v6.2#6252)
