[ 
https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804715#action_12804715
 ] 

Hudson commented on HADOOP-6290:
--------------------------------

Integrated in Hadoop-Common-trunk #229 (See 
[http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/229/])
    

> AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-6290
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Erik Forsberg
>
> When running a streaming job whose input directory contains a few .bzip2 
> files, each roughly 110MiB in size (compressed), with -inputformat
> org.apache.hadoop.streaming.AutoInputFormat on the streaming command line, 
> each file is processed twice, i.e., if there are two bzip2 files in the 
> directory, four mappers will be run. 
> When running a wordcount M/R job, the resulting count is doubled, which 
> indicates that each input file is analysed twice.
> This was discovered while trying out dumbo, which uses AutoInputFormat by 
> default. See 
> http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en
> It seems this can't be reproduced on small files. It is possible the file has 
> to be larger than the DFS blocksize, in my case set to 64MiB.
> I'm using Cloudera's hadoop distribution, version 
> 0.18.3-6cloudera0.3.0~intrepid.
> Please let me know if I need to provide further details.
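
The block-size hypothesis in the description is consistent with simple arithmetic: a file of roughly 110MiB on a DFS with 64MiB blocks spans two blocks, so two such files would yield four splits and hence four mappers. Below is a minimal sketch of that estimate. It assumes a simplified model of one split per block-sized chunk; the helper name estimated_splits is hypothetical, and Hadoop's real split computation is not reproduced here.

```python
import math

MIB = 1024 * 1024


def estimated_splits(file_size_bytes, block_size_bytes):
    """Simplified estimate (an assumption, not Hadoop's actual logic):
    one split per block-sized chunk of the file, and at least one split."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))


block_size = 64 * MIB            # DFS block size from the report
files = [110 * MIB, 110 * MIB]   # two compressed files, roughly 110MiB each

splits = [estimated_splits(size, block_size) for size in files]

# Total mappers if every split is handed to its own mapper.
print(sum(splits))
```

Under the same model a file smaller than one block gives a single split, which would match the observation that small files do not reproduce the problem.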

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
