[ https://issues.apache.org/jira/browse/HADOOP-6290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804715#action_12804715 ]
Hudson commented on HADOOP-6290:
--------------------------------

Integrated in Hadoop-Common-trunk #229 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/229/])

> AutoInputFormat + (larger) bzip2 files cause multiple runs over same file
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-6290
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6290
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Erik Forsberg
>
> Running a streaming job with the input directory containing a few .bzip2
> files, each with a size of roughly 110MiB (compressed), with -inputformat
> org.apache.hadoop.streaming.AutoInputFormat on the streaming commandline,
> each file is processed twice, i.e., if there are two bzip2 files in the
> directory, four mappers will be run.
> Running a wordcount M/R job, the resulting count is doubled which indicates
> that each input file is analysed twice.
> This was discovered while trying out dumbo, which uses AutoInputFormat by
> default. See
> http://groups.google.com/group/dumbo-user/browse_frm/thread/84b04b2320d4bbb0?hl=en
> It seems this can't be reproduced on small files. It is possible the file has
> to be larger than the DFS blocksize, in my case set to 64MiB.
> I'm using Cloudera's hadoop distribution, version
> 0.18.3-6cloudera0.3.0~intrepid.
> Please let me know if I need to provide further details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.