[ https://issues.apache.org/jira/browse/AVRO-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556685#comment-13556685 ]
Dave Beech commented on AVRO-1234:
----------------------------------
Well, the fact that the old and new APIs differ in their behaviour is clearly
not ideal. But I consider this a bugfix that hasn't been backported rather than
a regression :)
If I had a directory containing a mixture of Avro and non-Avro files, and I
gave that path to AvroInputFormat to process, I'd fully expect the job to die a
horrible death. Silently discarding input feels wrong, especially so if it's
valid Avro which just happens to be named a certain way. A magic bytes check
would be better, but on the whole I'm just not sure a check is necessary. As
far as I'm aware, none of the standard Hadoop input formats behave in this way
(happy to be corrected as I haven't checked them all!).
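For illustration, here is a rough sketch of what a magic bytes check might look like. It relies on the Avro object container file format, which starts every file with the four bytes 'O', 'b', 'j', 1. The class and method names (AvroMagicCheck, hasAvroMagic) are hypothetical and not part of Avro or the attached patch, and it reads from a plain InputStream rather than a Hadoop FileSystem to stay self-contained.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not an Avro or Hadoop class.
final class AvroMagicCheck {
    // Avro object container files begin with the bytes 'O', 'b', 'j', 1.
    private static final byte[] AVRO_MAGIC = {'O', 'b', 'j', 1};

    private AvroMagicCheck() {
    }

    /** Returns true if the stream starts with the Avro container magic bytes. */
    static boolean hasAvroMagic(InputStream in) throws IOException {
        byte[] header = new byte[AVRO_MAGIC.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until the
        // header is full or the stream ends.
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                return false; // shorter than the magic, so not an Avro file
            }
            read += n;
        }
        return Arrays.equals(header, AVRO_MAGIC);
    }
}
```

A caller could open each input file, pass the stream to hasAvroMagic, and fail the job loudly on a mismatch instead of dropping the file. Content would then decide, not the file name.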
> Avro MapReduce jobs silently ignore input data without '.avro' extension
> ------------------------------------------------------------------------
>
> Key: AVRO-1234
> URL: https://issues.apache.org/jira/browse/AVRO-1234
> Project: Avro
> Issue Type: Bug
> Affects Versions: 1.7.3
> Reporter: Dave Beech
> Assignee: Dave Beech
> Attachments: AVRO-1234.patch
>
>
> The AvroInputFormat class explicitly checks each input path for a '.avro'
> extension.
> If only some of the input paths have the correct extension, the remainder are
> silently ignored and not included in the job. However, if none of the input
> paths have the extension, the job will continue and succeed even though no
> map tasks are allocated, and no work is done.
> This only happens using the old mapred API. The new mapreduce API version
> will happily read files regardless of extension.
> Is the check necessary?