[
https://issues.apache.org/jira/browse/HADOOP-6901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rick Weber resolved HADOOP-6901.
--------------------------------
Resolution: Abandoned
Marking as abandoned. This issue is 14 years old, and Dumbo usage is no
longer a problem.
> Parsing large compressed files with HADOOP-1722 spawns multiple mappers per
> file
> --------------------------------------------------------------------------------
>
> Key: HADOOP-6901
> URL: https://issues.apache.org/jira/browse/HADOOP-6901
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.21.0
> Environment: Hadoop v0.20.2 + HADOOP-1722
> Reporter: Rick Weber
> Priority: Major
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> This was originally discovered while using Dumbo to parse a very large
> (2 GB) compressed file. By default, Dumbo will attempt to use
> AutoInputFormat as the input format.
>
> Here is my use case:
> I have a large (2 GB) compressed file. It is compressed with the default
> codec, which is gzip-based and therefore unsplittable. Nevertheless, the
> default implementation of AutoInputFormat reports this file as splittable,
> so, given its size, it is split into about 35 parts and each part is
> assigned to a map task.
>
> However, since the file itself is unsplittable, each map task winds up
> parsing the file again from the beginning. This effectively means the job
> processes 35x the data and takes 35x as long to run.
> If I set "-inputformat text", the problem does not appear in Dumbo. If I
> invoke the streaming jar manually and use AutoInputFormat, the problem
> does appear.
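>
> For reference, a sketch of the workaround invocation (option names are the
> standard Dumbo options as I recall them; the script name and paths are
> placeholders):
>
>   dumbo start wordcount.py -hadoop /usr/lib/hadoop \
>       -input /data/big.gz -output /out \
>       -inputformat text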
> Looking at the code, AutoInputFormat inherits the default isSplitable()
> method from FileInputFormat, which reports everything as splittable. I
> think this class should define its own isSplitable() method, similar to
> the one in TextInputFormat, which treats a file as splittable only when
> it is not compressed.
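>
> As a rough sketch of that fix (the subclass name is hypothetical, and this
> assumes the old mapred API that streaming's input formats extend; the
> override mirrors TextInputFormat's rule):
>
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.compress.CompressionCodecFactory;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.JobConfigurable;
>   import org.apache.hadoop.streaming.AutoInputFormat;
>
>   // Hypothetical subclass; not part of Hadoop. The framework calls
>   // configure() when it instantiates a JobConfigurable input format.
>   public class SplitAwareAutoInputFormat extends AutoInputFormat
>       implements JobConfigurable {
>
>     private CompressionCodecFactory compressionCodecs;
>
>     public void configure(JobConf conf) {
>       // Load the compression codecs registered in the job configuration.
>       compressionCodecs = new CompressionCodecFactory(conf);
>     }
>
>     @Override
>     protected boolean isSplitable(FileSystem fs, Path file) {
>       // Same rule as TextInputFormat: splittable only when no codec
>       // matches the file name, i.e. the file is uncompressed.
>       return compressionCodecs.getCodec(file) == null;
>     }
>   }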
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]