[jira] Commented: (MAPREDUCE-1752) Implement getFileBlockLocations in HarFilesystem

Rodrigo Schmidt (JIRA) Thu, 27 May 2010 14:50:07 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872392#action_12872392
 ]


Rodrigo Schmidt commented on MAPREDUCE-1752:
--------------------------------------------

I've been following this discussion.

I think Dmytro's idea makes a lot of sense, especially for big jobs that read 
from big files. In such cases, the performance gains from local reads 
would easily compensate for the extra delay at setup time.

The idea behind it is to use files stored in Hadoop archives as input for 
MapReduce jobs. I don't think this method will be used elsewhere.

Using HAR to store MapReduce files that are stable (they won't change anymore) 
but are still needed for read queries is a huge win for namenode scalability, 
since it reduces the number of objects the namenode has to keep in memory.
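
To make the idea concrete, here is a minimal, self-contained sketch of the offset translation such a method would need. It is not the attached patch and does not use the Hadoop API: the class HarBlockSketch, the Block record, and the locate method are hypothetical names chosen for illustration. It assumes the archive index has already told us where the archived file starts inside its part file, and that the block list of that part file is known.

```java
import java.util.ArrayList;
import java.util.List;

public class HarBlockSketch {
    /** Hypothetical stand-in for a block location: offset, length, and replica hosts. */
    public record Block(long offset, long length, List<String> hosts) {}

    /**
     * Maps the range [start, start + len) of an archived file onto the blocks
     * of the part file that contains it.
     *
     * @param partBlocks blocks of the underlying part file
     * @param fileStart  offset of the archived file inside the part file
     *                   (assumed to come from the archive index)
     * @return blocks whose offsets are relative to the archived file
     */
    public static List<Block> locate(List<Block> partBlocks, long fileStart,
                                     long start, long len) {
        List<Block> result = new ArrayList<>();
        long rangeStart = fileStart + start;   // requested range in part-file coordinates
        long rangeEnd = rangeStart + len;
        for (Block b : partBlocks) {
            long overlapStart = Math.max(rangeStart, b.offset());
            long overlapEnd = Math.min(rangeEnd, b.offset() + b.length());
            if (overlapStart < overlapEnd) {
                // Keep only the overlapping part, shifted back to file coordinates.
                result.add(new Block(overlapStart - fileStart,
                                     overlapEnd - overlapStart, b.hosts()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Part file with two 100-byte blocks on different hosts (made-up values).
        List<Block> part = List.of(
            new Block(0, 100, List.of("hostA")),
            new Block(100, 100, List.of("hostB")));
        // Archived file occupies bytes [60, 160) of the part file.
        for (Block b : locate(part, 60, 0, 100)) {
            System.out.println(b);
        }
    }
}
```

In this example the archived file straddles both blocks, so a scheduler would see its first 40 bytes on hostA and the remaining 60 bytes on hostB. The extra setup cost mentioned above would come from the index lookup that yields fileStart, which this sketch takes as given.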

> Implement getFileBlockLocations in HarFilesystem
> ------------------------------------------------
>
>                 Key: MAPREDUCE-1752
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1752
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Dmytro Molkov
>            Assignee: Dmytro Molkov
>             Fix For: 0.22.0
>
>         Attachments: MR-1752.patch
>
>
> To efficiently run map reduce on the data that has been HAR'ed it will be 
> great to actually implement getFileBlockLocations for a given filename.
> This way the JobTracker will have information about data locality and will 
> schedule tasks appropriately.
> I believe the overhead introduced by doing lookups in the index files can be 
> smaller than that of copying data over the wire.
> Will upload the patch shortly, but would love to get some feedback on this. 
> And any ideas on how to test it are very welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
