[ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743041#action_12743041 ]
Koji Noguchi commented on MAPREDUCE-865:
----------------------------------------
I believe _masterindex is probably small enough to fit in memory (cache).
For the _index file, 1 million files can correspond to an _index size of 100 MBytes.
(It depends on the path length.)
Creating a local copy could be costly.
In our clusters, most of the files are mapreduce output files.
/a/b/part-00000
/a/b/part-00001
/a/b/part-00002
...
These show up as a set in the _index file, in this order, because
HarFileSystem.getHarHash is written that way.
So instead of doing open->read->close on _index for each part file, I am thinking
of keeping the index file open when possible.
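
To illustrate the idea, here is a minimal sketch in plain Java, not the actual
HarFileSystem code. It assumes a local text file stands in for _index and that
each line starts with the path followed by a space; that line format, the class
name SequentialIndexReader, and the open counter are hypothetical and exist only
for illustration. The reader keeps one stream open and scans forward from its
current position, reopening only when the requested entry is not found ahead.

```java
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical sketch: keeps one index stream open across consecutive lookups. */
class SequentialIndexReader implements Closeable {
    private final Path indexFile;   // local stand-in for the archive's _index file
    private BufferedReader reader;  // kept open between lookups
    private int openCount = 0;      // how many times the file has been opened

    SequentialIndexReader(Path indexFile) {
        this.indexFile = indexFile;
    }

    /** Returns the index line whose first field equals path, or null if absent. */
    String lookup(String path) throws IOException {
        // Cheap case: the entry lies ahead of the current read position.
        String hit = (reader != null) ? scanForward(path) : null;
        if (hit != null) {
            return hit;
        }
        // Otherwise reopen and scan from the beginning of the file.
        reopen();
        return scanForward(path);
    }

    private String scanForward(String path) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            int sp = line.indexOf(' ');
            String key = (sp < 0) ? line : line.substring(0, sp);
            if (key.equals(path)) {
                return line;
            }
        }
        return null;
    }

    private void reopen() throws IOException {
        close();
        reader = Files.newBufferedReader(indexFile, StandardCharsets.UTF_8);
        openCount++;
    }

    int getOpenCount() {
        return openCount;
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
            reader = null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempFile("index", ".txt");
        Files.write(index, Arrays.asList(
                "/a/b/part-00000 entry0",
                "/a/b/part-00001 entry1",
                "/a/b/part-00002 entry2"));
        try (SequentialIndexReader r = new SequentialIndexReader(index)) {
            r.lookup("/a/b/part-00000");
            r.lookup("/a/b/part-00001");
            r.lookup("/a/b/part-00002");
            System.out.println("opens: " + r.getOpenCount());
        } finally {
            Files.delete(index);
        }
    }
}
```

When the part files are requested in the same order in which they appear in the
index, the file is opened once rather than once per file. An out-of-order request
costs one extra open and a rescan from the start.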
> harchive: Reduce the number of open calls to _index and _masterindex
> ----------------------------------------------------------------------
>
> Key: MAPREDUCE-865
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: harchive
> Reporter: Koji Noguchi
> Priority: Minor
>
> When I have a har file with 1000 files in it,
> % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Is there any way to reduce this number?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.