Ahmed Hussein created HADOOP-17362:
--------------------------------------
Summary: Doing hadoop ls on Har file triggers too many RPC calls
Key: HADOOP-17362
URL: https://issues.apache.org/jira/browse/HADOOP-17362
Project: Hadoop Common
Issue Type: Bug
Components: fs
Reporter: Ahmed Hussein
Assignee: Ahmed Hussein
[~daryn] noticed that invoking hadoop ls on a HAR takes too much time.
The har filesystem has multiple deficiencies that significantly impact
performance:
# Parsing the master index references ranges within the archive index. Each
range requires re-opening the HDFS input stream and seeking to the same
location where the previous read stopped.
# Listing a har stats the archive index for every "directory". The per-call
cache uses a unique key for each stat, rendering the cache useless and
significantly increasing memory pressure.
# Determining the children of a directory scans the entire archive contents and
filters for the children, even though the cached metadata already stores the
exact child list.
# Globbing a har's contents results in unnecessary stats for every leaf path.
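
The third deficiency can be illustrated with a minimal, self-contained sketch.
It does not use the real HarFileSystem code; the class HarChildLookupSketch,
the Entry type, and all method names are hypothetical and exist only to
contrast scanning every archive entry with reading a child list that the
cached metadata already holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HarChildLookupSketch {

    /** Hypothetical stand-in for one cached archive index entry. */
    static final class Entry {
        final String path;
        final boolean isDir;
        final List<String> children; // child names, recorded for directories

        Entry(String path, boolean isDir, List<String> children) {
            this.path = path;
            this.isDir = isDir;
            this.children = children;
        }
    }

    /** Returns the parent path of an absolute path ("/" for top-level entries). */
    static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? "/" : path.substring(0, slash);
    }

    /** Slow variant: visits every entry in the archive and keeps the children of dir. */
    static List<String> childrenByScan(Map<String, Entry> index, String dir) {
        List<String> result = new ArrayList<>();
        for (Entry e : index.values()) {
            if (e.path.equals("/")) {
                continue; // the root is not a child of anything
            }
            if (parentOf(e.path).equals(dir)) {
                result.add(e.path);
            }
        }
        return result;
    }

    /** Fast variant: one map lookup, then reads the stored child list. */
    static List<String> childrenFromMetadata(Map<String, Entry> index, String dir) {
        List<String> result = new ArrayList<>();
        Entry parent = index.get(dir);
        if (parent == null || !parent.isDir) {
            return result;
        }
        String prefix = dir.equals("/") ? "/" : dir + "/";
        for (String name : parent.children) {
            result.add(prefix + name);
        }
        return result;
    }

    /** Builds a tiny made-up archive index for demonstration. */
    static Map<String, Entry> sampleIndex() {
        Map<String, Entry> index = new LinkedHashMap<>();
        index.put("/", new Entry("/", true, Arrays.asList("a", "b.txt")));
        index.put("/a", new Entry("/a", true, Arrays.asList("c.txt")));
        index.put("/a/c.txt", new Entry("/a/c.txt", false, Collections.<String>emptyList()));
        index.put("/b.txt", new Entry("/b.txt", false, Collections.<String>emptyList()));
        return index;
    }

    public static void main(String[] args) {
        Map<String, Entry> index = sampleIndex();
        System.out.println(childrenByScan(index, "/"));
        System.out.println(childrenFromMetadata(index, "/"));
    }
}
```

Both variants return the same children, but the scan touches every entry in
the archive for each directory listed, while the metadata lookup touches only
the directory's own entry.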