[
https://issues.apache.org/jira/browse/HDFS-14718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905829#comment-16905829
]
Siyao Meng commented on HDFS-14718:
-----------------------------------
[~jojochuang] Thanks for the comment.
(1) I don't think so. People access JSON objects by key. If someone had been
accessing an object by index while using both WebHDFS and HttpFS, they would
already have noticed the different order.
(2) Theoretically, using LinkedHashMap would be faster, but only by a constant
factor per map. The slowdown isn't related to N (the number of files):
{code:title=FSOperations#toJson & toJsonInner, serializing FileStatuses to JSON
for HttpFS LISTSTATUS response}
private static Map<String, Object> toJson(FileStatus[] fileStatuses,
    boolean isFile) {
  Map<String, Object> json = new TreeMap<>();
  Map<String, Object> inner = new TreeMap<>();
  JSONArray statuses = new JSONArray();
  for (FileStatus f : fileStatuses) {
    statuses.add(toJsonInner(f, isFile));
  }
  inner.put(HttpFSFileSystem.FILE_STATUS_JSON, statuses);
  json.put(HttpFSFileSystem.FILE_STATUSES_JSON, inner);
  return json;
}

private static Map<String, Object> toJsonInner(FileStatus fileStatus,
    boolean emptyPathSuffix) {
  Map<String, Object> json = new TreeMap<String, Object>();
  ...
  json.put(HttpFSFileSystem.PATH_SUFFIX_JSON,
      (emptyPathSuffix) ? "" : fileStatus.getPath().getName());
  ...
}
{code}
Note that the for loop in *FSOperations#toJson* just serializes each
FileStatus entry and adds it to a plain *JSONArray*.
Inside *FSOperations#toJsonInner*, the number of entries inserted for
each FileStatus is a constant (exactly 13 entries for HDFS, for now).
Hence TreeMap will be slower, but only by a constant amount per FileStatus.
It won't be much slower even if there are a million files in a LISTSTATUS
request. Plus, WebHDFS is doing this already (sorting the entries inside
each FileStatus).
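To illustrate both points, here is a minimal standalone sketch. It is not the HttpFS code: the class name MapOrderSketch, the helpers fill and keyOrder, and the sample values are made up for illustration, and only four of the FileStatus keys are used. It shows that TreeMap changes only the key order, while lookups by key return the same values either way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of FSOperations or JsonUtil.
public class MapOrderSketch {

  // Inserts a fixed, small set of keys, the way a per-FileStatus map
  // receives a constant number of entries. Values are made up.
  static Map<String, Object> fill(Map<String, Object> json) {
    json.put("pathSuffix", "a.txt");
    json.put("type", "FILE");
    json.put("length", 1024L);
    json.put("owner", "hdfs");
    return json;
  }

  // Returns the keys in the map's iteration order, which is the order
  // a serializer walking the map would emit them in.
  static List<String> keyOrder(Map<String, Object> json) {
    return new ArrayList<>(json.keySet());
  }

  public static void main(String[] args) {
    Map<String, Object> linked = fill(new LinkedHashMap<>());
    Map<String, Object> sorted = fill(new TreeMap<>());

    // LinkedHashMap keeps insertion order.
    System.out.println(keyOrder(linked)); // [pathSuffix, type, length, owner]
    // TreeMap iterates in sorted key order.
    System.out.println(keyOrder(sorted)); // [length, owner, pathSuffix, type]
    // Access by key is unaffected by the ordering.
    System.out.println(linked.get("owner").equals(sorted.get("owner"))); // true
  }
}
```

Since the number of keys per map is fixed, the extra cost of the sorted insert is bounded per FileStatus, so the total cost of a LISTSTATUS response still grows linearly with the number of files.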
My PR is just a POC for now. We do need to inspect each map change carefully.
Also, I might just narrow the scope of the jira back down to only sorting the
LISTSTATUS entries inside each FileStatus.
> HttpFS: Sort response by key names as WebHDFS does
> --------------------------------------------------
>
> Key: HDFS-14718
> URL: https://issues.apache.org/jira/browse/HDFS-14718
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: httpfs
> Reporter: Siyao Meng
> Assignee: Siyao Meng
> Priority: Major
>
> *Example*
> See description of HDFS-14665 for an example of LISTSTATUS.
> *Analysis*
> WebHDFS is [using a
> TreeMap|https://github.com/apache/hadoop/blob/99bf1dc9eb18f9b4d0338986d1b8fd2232f1232f/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java#L120]
> to serialize HdfsFileStatus, while HttpFS [uses a
> LinkedHashMap|https://github.com/apache/hadoop/blob/6fcc5639ae32efa5a5d55a6b6cf23af06fc610c3/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/fs/http/server/FSOperations.java#L107]
> to serialize FileStatus.
> *Questions*
> Why the difference? Is this intentional?
> - I looked into the Git history. It seems it's simply because WebHDFS has used
> TreeMap from the beginning, and HttpFS has used LinkedHashMap from the beginning.
> This is not limited to LISTSTATUS; it applies to the JSON serialization of
> ALL other requests.
> Now the real question: Could/Should we replace ALL LinkedHashMap with TreeMap
> in HttpFS serialization in the FSOperations class?