[jira] [Commented] (HDFS-14718) HttpFS: Sort response by key names as WebHDFS does

Siyao Meng (JIRA) Mon, 12 Aug 2019 22:16:31 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-14718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905829#comment-16905829
 ]


Siyao Meng commented on HDFS-14718:
-----------------------------------

[~jojochuang] Thanks for the comment.
(1) I don't think so. People access JSON objects by key. If someone had been 
accessing an object by index while using both WebHDFS and HttpFS, they would 
already have noticed the different order.
(2) Theoretically, using LinkedHashMap would be faster, but only by a constant 
factor - the slowdown isn't related to N (number of files):

{code:title=FSOperations#toJson & toJsonInner, serializing FileStatuses to JSON for HttpFS LISTSTATUS response}
  private static Map<String, Object> toJson(FileStatus[] fileStatuses,
      boolean isFile) {
    Map<String, Object> json = new TreeMap<>();
    Map<String, Object> inner = new TreeMap<>();
    JSONArray statuses = new JSONArray();
    for (FileStatus f : fileStatuses) {
      statuses.add(toJsonInner(f, isFile));
    }
    inner.put(HttpFSFileSystem.FILE_STATUS_JSON, statuses);
    json.put(HttpFSFileSystem.FILE_STATUSES_JSON, inner);
    return json;
  }

  private static Map<String, Object> toJsonInner(FileStatus fileStatus,
      boolean emptyPathSuffix) {
    Map<String, Object> json = new TreeMap<String, Object>();
...
    json.put(HttpFSFileSystem.PATH_SUFFIX_JSON,
        (emptyPathSuffix) ? "" : fileStatus.getPath().getName());
...
  }
{code}

Note the for loop in *FSOperations#toJson* just serializes each FileStatus 
entry and appends it to a plain *JSONArray*.
Inside *FSOperations#toJsonInner*, the number of entries inserted for each 
FileStatus is a constant (exactly 13 entries for HDFS, for now). Hence TreeMap 
adds only a constant cost per FileStatus: it will be slower, but not by much, 
even if there are a million files in a LISTSTATUS request. Plus, WebHDFS is 
doing this already (sorting the keys inside each FileStatus).
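
To illustrate both points, here is a minimal standalone sketch. It is not the HttpFS code: the class name KeyOrderDemo, its two helper methods, and the four-key subset of FileStatus keys are made up for illustration. It shows that a TreeMap iterates in sorted key order while a LinkedHashMap keeps insertion order, and it counts key comparisons to show that the TreeMap cost grows linearly with the number of files because each per-file map holds a fixed set of keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

class KeyOrderDemo {
  // Illustrative subset of FileStatus keys, in a made-up insertion order.
  static final String[] KEYS = {"pathSuffix", "type", "length", "owner"};

  // Fills the given (empty) map with KEYS and returns the order in which
  // the map iterates over them.
  static List<String> iterationOrder(Map<String, Object> map) {
    for (String key : KEYS) {
      map.put(key, "");
    }
    return new ArrayList<>(map.keySet());
  }

  // Builds one small TreeMap per simulated file and returns the total
  // number of key comparisons performed across all of them.
  static long comparisonsFor(int numFiles) {
    AtomicLong count = new AtomicLong();
    Comparator<String> counting = (a, b) -> {
      count.incrementAndGet();
      return a.compareTo(b);
    };
    for (int i = 0; i < numFiles; i++) {
      Map<String, Object> json = new TreeMap<>(counting);
      for (String key : KEYS) {
        json.put(key, "");
      }
    }
    return count.get();
  }

  public static void main(String[] args) {
    // Insertion order: [pathSuffix, type, length, owner]
    System.out.println(iterationOrder(new LinkedHashMap<>()));
    // Sorted order: [length, owner, pathSuffix, type]
    System.out.println(iterationOrder(new TreeMap<>()));
    // The per-file comparison count is the same for every file, so the
    // total is numFiles times the count for a single file.
    System.out.println(comparisonsFor(1));
    System.out.println(comparisonsFor(1_000_000));
  }
}
```

The exact comparison count per file depends on the JDK's TreeMap implementation, so the sketch only relies on it being the same for every file.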

My PR is just a POC for now. We do need to inspect each map change carefully. 
Also, I might just narrow the scope of this jira back down to only sorting the 
keys inside each FileStatus of a LISTSTATUS response.

> HttpFS: Sort response by key names as WebHDFS does
> --------------------------------------------------
>
>                 Key: HDFS-14718
>                 URL: https://issues.apache.org/jira/browse/HDFS-14718
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: httpfs
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>
> *Example*
> See description of HDFS-14665 for an example of LISTSTATUS.
> *Analysis*
> WebHDFS is [using a 
> TreeMap|https://github.com/apache/hadoop/blob/99bf1dc9eb18f9b4d0338986d1b8fd2232f1232f/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/web/JsonUtil.java#L120]
>  to serialize HdfsFileStatus, while HttpFS [uses a 
> LinkedHashMap|https://github.com/apache/hadoop/blob/6fcc5639ae32efa5a5d55a6b6cf23af06fc610c3/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/fs/http/server/FSOperations.java#L107]
>  to serialize FileStatus.
> *Questions*
> Why the difference? Is this intentional?
> - I looked into the Git history. It seems it's simply because WebHDFS has 
> used TreeMap from the beginning, and HttpFS has used LinkedHashMap from the 
> beginning. It is not limited to LISTSTATUS, but applies to ALL other 
> requests' JSON serialization.
> Now the real question: Could/Should we replace ALL LinkedHashMap with TreeMap 
> in HttpFS serialization in the FSOperations class?



