[ https://issues.apache.org/jira/browse/HDFS-15987?focusedWorklogId=655367&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-655367 ]
ASF GitHub Bot logged work on HDFS-15987:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 27/Sep/21 07:42
Start Date: 27/Sep/21 07:42
Worklog Time Spent: 10m
Work Description: whbing commented on a change in pull request #2918:
URL: https://github.com/apache/hadoop/pull/2918#discussion_r716436486
##########
File path:
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/PBImageTextWriter.java
##########
@@ -822,4 +968,79 @@ public int getStoragePolicy(
}
return HdfsConstants.BLOCK_STORAGE_POLICY_ID_UNSPECIFIED;
}
+
+ private ArrayList<FileSummary.Section> getINodeSubSections(
+ ArrayList<FileSummary.Section> sections) {
+ ArrayList<FileSummary.Section> subSections = new ArrayList<>();
+ Iterator<FileSummary.Section> iter = sections.iterator();
+ while (iter.hasNext()) {
+ FileSummary.Section s = iter.next();
+ if (SectionName.fromString(s.getName()) == SectionName.INODE_SUB) {
+ subSections.add(s);
+ }
+ }
+ return subSections;
+ }
+
+ /**
+ * Given an FSImage FileSummary.Section, return a LimitInputStream set to
+ * the starting position of the section and limited to the section length.
+ * @param section The FileSummary.Section containing the offset and length
+ * @param compressionCodec The compression codec in use, if any
+ * @return An InputStream for the given section
+ * @throws IOException if the section cannot be opened or positioned
+ */
+ private InputStream getInputStreamForSection(FileSummary.Section section,
+ String compressionCodec, Configuration conf)
+ throws IOException {
+ // The channel of RandomAccessFile is not thread-safe, so open a
+ // dedicated FileInputStream per section instead.
+ FileInputStream fin = new FileInputStream(filename);
Review comment:
> Maybe it is unnecessary to change here.
@ferhui Thanks for the detailed review.
Because the channel of a `RandomAccessFile` is not thread-safe, it is avoided
here. I haven't found a more elegant approach for the time being.
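For reference, here is a minimal sketch of how the remainder of
`getInputStreamForSection` could look under that approach, assuming Hadoop's
`LimitInputStream` and `CompressionCodecFactory`; the actual PR code may
differ in its details:
{code:java}
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.util.LimitInputStream;

// Sketch only, not the exact PR change: each call opens its own
// FileInputStream, so concurrent threads never share a channel position.
private InputStream getInputStreamForSection(FileSummary.Section section,
    String compressionCodec, Configuration conf) throws IOException {
  FileInputStream fin = new FileInputStream(filename);
  try {
    // Seek to the start of this sub-section.
    fin.getChannel().position(section.getOffset());
    // Cap reads at the section length.
    InputStream in = new BufferedInputStream(
        new LimitInputStream(fin, section.getLength()));
    if (compressionCodec != null) {
      // Wrap with the codec recorded in the FileSummary, if any.
      in = new CompressionCodecFactory(conf)
          .getCodecByName(compressionCodec).createInputStream(in);
    }
    return in;
  } catch (IOException e) {
    fin.close();
    throw e;
  }
}
{code}
Opening a fresh stream per section costs one file descriptor per worker, but
it avoids any locking around a shared channel position, which is the trade-off
behind avoiding `RandomAccessFile` here.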
Issue Time Tracking
-------------------
Worklog Id: (was: 655367)
Time Spent: 3.5h (was: 3h 20m)
> Improve oiv tool to parse fsimage file in parallel with delimited format
> ------------------------------------------------------------------------
>
> Key: HDFS-15987
> URL: https://issues.apache.org/jira/browse/HDFS-15987
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Hongbing Wang
> Assignee: Hongbing Wang
> Priority: Major
> Labels: pull-request-available
> Attachments: Improve_oiv_tool_001.pdf
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool to parse fsimage files
> with sub-sections (see HDFS-14617) in parallel when using the Delimited
> format.
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage in Delimited format (e.g. `hdfs
> oiv -p Delimited -t <tmp> ...`) is as follows:
> {code:java}
> 1) Loading string table: -> Not time consuming
> 2) Loading inode references: -> Not time consuming
> 3) Loading directories in INode section: -> Slightly time consuming (3%)
> 4) Loading INode directory section: -> A bit time consuming (11%)
> 5) Output: -> Very time consuming (86%){code}
> Therefore, the output stage is the most worthwhile to parallelize.
> 2. How to output in parallel
> The sub-sections are grouped in order; each thread processes one group and
> writes to its own output file, and finally the per-thread output files are
> merged, as sketched below.
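> A minimal sketch of that fan-out/merge scheme follows; `writeGroup` and
> `mergeParts` are hypothetical helpers for illustration, not the PR's actual
> method names:
> {code:java}
> import java.io.File;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.Future;
>
> void outputInParallel(List<FileSummary.Section> subSections, File tmpDir,
>     File finalOutput, int numThreads) throws Exception {
>   ExecutorService pool = Executors.newFixedThreadPool(numThreads);
>   List<Future<?>> futures = new ArrayList<>();
>   // Split the ordered sub-sections into one contiguous group per thread.
>   int per = (subSections.size() + numThreads - 1) / numThreads;
>   for (int t = 0; t < numThreads; t++) {
>     List<FileSummary.Section> group = subSections.subList(
>         Math.min(t * per, subSections.size()),
>         Math.min((t + 1) * per, subSections.size()));
>     File part = new File(tmpDir, "part-" + t);
>     // Each worker parses its group and writes rows to its own part file.
>     futures.add(pool.submit(() -> writeGroup(group, part)));
>   }
>   for (Future<?> f : futures) {
>     f.get(); // surface any worker failure
>   }
>   pool.shutdown();
>   // Concatenate the part files in thread order into the final output.
>   mergeParts(tmpDir, numThreads, finalOutput);
> }
> {code}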
> 3. The result of a test
> {code:java}
> input fsimage file info:
> 3.4G, 12 sub-sections, 55976500 INodes
> -----------------------------------------
> Threads  TotalTime  OutputTime  MergeTime
> 1        18m37s     16m18s      –
> 4        8m7s       4m49s       41s{code}