[ 
https://issues.apache.org/jira/browse/HDFS-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068374#comment-16068374
 ] 

Hudson commented on HDFS-11881:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11950 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11950/])
HDFS-11881. NameNode consumes a lot of memory for snapshot diff report 
(weichiu: rev 16c8dbde574f49827fde5ee9add1861ee65d4645)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/snapshot/SnapshotDiffInfo.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestSnapshotCommands.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocolPB/PBHelperClient.java


> NameNode consumes a lot of memory for snapshot diff report generation
> ---------------------------------------------------------------------
>
>                 Key: HDFS-11881
>                 URL: https://issues.apache.org/jira/browse/HDFS-11881
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, snapshots
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>             Fix For: 2.9.0, 3.0.0-alpha4, 2.8.2
>
>         Attachments: 1_ChunkedArrayList_SnapshotDiffReport.png, 
> 2_ArrayList_SnapshotDiffReport.png, HDFS-11881.01.patch
>
>
> *Problem:*
> HDFS supports a snapshot diff tool which can generate a [detailed report | 
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html#Get_Snapshots_Difference_Report]
>  of modified, created, deleted and renamed files between any 2 snapshots.
> {noformat}
> hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>
> {noformat}
> However, if the diff list between 2 snapshots happens to be huge, in the 
> order of millions, then NameNode can consume a lot of memory while generating 
> the huge diff report. In a few cases, we are seeing NameNode getting into a 
> long GC lasting for few minutes to make room for this burst in memory 
> requirement during snapshot diff report generation.
> *RootCause:*
> * NameNode tries to generate the diff report with all diff entries at once 
> which puts undue pressure 
> * Each diff report entry has the diff type (enum), source path byte array, 
> and destination path byte array to the minimum. Let's take file deletions use 
> case. For file deletions, there would be only source or destination paths in 
> the diff report entry. Let's assume these deleted files on average take 
> 128Bytes for the path. 4 million file deletion captured in diff report will 
> thus need 512MB of memory 
> * The snapshot diff report uses simple java ArrayList which tries to double 
> its backing contiguous memory chunk every time the usage factor crosses the 
> capacity threshold. So, a 512MB memory requirement might be internally asking 
> for a much larger contiguous memory chunk
> *Proposal:*
> * Make NameNode snapshot diff report service follow the batch model (like 
> directory listing service). Clients (hdfs snapshotDiff command) will then 
> receive  diff report in small batches, and need to iterate several times to 
> get the full list.
> * Additionally, snap diff report service in the NameNode can make use of 
> ChunkedArrayList data structure instead of the current ArrayList so as to 
> avoid the curse of fragmentation and large contiguous memory requirement.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to