[
https://issues.apache.org/jira/browse/HDFS-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangyi Zhu resolved HDFS-16019.
--------------------------------
Resolution: Later
> HDFS: Inode CheckPoint
> -----------------------
>
> Key: HDFS-16019
> URL: https://issues.apache.org/jira/browse/HDFS-16019
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namanode
> Affects Versions: 3.3.1
> Reporter: Xiangyi Zhu
> Assignee: Xiangyi Zhu
> Priority: Major
>
> *Background*
> The OIV IMAGE analysis tool has brought us many benefits, such as analysis
> of file size distribution, hot and cold data, and abnormally growing
> directories. In my opinion, however, it is too slow, especially for a large
> IMAGE.
> After Hadoop 2.3, the format of the IMAGE changed. The OIV tool now has to
> load the entire IMAGE into memory before it can output the inode
> information as text. For a large IMAGE, this takes a long time, consumes
> more resources, and requires a machine with a large amount of memory.
> HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter, which
> produces an IMAGE in the old format at each CheckPoint. Parsing the old
> format does not require many resources, but we still need to parse the IMAGE
> again with the hdfs oiv_legacy command to get the Inode information as text,
> which is relatively time-consuming.
>
> *Solution*
> We can have the standby NameNode periodically checkpoint the Inodes and
> serialize them as text. The output can go to different FileSystems depending
> on the configuration, such as the local file system or HDFS. The advantage
> of writing to HDFS is that we can analyze the Inodes directly with
> Spark or Hive. I think the block information for each Inode is probably of
> little use; the file size and the replication factor are more useful to us.
> In addition, the Inodes do not need to be output in order. We can speed up
> the Inode CheckPoint by partitioning the serialized Inodes across different
> output files: a producer thread puts Inodes into a queue, and multiple
> consumer threads take them from the queue and write them to different
> partition files. The output files can also be compressed to reduce disk I/O.
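
The producer/consumer scheme described above could look roughly like the following. This is only a minimal sketch, not the proposed patch: InodeTextDump, InodeRecord, the tab-separated line layout, the part-NNNNN.gz file names, and the queue size are all assumptions made for illustration, and it writes to the local file system with plain java.nio rather than through the Hadoop FileSystem API. Each consumer thread owns one gzip-compressed partition file, so the output order is not preserved.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPOutputStream;

class InodeTextDump {
    /** Hypothetical flattened inode record; this is not an HDFS class. */
    static final class InodeRecord {
        final long id;
        final String path;
        final long length;
        final short replication;

        InodeRecord(long id, String path, long length, short replication) {
            this.id = id;
            this.path = path;
            this.length = length;
            this.replication = replication;
        }

        String toLine() {
            return id + "\t" + path + "\t" + length + "\t" + replication;
        }
    }

    // Sentinel that tells one consumer that no more records will arrive.
    private static final InodeRecord POISON = new InodeRecord(-1, "", 0, (short) 0);

    /** Writes all records into numPartitions gzip text files and returns their paths. */
    static List<Path> dump(Iterable<InodeRecord> inodes, Path outDir, int numPartitions)
            throws IOException, InterruptedException {
        Files.createDirectories(outDir);
        BlockingQueue<InodeRecord> queue = new ArrayBlockingQueue<>(1024);
        List<Path> files = new ArrayList<>();
        List<Thread> consumers = new ArrayList<>();

        // One consumer thread per partition file.
        for (int i = 0; i < numPartitions; i++) {
            Path file = outDir.resolve(String.format("part-%05d.gz", i));
            files.add(file);
            Thread t = new Thread(() -> consume(queue, file), "inode-writer-" + i);
            consumers.add(t);
            t.start();
        }

        // Single producer: enqueue every inode, then one sentinel per consumer.
        Thread producer = new Thread(() -> {
            try {
                for (InodeRecord inode : inodes) {
                    queue.put(inode);
                }
                for (int i = 0; i < numPartitions; i++) {
                    queue.put(POISON);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "inode-producer");
        producer.start();

        producer.join();
        for (Thread t : consumers) {
            t.join();
        }
        return files;
    }

    // Drains the shared queue into one compressed file until the sentinel is seen.
    private static void consume(BlockingQueue<InodeRecord> queue, Path file) {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (InodeRecord r = queue.take(); r != POISON; r = queue.take()) {
                out.write(r.toLine());
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling InodeTextDump.dump(records, Paths.get("/tmp/inode-dump"), 4) would leave four part-*.gz files in that directory, each holding whichever records its thread happened to take. Error handling is deliberately minimal here; a real implementation would need to stop the producer if a writer fails, and would probably assign records to partitions by a key rather than by thread scheduling.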