[
https://issues.apache.org/jira/browse/HDFS-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341865#comment-17341865
]
zhu commented on HDFS-16019:
----------------------------
[~weichiu] Thanks for your comments.
This improvement is different from NameNode Analytics: it skips the "image
processing" step of the OIV tool entirely and uses the checkpoint process to
generate a text file that can be analyzed directly. It does not provide
complex image-analysis functions, and it does not require an additional
server; it only needs to be enabled on the standby node.
> HDFS: Inode CheckPoint
> -----------------------
>
> Key: HDFS-16019
> URL: https://issues.apache.org/jira/browse/HDFS-16019
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.1
> Reporter: zhu
> Assignee: zhu
> Priority: Major
>
> *Background*
> The OIV image analysis tool has brought us many benefits, such as file-size
> distribution, hot/cold data analysis, and detection of abnormally growing
> directories. But in my opinion it is too slow, especially for a large image.
> Since Hadoop 2.3 the image format has changed: the OIV tool must load the
> entire image into memory in order to output the inode information in text
> format. For a large image this process takes a long time, consumes
> considerable resources, and requires a machine with a large amount of memory.
> HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter to produce
> an old-format image at checkpoint time, and parsing the old format does not
> require many resources. However, we still need to parse that image again with
> the hdfs oiv_legacy command to get the inode information as text, which is
> relatively time-consuming.
> *Solution*
> We can have the standby node periodically checkpoint the inodes and
> serialize them in text form. For output, different FileSystems can be used
> according to configuration, such as the local file system or HDFS. The
> advantage of writing to HDFS is that the inode data can then be analyzed
> directly with Spark/Hive. The block information for each inode is probably
> not very useful; the file size and the replication factor matter more to us.
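As a rough illustration of what "serialize the inode in text mode" could look like, here is a minimal sketch that renders one inode record as a tab-separated line suitable for loading into Hive/Spark as a text table. The field names and their order (inode_id, path, replication, file_size, mtime) are illustrative assumptions, not the format actually proposed in HDFS-16019.

```python
# Hedged sketch: one possible text serialization of an inode record.
# The chosen fields are assumptions for illustration only.

def serialize_inode(inode_id, path, replication, file_size, mtime):
    """Render one inode as a single tab-separated line so the checkpoint
    output can be queried directly as an external text table."""
    fields = [str(inode_id), path, str(replication), str(file_size), str(mtime)]
    return "\t".join(fields)

# Example: a 128 MB file with replication factor 3.
line = serialize_inode(16385, "/user/data/part-0000", 3, 134217728, 1620000000)
```

A delimited format like this keeps the consumer side trivial: no image parsing, just a schema-on-read table definition over the output directory.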
> In addition, sequential output of the inodes is not necessary. We can speed
> up the inode checkpoint by partitioning the serialized inodes across
> multiple output files: a producer thread puts inodes into a queue, and
> multiple consumer threads drain the queue and write to the different
> partition files. The output files can also be compressed to reduce disk IO.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)