zhu created HDFS-16019:
--------------------------

             Summary: HDFS: Inode CheckPoint 
                 Key: HDFS-16019
                 URL: https://issues.apache.org/jira/browse/HDFS-16019
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
    Affects Versions: 3.3.1
            Reporter: zhu
            Assignee: zhu


*background*
The OIV IMAGE analysis tool has brought us many benefits, such as file size 
distribution, hot/cold data analysis, and detection of abnormally growing 
directories. But in my opinion it is too slow, especially for a big IMAGE.
After Hadoop 2.3 the IMAGE format changed, and the OIV tool now has to load the 
entire IMAGE into memory to output the inode information in text format. For a 
large IMAGE this process takes a long time, consumes considerable resources, 
and requires a machine with a large amount of memory.
HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter to produce 
the old-format IMAGE at CheckPoint time, and parsing the old IMAGE does not 
require many resources. However, we still have to parse the IMAGE again with 
the hdfs oiv_legacy command to get the inode information as text, which is 
relatively time-consuming.

*Solution*
We can have the standby node periodically checkpoint the inodes and serialize 
them in text form. For output, different FileSystems can be used according to 
the configuration, such as the local file system or HDFS.
The advantage of writing to HDFS is that we can analyze the inodes directly 
with Spark/Hive. I think the block information for each inode may not be of 
much use; the file size and the replication factor are more useful to us.
In addition, the inodes do not need to be written out in order. We can speed up 
the inode CheckPoint by partitioning the serialized inodes across several output 
files: a producer thread puts inodes into a queue, and multiple consumer threads 
drain the queue and write to their own partition files (see the sketch below). 
The output files can also be compressed to reduce disk IO.
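A minimal sketch of the producer/consumer partitioning, assuming the inode 
records have already been rendered as text lines. The output directory, the 
partition count, the GzipCodec choice, and the end-of-stream marker are 
illustrative choices, not part of HDFS.
{code:java}
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class PartitionedInodeWriter {
  private static final String POISON = "__END__"; // shutdown marker for consumers

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outDir = new Path("/tmp/inode-checkpoint"); // hypothetical output directory
    int partitions = 4;

    BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    // Consumers: each one owns a single compressed partition file.
    List<Thread> consumers = new ArrayList<>();
    for (int p = 0; p < partitions; p++) {
      Path part = new Path(outDir, "inodes-part-" + p + codec.getDefaultExtension());
      Thread t = new Thread(() -> {
        try (BufferedWriter w = new BufferedWriter(new OutputStreamWriter(
            codec.createOutputStream(fs.create(part, true)), StandardCharsets.UTF_8))) {
          String line;
          while (!(line = queue.take()).equals(POISON)) {
            w.write(line);
            w.newLine();
          }
        } catch (IOException | InterruptedException e) {
          throw new RuntimeException(e);
        }
      });
      t.start();
      consumers.add(t);
    }

    // Producer: in the real feature this would iterate the standby's inodes;
    // a few fake records stand in for them here.
    for (int i = 0; i < 10; i++) {
      queue.put("/user/demo/file-" + i + "\tFILE\t" + (i * 1024) + "\t3");
    }
    for (int p = 0; p < partitions; p++) {
      queue.put(POISON); // one marker per consumer so every thread exits
    }
    for (Thread t : consumers) {
      t.join();
    }
  }
}
{code}
Because the queue carries unordered records, any consumer can pick up any 
inode, which is what removes the sequential-output constraint.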



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
