[ 
https://issues.apache.org/jira/browse/HDFS-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900973#comment-13900973
 ] 

Haohui Mai commented on HDFS-5952:
----------------------------------

It is difficult for the delimited processor and lsr to fully support snapshots, 
because the tools need to load the full inode information into memory. That 
could be infeasible for production fsimages (16G fsimages are quite common).

The motivation for the delimited processor is to run data analysis on the fsimage. 
The design of the PB-based fsimage strives to flatten the hierarchy so that it 
is feasible to map the analysis problems into JOIN queries.
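To make the JOIN idea concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive. The two tables (inode, dirent), their columns, and the sample rows are hypothetical; they only loosely mirror the idea of storing inodes and parent-to-child entries as separate flat records, and are not the actual fsimage schema.

```python
import sqlite3

# Hypothetical flattened layout: one row per inode, one row per
# (parent, child) directory entry. Not the real fsimage schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inode (id INTEGER PRIMARY KEY, name TEXT, owner TEXT, filesize INTEGER);
    CREATE TABLE dirent (parent_id INTEGER, child_id INTEGER);
    INSERT INTO inode VALUES
        (1, 'logs', 'hdfs', 0),
        (2, 'a.log', 'alice', 100),
        (3, 'b.log', 'bob', 250);
    INSERT INTO dirent VALUES (1, 2), (1, 3);
""")


def children_of(conn, dir_name):
    """List (name, owner, filesize) for the direct children of a directory."""
    rows = conn.execute(
        """
        SELECT c.name, c.owner, c.filesize
        FROM inode p
        JOIN dirent d ON d.parent_id = p.id
        JOIN inode c ON c.id = d.child_id
        WHERE p.name = ?
        ORDER BY c.name
        """,
        (dir_name,),
    )
    return rows.fetchall()


print(children_of(conn, "logs"))
```

Because the hierarchy is stored as flat rows, listing a directory becomes a two-way join rather than a tree walk over in-memory inodes.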

Therefore, there might be more value in creating a tool that reads the PB format 
directly and dumps the data straight into Hive. Such a tool avoids converting 
data between protobuf, text, and the database format, which can significantly 
boost the efficiency of the analysis pipeline.

Putting the data in Hive also allows getting the stats with very little code. For 
example, the following query reports the usage of each user in a particular 
directory:

{code}
select user, sum(filesize) from inode where inode.parentId = 'foo' group by user
{code}
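For anyone who wants to try this kind of query locally, the sketch below runs a per-user aggregation against an in-memory SQLite table instead of Hive. The inode table layout (id, parentId, user, filesize) is assumed from the query above, the sample rows are made up, and usage_by_user is a hypothetical helper. Note that it also selects the user column so each sum is labelled.

```python
import sqlite3


def usage_by_user(rows, parent_id):
    """Sum filesize per user for inodes whose parentId equals parent_id.

    rows: iterable of (id, parentId, user, filesize) tuples (assumed layout).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inode (id INTEGER, parentId TEXT, user TEXT, filesize INTEGER)"
    )
    conn.executemany("INSERT INTO inode VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT user, SUM(filesize) FROM inode "
        "WHERE parentId = ? GROUP BY user ORDER BY user",
        (parent_id,),
    ).fetchall()
    conn.close()
    return dict(result)


# Made-up sample data: three files under 'foo', one under 'bar'.
sample = [
    (2, "foo", "alice", 100),
    (3, "foo", "bob", 250),
    (4, "foo", "alice", 50),
    (5, "bar", "bob", 999),
]
print(usage_by_user(sample, "foo"))
```

This only counts direct children of the directory; a recursive total would need an extra join or a precomputed path column.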

> Implement delimited processor in OfflineImageViewer
> ---------------------------------------------------
>
>                 Key: HDFS-5952
>                 URL: https://issues.apache.org/jira/browse/HDFS-5952
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: tools
>    Affects Versions: 3.0.0
>            Reporter: Akira AJISAKA
>            Assignee: Akira AJISAKA
>
> Delimited processor is not supported after HDFS-5698 was merged.
> The processor is useful for analyzing the output by scripts such as pig.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
