[jira] [Comment Edited] (FLUME-1669) Add support for columnar event serializer in HDFS

alex gemini (JIRA) Tue, 06 Nov 2012 05:30:16 -0800

    [ https://issues.apache.org/jira/browse/FLUME-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491442#comment-13491442 ]


alex gemini edited comment on FLUME-1669 at 11/6/12 1:29 PM:
-------------------------------------------------------------

Usually the block size of the sequence file or Avro file format is quite small, 
while a columnar format only pays off when the block size is quite large, 
usually a few GB at minimum (see the Trevni spec, "Design" section, line 2). 
It's not practical to hold that much data in memory, considering a service 
crash or a configuration reload. It's better to write sequence or Avro files to 
a directory and then, at some point, merge that directory into columnar format 
when Flume rolls to the next directory. Another thing to note is that the query 
engines (Hive, Pig and others) currently don't support one directory containing 
two different file formats, but Hive does support one table containing two 
partitions with different file formats. So I think maybe Flume should monitor 
two directories. One is the directory currently being written, where Flume 
writes small Avro or sequence files with multiple writers. When the data stream 
rolls to the next directory, Flume merges those Avro or sequence files into 
Trevni columnar format using MR. While the MR job is running, the columnar 
directory lives in temp space; only if the whole directory converts 
successfully is the previous row-format directory deleted and the new columnar 
directory renamed to the previous row-format directory's name. I'm not quite 
sure this was clearly expressed, so feel free to add comments.
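
The convert-then-swap step described above might look roughly like the sketch 
below. It runs on a local filesystem using only the Python standard library, 
not on HDFS. The names convert_and_swap and merge_files are hypothetical, and 
merge_files is only a placeholder for the MR job: it concatenates files and 
does not produce Trevni output.

```python
import os
import shutil
import tempfile


def convert_and_swap(row_dir, convert):
    """Convert row_dir into a temp directory, then swap it in on success.

    `convert(src_dir, dst_dir)` is a hypothetical callable standing in for
    the MR job; it must raise an exception if any file fails to convert.
    """
    row_dir = os.path.abspath(row_dir)
    parent = os.path.dirname(row_dir)
    # Build the output next to the row directory so the final rename
    # stays on the same filesystem.
    tmp_dir = tempfile.mkdtemp(prefix=".columnar-tmp-", dir=parent)
    try:
        convert(row_dir, tmp_dir)
    except Exception:
        # Conversion failed: discard partial output, keep the row data.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    # Whole directory converted: delete the row files, take over the name.
    shutil.rmtree(row_dir)
    os.rename(tmp_dir, row_dir)
    return row_dir


def merge_files(src_dir, dst_dir):
    """Placeholder converter: concatenates the small files into one file."""
    with open(os.path.join(dst_dir, "part-00000.merged"), "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
```

Note that deleting and then renaming is two steps, so a reader that lists the 
directory between them would briefly see nothing. Whether that window matters 
depends on how the query engine picks up the partition.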
                
      was (Author: gemini5201314):
    Usually the block size of the sequence file or Avro file format is quite 
small, while a columnar format only pays off when the block size is quite 
large, usually a few GB at minimum (see the Trevni spec, "Design" section, 
line 2). It's not practical to hold that much data in memory, considering a 
service crash or a configuration reload. It's better to write sequence or Avro 
files to a directory and then, at some point, merge that directory into 
columnar format when Flume rolls to the next directory. Another thing to note 
is that the query engines (Hive, Pig and others) currently don't support one 
directory containing two different file formats, but Hive does support one 
table containing two partitions with different file formats. So I think maybe 
Flume should monitor two directories. One is the directory currently being 
written, where Flume writes small Avro or sequence files with multiple writers. 
When the data stream rolls to the next directory, Flume merges those Avro or 
sequence files into Trevni columnar format, maybe using MR.
                  
> Add support for columnar event serializer in HDFS
> -------------------------------------------------
>
>                 Key: FLUME-1669
>                 URL: https://issues.apache.org/jira/browse/FLUME-1669
>             Project: Flume
>          Issue Type: New Feature
>          Components: Sinks+Sources
>            Reporter: Mubarak Seyed
>            Assignee: Mubarak Seyed
>              Labels: noob
>             Fix For: v1.4.0
>
>
> Motivation:
> Columnar storage is preferred for better performance and compression for 
> low-latency analytical workloads. Avro 1.7.2 supports column-major file 
> format [1]
> and we can implement {{AbstractTrevniAvroEventSerializer}} (much like 
> {{AbstractAvroEventSerializer}}). {{HDFSSink}} can then offer a serializer 
> type that stores events in the Trevni column-major file format.
> [1]    http://avro.apache.org/docs/current/trevni/spec.html
>        https://github.com/cutting/trevni

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
