[
https://issues.apache.org/jira/browse/FLUME-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491442#comment-13491442
]
alex gemini edited comment on FLUME-1669 at 11/6/12 1:29 PM:
-------------------------------------------------------------
Sequence file and Avro file block sizes are usually quite small, whereas a
columnar format only pays off when the block size is large, usually a few GB
at minimum (see the Trevni spec, "Design" section, line 2). It's not practical
to hold that much data in memory, considering that the service may crash or
reload its configuration. It's better to write sequence or Avro files to a
directory and then, when Flume rolls to the next directory, merge that
directory into the columnar format. Another thing to note is that the query
engines (Hive, Pig and others) currently don't support a directory containing
two different file formats, but Hive does support a table whose partitions use
different file formats. So I think Flume should maybe manage two directories,
one of them the directory currently being written to, where multiple writers
produce small Avro or sequence files. When the data stream rolls to the next
directory, Flume merges those Avro or sequence files into the Trevni columnar
format using MR. While the MR job runs, the columnar directory lives in temp
space; only if the whole directory converts successfully is the previous
row-format directory deleted and the new columnar directory renamed to the
previous row-format directory's name. I'm not sure I expressed this clearly,
so feel free to comment.
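
The convert-then-swap step described above can be sketched as follows. This is
only an illustration on a local filesystem in Python, not Flume or HDFS code:
the function name convert_and_swap, the convert callback (a stand-in for the
MR job) and the ".trv" output suffix are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def convert_and_swap(row_dir: Path, convert) -> None:
    """Convert every file in row_dir into a temp directory, then swap it in.

    `convert(src, dst)` is a hypothetical callback standing in for the MR job
    that rewrites one row-format file as one columnar file.
    """
    # Build the columnar output next to the row-format directory so the final
    # rename stays on the same filesystem.
    tmp_dir = Path(
        tempfile.mkdtemp(prefix=row_dir.name + ".", suffix=".tmp", dir=row_dir.parent)
    )
    try:
        for src in sorted(row_dir.iterdir()):
            # ".trv" is an illustrative suffix for the columnar output files.
            convert(src, tmp_dir / (src.stem + ".trv"))
    except Exception:
        # Any failure leaves the row-format directory untouched.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    # Only after the whole directory converted: delete the row-format
    # directory and give the columnar directory its name.
    shutil.rmtree(row_dir)
    tmp_dir.rename(row_dir)
```

Note that the delete and the rename are two separate operations, so a reader
could briefly see no directory at all; a real implementation would have to
decide how queries should behave during that window.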
was (Author: gemini5201314):
Sequence file and Avro file block sizes are usually quite small, whereas a
columnar format only pays off when the block size is large, usually a few GB
at minimum (see the Trevni spec, "Design" section, line 2). It's not practical
to hold that much data in memory, considering that the service may crash or
reload its configuration. It's better to write sequence or Avro files to a
directory and then, when Flume rolls to the next directory, merge that
directory into the columnar format. Another thing to note is that the query
engines (Hive, Pig and others) currently don't support a directory containing
two different file formats, but Hive does support a table whose partitions use
different file formats. So I think Flume should maybe manage two directories,
one of them the directory currently being written to, where multiple writers
produce small Avro or sequence files. When the data stream rolls to the next
directory, Flume merges those Avro or sequence files into the Trevni columnar
format, maybe using MR.
> Add support for columnar event serializer in HDFS
> -------------------------------------------------
>
> Key: FLUME-1669
> URL: https://issues.apache.org/jira/browse/FLUME-1669
> Project: Flume
> Issue Type: New Feature
> Components: Sinks+Sources
> Reporter: Mubarak Seyed
> Assignee: Mubarak Seyed
> Labels: noob
> Fix For: v1.4.0
>
>
> Motivation:
> Columnar storage is preferred for better performance and compression for
> low-latency analytical workloads. Avro 1.7.2 supports a column-major file
> format [1], and we can implement {{AbstractTrevniAvroEventSerializer}}
> (similar to {{AbstractAvroEventSerializer}}). {{HDFSSink}} can then offer a
> serializer type that stores events in the Trevni column-major file format.
> [1] http://avro.apache.org/docs/current/trevni/spec.html
> https://github.com/cutting/trevni