I wouldn't recommend writing directly from Flume to Parquet. Parquet can't guarantee that data is on disk until a file is closed, so you end up with long-running transactions that back up into your file channel. Plus, if you are writing to a partitioned dataset, you end up with several open files (one per partition) and huge memory consumption. I recommend first writing to Avro and then using a batch job to convert to Parquet.
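
To illustrate the conversion step, here's a minimal sketch of such a batch job using the parquet-avro API, assuming the Path-based AvroParquetWriter constructor that was current at the time. The file names and class name are made up, and depending on your Parquet version the package may be parquet.avro rather than org.apache.parquet.avro:

    // Minimal sketch: read an Avro data file, rewrite it as Parquet.
    // Input/output paths here are hypothetical.
    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;

    public class AvroToParquet {
      public static void main(String[] args) throws Exception {
        File avroFile = new File("events.avro");        // hypothetical input
        Path parquetFile = new Path("events.parquet");  // hypothetical output

        DataFileReader<GenericRecord> reader = new DataFileReader<>(
            avroFile, new GenericDatumReader<GenericRecord>());
        Schema schema = reader.getSchema();

        // Note the Path parameter: this is exactly why the writer can't be
        // plugged into Flume's OutputStream-based serializer interface.
        AvroParquetWriter<GenericRecord> writer =
            new AvroParquetWriter<>(parquetFile, schema);

        for (GenericRecord record : reader) {
          writer.write(record);
        }

        writer.close();
        reader.close();
      }
    }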

If you really need to write Parquet from Flume, take a look at the Kite DatasetSink instead of the HDFS sink; it can write to a Parquet dataset directly (see the config sketch below).
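
For reference, the sink configuration might look something like this. The agent, sink, and channel names plus the dataset URI are hypothetical, and in Flume releases that predate the kite.dataset.uri property you'd set kite.repo.uri and kite.dataset.name instead:

    # Hypothetical names; the dataset must already exist (created with the
    # Kite CLI or API) with Parquet as the format in its descriptor.
    agent.sinks.parquet.type = org.apache.flume.sink.kite.DatasetSink
    agent.sinks.parquet.channel = fileChannel
    agent.sinks.parquet.kite.dataset.uri = dataset:hdfs:/datasets/events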

rb

On 10/26/2015 11:29 PM, [email protected] wrote:

hi all,
     I want to have the Flume sink write in Parquet format via the 
serializer, but the Parquet writer constructor needs a Path parameter, while 
the Flume serializer only provides an OutputStream interface. I don't know 
how to solve this. Can anyone give me a sample? Thanks.


[email protected]



--
Ryan Blue
Software Engineer
Cloudera, Inc.
