I wouldn't recommend writing directly from Flume to Parquet. Parquet
can't guarantee that data is on disk until a file is closed, so you end
up with long-running transactions that back up into your file channel.
And if you're writing to a partitioned dataset, you keep several files
open at once, which drives memory consumption way up. I recommend
writing to Avro first and then using a batch job to convert to Parquet.
If you really need to write directly to Parquet, take a look at the Kite
DatasetSink instead of the HDFS sink. It manages the Parquet files and
their paths itself, so you never have to go through the serializer's
OutputStream interface.
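
For reference, the sink config looks roughly like this. The agent,
channel, and dataset URI are placeholders, and the target dataset has
to exist already and be created with the Parquet format (e.g. with the
kite-dataset CLI); see the Flume 1.6 user guide for the full property
list:

  a1.sinks.k1.type = org.apache.flume.sink.kite.DatasetSink
  a1.sinks.k1.channel = c1
  a1.sinks.k1.kite.dataset.uri = dataset:hdfs:/datasets/events
  a1.sinks.k1.kite.batchSize = 1000
  a1.sinks.k1.kite.rollInterval = 30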
rb
On 10/26/2015 11:29 PM, [email protected] wrote:
hi all,
I want the Flume sink to write in Parquet format during serialization,
but the Parquet writer constructor needs a path parameter, while the
Flume serializer only provides an OutputStream interface. I don't know
how to solve this. Can anyone give me a sample? Thanks.
[email protected]
--
Ryan Blue
Software Engineer
Cloudera, Inc.