Hello Community,

Apache Parquet <https://parquet.apache.org/documentation/latest/> is a column-oriented binary file format designed to be extremely efficient and interoperable across the Hadoop ecosystem. It integrates with most of the Hadoop processing frameworks (Impala, Hive, Pig, Spark, ...) and serialization models (Thrift, Avro, Protobuf), making it easy to use in ETL and processing pipelines.
Having an operator to write data to Parquet files would certainly be a good addition to the Malhar library.

The underlying implementation <http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/> for writing data as Parquet requires a subclass of parquet.hadoop.api.WriteSupport that knows how to take an in-memory object and write Parquet primitives through parquet.io.api.RecordConsumer. Several WriteSupport implementations already exist, including ThriftWriteSupport, AvroWriteSupport and ProtoWriteSupport. A WriteSupport implementation is then wrapped in a ParquetWriter object for writing. ParquetWriter does not expose a handle to the underlying stream, so in order to write data to a Parquet file, all the records that belong to the file must be buffered in memory; they are then compressed and flushed to the file.

To start with, we could support the following features in the operator:

- *Ability to provide a WriteSupport implementation*: The user should be able to use an existing implementation of parquet.hadoop.api.WriteSupport or provide his/her own.
- *Ability to configure page size*: The amount of uncompressed data for a single column that is read before it is compressed as a unit and buffered in memory to be written out as a "page". Default value: 1 MB.
- *Ability to configure Parquet block size*: The amount of compressed data that should be buffered in memory before a row group is written out to disk. Larger block sizes require more memory to buffer the data; 128 MB or 256 MB is recommended.
- *Flushing files periodically*: The operator would have to periodically flush files to a specified directory as per the configured block size. The flush trigger could be time-based, event-count-based or size-based.

To implement the operator, here's one approach:

1. Extend the existing AbstractFileOutputOperator.
2. Provide methods to plug in a WriteSupport implementation.
3. In the process method, hold the data in memory until a configured size is reached, and then flush the contents to a file during endWindow().

Two rough sketches follow: the first shows how a WriteSupport implementation, page size and block size come together in a ParquetWriter today, and the second outlines what the operator itself might look like.
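As a concrete illustration of the writer API described above, here is a minimal sketch that writes Avro records through AvroParquetWriter, which internally wraps AvroWriteSupport in a ParquetWriter. The schema, output path and codec are made up for the example:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetWriter;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch
{
  public static void main(String[] args) throws Exception
  {
    // Example schema; in the operator this would come from configuration.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // AvroParquetWriter wraps AvroWriteSupport in a ParquetWriter.
    // Block (row group) size and page size are fixed at construction time;
    // the writer buffers a full row group in memory before flushing it.
    ParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
        new Path("/tmp/events.parquet"),   // made-up output path
        schema,
        CompressionCodecName.SNAPPY,
        ParquetWriter.DEFAULT_BLOCK_SIZE,  // 128 MB row groups
        ParquetWriter.DEFAULT_PAGE_SIZE);  // 1 MB pages

    GenericRecord record = new GenericData.Record(schema);
    record.put("id", 1L);
    record.put("name", "first-event");
    writer.write(record);

    // The footer is written on close(); the file is not readable before this.
    writer.close();
  }
}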
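And here is a very rough, hypothetical skeleton of the proposed operator. To keep it short, it is written as a standalone BaseOperator rather than on top of AbstractFileOutputOperator, uses a made-up count-based flush trigger, and ignores fault tolerance completely; property names like filePath and maxRecordsPerFile are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetWriter;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;

import com.datatorrent.api.BaseOperator;
import com.datatorrent.api.Context;
import com.datatorrent.api.DefaultInputPort;

/**
 * Hypothetical sketch only: buffers incoming records and rolls a new
 * Parquet file whenever enough records have accumulated. Recovery,
 * idempotency and pluggable WriteSupport are left out for brevity.
 */
public class ParquetFileOutputOperator extends BaseOperator
{
  private String filePath = "/tmp/parquet-out"; // placeholder output directory
  private String schemaString;                  // Avro schema, set as a property
  private int maxRecordsPerFile = 100000;       // placeholder flush trigger
  private int blockSize = ParquetWriter.DEFAULT_BLOCK_SIZE; // 128 MB row groups
  private int pageSize = ParquetWriter.DEFAULT_PAGE_SIZE;   // 1 MB pages

  private transient Schema schema;
  private transient List<GenericRecord> buffer;
  private transient int part = 0;

  public final transient DefaultInputPort<GenericRecord> input =
      new DefaultInputPort<GenericRecord>()
  {
    @Override
    public void process(GenericRecord tuple)
    {
      buffer.add(tuple); // hold records in memory until the next flush
    }
  };

  @Override
  public void setup(Context.OperatorContext context)
  {
    schema = new Schema.Parser().parse(schemaString);
    buffer = new ArrayList<GenericRecord>();
  }

  @Override
  public void endWindow()
  {
    // Flush on a window boundary once the count threshold is crossed;
    // this could equally be time-based or size-based.
    if (buffer.size() >= maxRecordsPerFile) {
      flush();
    }
  }

  private void flush()
  {
    try {
      ParquetWriter<GenericRecord> writer = new AvroParquetWriter<GenericRecord>(
          new Path(filePath, "part-" + (part++) + ".parquet"),
          schema, CompressionCodecName.SNAPPY, blockSize, pageSize);
      for (GenericRecord record : buffer) {
        writer.write(record);
      }
      writer.close(); // footer written here; only now is the file readable
      buffer.clear();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  // getters/setters for the properties omitted
}

The skeleton also makes the open question visible: nothing in a Parquet file is readable until close() writes the footer, so everything buffered since the last flush is lost on failure, which is exactly why I am asking about the recovery mechanisms below.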
Please send across your thoughts on this. I would also like to know whether we would be able to leverage the recovery mechanisms provided by AbstractFileOutputOperator with this approach.

Thanks,
Shubham