Hi Rafeeq, I've added answers below.
On 08/12/2014 12:28 AM, rafeeq s wrote:
I am new to Parquet and am using the parquet format to store Spark Streaming data in HDFS. Questions: 1. Is it possible to merge two small parquet files?
There isn't a quick solution like concatenating the files, if that's what you're looking for. You'd have to rewrite the two files as a single new one.
As long as you're rewriting the file anyway, you might want to consider staging the data in a different format and then compacting it into parquet periodically. Avro, for example, would let you use its flush and sync methods to guarantee records are on disk, and a later conversion to parquet would give you the I/O and encoding benefits.
2. Partitioning directory structure: Is it possible to partition the parquet file directory based on date?
Creating files in a partitioned structure isn't supported by parquet itself, but once the files are in a partitioned layout, the input format will walk the directory tree and find all of them.
In parquet-avro, you pass in a path when creating a parquet file, so while you'd have to build some of the partitioning yourself, it isn't too difficult. I'm most familiar with the parquet-avro API, but you could probably build a partitioned structure with the other modules, too.
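To make the "build it yourself" part concrete, here's a small sketch of the Hive-style date layout (year=/month=/day=) that partition-aware input formats discover by walking the directory tree. The base path and field choices are just illustrative; in parquet-avro you'd hand the resulting path to the writer, but the layout itself is only a naming convention, not part of the parquet format:

```python
from datetime import datetime, timezone

def partition_path(base, ts):
    """Build a Hive-style date-partitioned path under `base`.

    The key=value directory names are a convention that partition-aware
    readers (Hive, Spark SQL, etc.) recognize when listing files.
    """
    return "{0}/year={1:04d}/month={2:02d}/day={3:02d}".format(
        base, ts.year, ts.month, ts.day)

# Hypothetical base path; you'd pass the result to your parquet writer.
path = partition_path("hdfs://nn/data/events",
                      datetime(2014, 8, 12, tzinfo=timezone.utc))
print(path)  # hdfs://nn/data/events/year=2014/month=08/day=12
```

Each incoming record's timestamp picks the directory its file lands in; compaction then just rewrites the small files within one partition directory.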
(Disclosure: I work on Kite...) It sounds like what you're looking for is probably a library built on top of parquet that acts more like a data store than a file format. You might want to check out the Kite project [1], which does both of the things you're asking about. Specifically, you can use a config file to define your partition layout and a simple API to automatically select partitions.
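For a sense of what that config looks like, a Kite partition strategy is a JSON list of partitioners applied in order; a daily layout derived from a timestamp field would look roughly like this (the field name `created_at` is just an example, and you should check the Kite docs linked below for the exact format):

```
[
  {"type": "year",  "source": "created_at"},
  {"type": "month", "source": "created_at"},
  {"type": "day",   "source": "created_at"}
]
```

Kite then routes each record to the right partition automatically when you write it.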
rb

[1]: http://kitesdk.org/docs/current/guide/

--
Ryan Blue
Software Engineer
Cloudera, Inc.
