In-lined some comments.
On Tue, Oct 11, 2016 at 5:38 PM, James Pirz <james.p...@gmail.com> wrote:
> I am using C++ and I need to convert a corpus of JSON documents, whose
> schema is not fixed/known in advance, into Parquet format for efficient
> processing/storage. I have gone through a number of examples and test-cases
> to get an idea about the best way to do it, however I am still confused. I
> believe I need to use ParquetWriter and ParquetReader and I am basically
> trying to understand:
> 1- Is it really a requirement to use Avro, Thrift or Protobuf for this
> purpose (all the examples seem to use one of them) ? I know the schema info
> needs to be stored in the footer of Parquet files, but does it mean that I
> need to know the schema ahead of time and do I have to use one of those 3
> to store an in-memory representation of my objects or Can I directly feed
> the parsed JSON docs into a ParquetWriter ? (Using Avro, Thrift or Protobuf
> creates extra dependency which I am really trying to avoid).
Parquet is a structured columnar file format, so you will need to
specify the schema before you start writing a file.
You can feed typed data directly into C++ ParquetWriter. In your case,
you will need to first extract "data and type information" using JSON
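As a rough sketch of that extraction pass, assuming the documents are flat
(no nested objects or arrays) and have already been parsed: the snippet below
scans all documents once and derives one Parquet physical type per field.
None of this is parquet-cpp API. JsonScalar, JsonDoc, TypeOf, Widen and
InferSchema are hypothetical names made up for illustration, and widening a
mix of integers and doubles to DOUBLE is my own assumption about how you
would want to resolve that case.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one parsed, flat JSON document:
// field name -> scalar value. A real JSON library would supply this.
using JsonScalar = std::variant<bool, std::int64_t, double, std::string>;
using JsonDoc = std::map<std::string, JsonScalar>;

// The subset of Parquet physical types that flat JSON scalars map onto.
enum class PhysicalType { BOOLEAN, INT64, DOUBLE, BYTE_ARRAY };

// Inferred schema: field name -> physical type.
using Schema = std::map<std::string, PhysicalType>;

// Map a single parsed value to a physical type.
PhysicalType TypeOf(const JsonScalar& value) {
  if (std::holds_alternative<bool>(value)) return PhysicalType::BOOLEAN;
  if (std::holds_alternative<std::int64_t>(value)) return PhysicalType::INT64;
  if (std::holds_alternative<double>(value)) return PhysicalType::DOUBLE;
  return PhysicalType::BYTE_ARRAY;  // strings
}

// Reconcile two observations of the same field.
// Assumption: INT64 mixed with DOUBLE widens to DOUBLE; any other
// mismatch is reported as an error instead of being guessed at.
PhysicalType Widen(PhysicalType a, PhysicalType b) {
  if (a == b) return a;
  const bool numeric_mix =
      (a == PhysicalType::INT64 && b == PhysicalType::DOUBLE) ||
      (a == PhysicalType::DOUBLE && b == PhysicalType::INT64);
  if (numeric_mix) return PhysicalType::DOUBLE;
  throw std::runtime_error("conflicting types for the same field");
}

// First pass over the corpus: collect every field name and its type.
Schema InferSchema(const std::vector<JsonDoc>& docs) {
  Schema schema;
  for (const JsonDoc& doc : docs) {
    for (const auto& [name, value] : doc) {
      const PhysicalType seen = TypeOf(value);
      auto it = schema.find(name);
      if (it == schema.end()) {
        schema.emplace(name, seen);
      } else {
        it->second = Widen(it->second, seen);
      }
    }
  }
  return schema;
}
```

The resulting map is what you would turn into the column definitions for the
writer in a second pass. Fields that are missing from some documents would
have to be declared optional.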
> 2- Almost all the examples I found are described in Java. I am using C++
> and I am really looking for an example in that context. I have looked at
> a couple of test-cases under parquet-cpp repo, however I am just wondering
> if a succinct example is available in C++ to get an idea for such a
We are in the process of providing a detailed example of how to write
and read Parquet files using C++.
The Jira tracking this is https://issues.apache.org/jira/browse/PARQUET-702
I expect this Jira to be completed very soon.
For now, the test cases are your best source.
> Any hint or suggestion would be highly appreciated.