Hello Sandeep,

We don't require the same class structure as in parquet-mr. Ideally they would be very similar, but they may differ. Some of parquet-mr's interfaces are specifically tailored to fit Hadoop, whereas we don't have this requirement in the C++ implementation. Still, the interfaces should be suitable for more generic record conversion. Depending on whether you know the structure of your records at compile time, using std::tuple<..> might be a good option. If you don't know the structure beforehand, we need a more dynamic interface. I would be happy to guide you a bit in implementing this API in parquet-cpp.
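To make the std::tuple idea a bit more concrete, here is a minimal sketch of how compile-time records could be routed to per-column writers. Note that ColumnSink and WriteRecord are made-up names for illustration only; a real implementation would forward each field to the matching parquet::ColumnWriter rather than printing it:

#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical per-column sink, for illustration only. A real implementation
// would forward each value to the corresponding parquet::ColumnWriter
// instead of printing it.
struct ColumnSink {
  void Write(int64_t v)            { std::cout << "int64:  " << v << "\n"; }
  void Write(double v)             { std::cout << "double: " << v << "\n"; }
  void Write(const std::string& v) { std::cout << "string: " << v << "\n"; }
};

// Visit each element of a std::tuple record at compile time and route it to
// the sink for its column. The record structure is fixed by the tuple type.
template <typename Tuple, std::size_t... I>
void WriteRecordImpl(const Tuple& record, ColumnSink* sinks,
                     std::index_sequence<I...>) {
  // Pack expansion: one Write() call per tuple element / column.
  (void)std::initializer_list<int>{
      (sinks[I].Write(std::get<I>(record)), 0)...};
}

template <typename... Fields>
void WriteRecord(const std::tuple<Fields...>& record, ColumnSink* sinks) {
  WriteRecordImpl(record, sinks, std::index_sequence_for<Fields...>{});
}

int main() {
  // A record whose structure is known at compile time.
  std::tuple<int64_t, double, std::string> row{42, 3.14, "hello"};
  ColumnSink sinks[3];  // one sink per column
  WriteRecord(row, sinks);
  return 0;
}

The point is simply that with std::tuple the column types are known statically, so the per-field dispatch is resolved at compile time and costs nothing at runtime.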
As far as I understand you, you are only looking for the path records -> Parquet?

Uwe

On Thu, Nov 2, 2017, at 04:44 PM, Sandeep Joshi wrote:
> Hi Wes
>
> We have a rough implementation which does this conversion from
> (currently) rapidjson to parquet that we could contribute.
> It will need a shepherd/guide to ensure it aligns with the parquet-cpp
> implementation standards.
>
> Does the class structure in parquet-cpp have to be in one-to-one
> correspondence with parquet-mr?
>
> I noticed that the parquet-mr Record Conversion API has abstract classes
> like WriteSupport, ReadSupport, PrimitiveConverter, GroupConverter,
> RecordMaterializer, ParquetInputFormat, and ParquetOutputFormat, which
> have to be implemented. I saw that these classes are currently
> implemented by the avro, thrift, and protobuf converters (e.g.
> https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro)
>
> Would the parquet-cpp framework require the exact same framework?
>
> -Sandeep
>
> On Thu, Nov 2, 2017 at 8:27 PM, Wes McKinney <[email protected]> wrote:
>
> > hi Sandeep,
> >
> > This is more than welcome to be implemented, though I personally have
> > no need for it (I almost exclusively work with columnar data / Arrow).
> > In addition to implementing the decoding to records, we would need to
> > define a suitable record data structure in C++, which is a decent
> > amount of work.
> >
> > - Wes
> >
> > On Thu, Nov 2, 2017 at 3:38 AM, Sandeep Joshi <[email protected]> wrote:
> > > The parquet-mr version has the Record Conversion API
> > > (RecordMaterializer, RecordConsumer), which can be used to convert
> > > rows/tuples to and from the Parquet columnar format.
> > >
> > > https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
> > >
> > > Are there any plans to add the same functionality to the parquet-cpp
> > > codebase?
> > >
> > > I checked the JIRA and couldn't find any outstanding issue, although
> > > the github README does say "The 3rd layer would handle
> > > reading/writing records."
> > > https://github.com/apache/parquet-cpp/blob/master/README.md/
> > >
> > > -Sandeep
