thanks Uwe! I will go with Arrow

On Fri, Nov 3, 2017 at 11:01 PM, Uwe L. Korn <[email protected]> wrote:
> Hello,
>
> the Arrow API in parquet-cpp is a much more convenient API for Parquet
> C++ users. It is tailored for columnar reads & writes but gives you a
> high-level interface. We use it either to interact with Pandas or to pull
> data from/to the database using Turbodbc. If you can afford, memory-wise,
> to load all your data into RAM, it might be simpler for you to convert
> the data to Arrow and then use the Arrow API. For Arrow we have
> implemented the state machine for the creation of definition and
> repetition levels in
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L67-L314
> [sketches of this write path appear after the thread]
>
> Uwe
>
> On Fri, Nov 3, 2017, at 05:16 AM, Sandeep Joshi wrote:
> > Uwe,
> >
> > > As far as I understand you, you are only looking for the path
> > > records->Parquet?
> >
> > Yes. Btw, I am just curious about the Arrow API in parquet-cpp.
> >
> > If I first convert the records to Arrow and then to Parquet, will nested
> > schemas work?
> >
> > While converting from Parquet to records, you need to build an FSM for
> > reassembly to handle the definition level and repetition level vectors.
> > Where does this happen when you convert from Parquet to Arrow to some
> > JSON record?
> > My questions are specific to the cpp versions of Arrow and Parquet.
> >
> > -Sandeep
> >
> > On Thu, Nov 2, 2017 at 11:07 PM, Uwe L. Korn <[email protected]> wrote:
> > >
> > > Hello Sandeep,
> > >
> > > we don't require the same class structure as in parquet-mr. Preferably
> > > they are very similar, but they may differ. Some of parquet-mr's
> > > interfaces are specifically tailored to fit Hadoop, whereas we don't
> > > have this requirement in the C++ implementation. Still, the interfaces
> > > should be suitable for more generic record conversion. Depending on
> > > whether you know the structure of your records at compile time, using
> > > std::tuple<..> might be a good option. If you don't know the structure
> > > beforehand, we need a more dynamic interface. I would be happy to
> > > guide you a bit to implement this API in parquet-cpp.
> > >
> > > As far as I understand you, you are only looking for the path
> > > records->Parquet?
> > >
> > > Uwe
> > >
> > > On Thu, Nov 2, 2017, at 04:44 PM, Sandeep Joshi wrote:
> > > > Hi Wes,
> > > >
> > > > We have a rough implementation which does this conversion from
> > > > (currently) rapidjson to parquet that we could contribute.
> > > > It will need a shepherd/guide to ensure it aligns with the
> > > > parquet-cpp implementation standards.
> > > >
> > > > Does the class structure in parquet-cpp have to be in one-to-one
> > > > correspondence with parquet-mr?
> > > >
> > > > I noticed that the parquet-mr Record Conversion API has abstract
> > > > classes like WriteSupport, ReadSupport, PrimitiveConverter,
> > > > GroupConverter, RecordMaterializer, ParquetInputFormat, and
> > > > ParquetOutputFormat which have to be implemented. I saw that these
> > > > classes are currently implemented by the avro, thrift and protobuf
> > > > converters (e.g.
> > > > https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro
> > > > )
> > > >
> > > > Would the parquet-cpp framework require the exact same structure?
> > > >
> > > > -Sandeep
> > > >
> > > > On Thu, Nov 2, 2017 at 8:27 PM, Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > hi Sandeep,
> > > > >
> > > > > This is more than welcome to be implemented, though I personally
> > > > > have no need for it (I almost exclusively work with columnar data
> > > > > / Arrow). In addition to implementing the decoding to records, we
> > > > > would need to define a suitable record data structure in C++,
> > > > > which is a decent amount of work.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Nov 2, 2017 at 3:38 AM, Sandeep Joshi <[email protected]> wrote:
> > > > > > The parquet-mr version has the Record Conversion API
> > > > > > (RecordMaterializer, RecordConsumer) which can be used to
> > > > > > convert rows/tuples to and from the Parquet columnar format.
> > > > > >
> > > > > > https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
> > > > > >
> > > > > > Are there any plans to add the same functionality to the
> > > > > > parquet-cpp codebase?
> > > > > >
> > > > > > I checked the JIRA and couldn't find any outstanding issue,
> > > > > > although the github README does say "The 3rd layer would handle
> > > > > > reading/writing records."
> > > > > > https://github.com/apache/parquet-cpp/blob/master/README.md
> > > > > >
> > > > > > -Sandeep
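Below are three short sketches of the paths discussed in this thread. First, the records->Arrow->Parquet route Uwe recommends: accumulate column values with Arrow builders, assemble an arrow::Table, and hand it to parquet::arrow::WriteTable. This is a minimal sketch, not a definitive implementation; the file name and column names are placeholders, and the Open()/builder signatures shown match the parquet-cpp era of this thread (later Arrow releases moved some of these calls to Result-returning forms).

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>
    #include <parquet/exception.h>

    #include <memory>

    void WriteRecordsAsParquet() {
      // One builder per column; the values would come from your record
      // source (e.g. the rapidjson documents mentioned above).
      arrow::Int64Builder id_builder;
      arrow::StringBuilder name_builder;
      PARQUET_THROW_NOT_OK(id_builder.Append(1));
      PARQUET_THROW_NOT_OK(name_builder.Append("alice"));

      std::shared_ptr<arrow::Array> ids, names;
      PARQUET_THROW_NOT_OK(id_builder.Finish(&ids));
      PARQUET_THROW_NOT_OK(name_builder.Finish(&names));

      auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                   arrow::field("name", arrow::utf8())});
      auto table = arrow::Table::Make(schema, {ids, names});

      // WriteTable drives the definition/repetition-level state machine
      // in parquet/arrow/writer.cc, so nested Arrow types (lists,
      // structs) are handled on this path as well.
      std::shared_ptr<arrow::io::FileOutputStream> outfile;
      PARQUET_THROW_NOT_OK(
          arrow::io::FileOutputStream::Open("records.parquet", &outfile));
      PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
          *table, arrow::default_memory_pool(), outfile, /*chunk_size=*/1024));
    }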
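Second, the definition/repetition levels Sandeep asks about. When writing through the Arrow layer they are produced by the state machine Uwe links; when using the low-level column writer you supply them yourself. A worked example for a nullable list-of-int64 column, assuming a parquet::Int64Writer obtained from a row-group writer (file and row-group setup elided to keep the sketch short):

    #include <parquet/api/writer.h>

    // Schema assumed for this column:
    //   message schema {
    //     optional group values (LIST) {
    //       repeated group list { optional int64 element; }
    //     }
    //   }
    // Max definition level 3, max repetition level 1. The three records
    //   [1, 2]    []    null
    // flatten to the vectors below. Only defined leaves appear in
    // `values`; the levels encode the nesting:
    //   def 3 = element present, def 2 = element null (unused here),
    //   def 1 = list present but empty, def 0 = list itself is null;
    //   rep 0 = starts a new record, rep 1 = continues the current list.
    void WriteThreeRecords(parquet::Int64Writer* writer) {
      int64_t values[] = {1, 2};
      int16_t def_levels[] = {3, 3, 1, 0};
      int16_t rep_levels[] = {0, 1, 0, 0};
      // The first argument counts level entries, not leaf values.
      writer->WriteBatch(4, def_levels, rep_levels, values);
    }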
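Finally, a sketch of the compile-time direction Uwe mentions for records whose structure is known at compile time. ArrowBuilderFor and TupleRecordWriter are invented here for illustration; nothing with these names exists in Arrow or parquet-cpp, and default-constructed builders assume a recent Arrow where they pick up the default memory pool. The point is only that std::tuple lets the set of column builders be derived from the record type:

    #include <arrow/api.h>

    #include <cstdint>
    #include <initializer_list>
    #include <string>
    #include <tuple>
    #include <utility>

    // Hypothetical mapping from C++ field types to Arrow builders.
    template <typename T> struct ArrowBuilderFor;
    template <> struct ArrowBuilderFor<int64_t>     { using type = arrow::Int64Builder;  };
    template <> struct ArrowBuilderFor<double>      { using type = arrow::DoubleBuilder; };
    template <> struct ArrowBuilderFor<std::string> { using type = arrow::StringBuilder; };

    // Holds one builder per tuple element; Append() fans a record's
    // fields out to the matching builders.
    template <typename... Fields>
    class TupleRecordWriter {
     public:
      void Append(const std::tuple<Fields...>& record) {
        AppendImpl(record, std::index_sequence_for<Fields...>{});
      }

     private:
      template <std::size_t... I>
      void AppendImpl(const std::tuple<Fields...>& record,
                      std::index_sequence<I...>) {
        // Expand over all field indices; Append() statuses are discarded
        // to keep the sketch short.
        (void)std::initializer_list<int>{
            ((void)std::get<I>(builders_).Append(std::get<I>(record)), 0)...};
      }

      std::tuple<typename ArrowBuilderFor<Fields>::type...> builders_;
    };

    int main() {
      TupleRecordWriter<int64_t, std::string> writer;
      writer.Append({1, "alice"});
      writer.Append({2, "bob"});
      // Finishing each builder into an arrow::Array and assembling an
      // arrow::Table would complete the path shown in the first sketch.
    }

A dynamic interface for records whose schema is only known at runtime (the rapidjson case) could not use this trick and would instead need converter-style classes along the lines of the parquet-mr API discussed above.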
