Hello Sandeep,

We don't require the same class structure as in parquet-mr. Ideally they would be very similar, but they may differ. Some of parquet-mr's interfaces are specifically tailored to fit Hadoop, whereas we don't have this requirement in the C++ implementation. Still, the interfaces should be suitable for more generic record conversion. Depending on whether you know the structure of your records at compile time, using std::tuple<..> might be a good option. If you don't know the structure beforehand, we need a more dynamic interface. I would be happy to guide you a bit in implementing this API in parquet-cpp.
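To make the std::tuple idea a bit more concrete, here is a minimal sketch of how compile-time records could be routed to per-column writers. Note that ColumnSink and WriteRecord are made-up names for illustration only; a real implementation would forward each field to the matching parquet::ColumnWriter rather than printing it:

#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical per-column sink, for illustration only. A real implementation
// would forward each value to the corresponding parquet::ColumnWriter
// instead of printing it.
struct ColumnSink {
  void Write(int64_t v)            { std::cout << "int64:  " << v << "\n"; }
  void Write(double v)             { std::cout << "double: " << v << "\n"; }
  void Write(const std::string& v) { std::cout << "string: " << v << "\n"; }
};

// Visit each element of a std::tuple record at compile time and route it to
// the sink for its column. The record structure is fixed by the tuple type.
template <typename Tuple, std::size_t... I>
void WriteRecordImpl(const Tuple& record, ColumnSink* sinks,
                     std::index_sequence<I...>) {
  // Pack expansion: one Write() call per tuple element / column.
  (void)std::initializer_list<int>{
      (sinks[I].Write(std::get<I>(record)), 0)...};
}

template <typename... Fields>
void WriteRecord(const std::tuple<Fields...>& record, ColumnSink* sinks) {
  WriteRecordImpl(record, sinks, std::index_sequence_for<Fields...>{});
}

int main() {
  // A record whose structure is known at compile time.
  std::tuple<int64_t, double, std::string> row{42, 3.14, "hello"};
  ColumnSink sinks[3];  // one sink per column
  WriteRecord(row, sinks);
  return 0;
}

The point is simply that with std::tuple the column types are known statically, so the per-field dispatch is resolved at compile time and costs nothing at runtime.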
As far as I understand you, you are only looking for the path records -> Parquet?

Uwe

On Thu, Nov 2, 2017, at 04:44 PM, Sandeep Joshi wrote:
> Hi Wes
>
> We have a rough implementation which does this conversion from
> (currently) rapidjson to parquet that we could contribute.
> It will need a shepherd/guide to ensure it aligns with the parquet-cpp
> implementation standards.
>
> Does the class structure in parquet-cpp have to be in one-to-one
> correspondence with parquet-mr?
>
> I noticed that the parquet-mr Record Conversion API has abstract classes
> like WriteSupport, ReadSupport, PrimitiveConverter, GroupConverter,
> RecordMaterializer, ParquetInputFormat, and ParquetOutputFormat, which
> have to be implemented. I saw that these classes are currently
> implemented by the avro, thrift, and protobuf converters (e.g.
> https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro)
>
> Would the parquet-cpp framework require the exact same framework?
>
> -Sandeep
>
> On Thu, Nov 2, 2017 at 8:27 PM, Wes McKinney <[email protected]> wrote:
>
> > hi Sandeep,
> >
> > This is more than welcome to be implemented, though I personally have
> > no need for it (I almost exclusively work with columnar data / Arrow).
> > In addition to implementing the decoding to records, we would need to
> > define a suitable record data structure in C++, which is a decent
> > amount of work.
> >
> > - Wes
> >
> > On Thu, Nov 2, 2017 at 3:38 AM, Sandeep Joshi <[email protected]> wrote:
> > > The parquet-mr version has the Record Conversion API
> > > (RecordMaterializer, RecordConsumer), which can be used to convert
> > > rows/tuples to and from the Parquet columnar format.
> > >
> > > https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
> > >
> > > Are there any plans to add the same functionality to the parquet-cpp
> > > codebase?
> > >
> > > I checked the JIRA and couldn't find any outstanding issue, although
> > > the github README does say "The 3rd layer would handle
> > > reading/writing records."
> > > https://github.com/apache/parquet-cpp/blob/master/README.md/
> > >
> > > -Sandeep
