Hello,

the Arrow API in parquet-cpp is a much more convenient API for Parquet C++ users. It is tailored for columnar reads & writes but gives you a high-level interface. We use it either to interact with Pandas or to move data from/to the database using Turbodbc. If you can afford, memory-wise, to load all your data into RAM, it might be simpler for you to convert the data to Arrow first and then use the Arrow API. For Arrow we have implemented the state machine for the creation of definition and repetition levels in https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L67-L314
Uwe

On Fri, Nov 3, 2017, at 05:16 AM, Sandeep Joshi wrote:
> Uwe,
>
> >> As far as I understand you, you only are looking for the path
> >> records->Parquet?
>
> Yes. Btw, I am just curious about the Arrow API in parquet-cpp.
>
> If I first convert the records to Arrow and then Parquet, will nested
> schemas work ?
>
> While converting from Parquet to records, you need to build an FSM for
> reassembly to handle definition level and repetition level vectors.
> Where does this happen when you convert from Parquet to Arrow to some
> json record ?
> My questions are specific to the cpp version of Arrow and Parquet.
>
> -Sandeep
>
> On Thu, Nov 2, 2017 at 11:07 PM, Uwe L. Korn <[email protected]> wrote:
> >
> > Hello Sandeep,
> >
> > we don't require the same class structure as in parquet-mr. Preferably
> > they are very similar but they may differ. Some of parquet-mr's
> > interfaces are specifically tailored to fit Hadoop whereas we don't have
> > this requirement in the C++ implementation. Still, the interfaces should
> > be suitable for more generic record conversion. Depending on if you know
> > the structure of your records at compile time, using std::tuple<..>
> > might be a good option. If you don't know the structure beforehand, we
> > need a more dynamic interface. I would be happy to guide you a bit to
> > implement this API in parquet-cpp.
> >
> > As far as I understand you, you only are looking for the path
> > records->Parquet?
> >
> > Uwe
> >
> > On Thu, Nov 2, 2017, at 04:44 PM, Sandeep Joshi wrote:
> > > Hi Wes
> > >
> > > We have a rough implementation which does this conversion from
> > > (currently) rapidjson to parquet that we could contribute.
> > > It will need a shepherd/guide to ensure it aligns with the parquet-cpp
> > > implementation standards.
> > >
> > > Does the class structure in parquet-cpp have to be in one-to-one
> > > correspondence with the parquet-mr ?
> > >
> > > I noticed that parquet-mr Record Conversion API has abstract classes
> > > like WriteSupport, ReadSupport, PrimitiveConverter, GroupConverter,
> > > RecordMaterializer, ParquetInputFormat, ParquetOutputFormat
> > > which have to be implemented. I saw that these classes are currently
> > > defined by avro, thrift and protobuf converters (e.g.
> > > https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro)
> > >
> > > Would the parquet-cpp framework require the exact same framework ?
> > >
> > > -Sandeep
> > >
> > > On Thu, Nov 2, 2017 at 8:27 PM, Wes McKinney <[email protected]> wrote:
> > >
> > > > hi Sandeep,
> > > >
> > > > This is more than welcome to be implemented, though I personally have
> > > > no need for it (almost exclusively work with columnar data / Arrow).
> > > > In addition to implementing the decoding to records, we would need to
> > > > define a suitable record data structure in C++ which is a decent
> > > > amount of work.
> > > >
> > > > - Wes
> > > >
> > > > On Thu, Nov 2, 2017 at 3:38 AM, Sandeep Joshi <[email protected]> wrote:
> > > > > The parquet-mr version has the Record Conversion API
> > > > > (RecordMaterializer, RecordConsumer) which can be used to convert
> > > > > to and from rows/tuples into the Parquet columnar format.
> > > > >
> > > > > https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
> > > > >
> > > > > Are there any plans to add the same functionality to the parquet-cpp
> > > > > codebase ?
> > > > >
> > > > > I checked the JIRA and couldn't find any outstanding issue although
> > > > > the github README does say "The 3rd layer would handle
> > > > > reading/writing records."
> > > > > https://github.com/apache/parquet-cpp/blob/master/README.md/
> > > > >
> > > > > -Sandeep
