thanks Uwe! I will go with Arrow

On Fri, Nov 3, 2017 at 11:01 PM, Uwe L. Korn <[email protected]> wrote:
> Hello,
>
> the Arrow API in parquet-cpp is a much more convenient API for Parquet
> C++ users. It is tailored for columnar reads & writes but gives you a
> high-level interface. We use it either to interact with Pandas or to pull
> data from/to the database using Turbodbc. If you can afford, memory-wise,
> to load all your data into RAM, it might be simpler for you to convert
> the data to Arrow and then use the Arrow API. For Arrow we have
> implemented the state machine for the creation of definition and
> repetition levels in
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L67-L314
> [sketches of this write path appear after the thread]
>
> Uwe
>
> On Fri, Nov 3, 2017, at 05:16 AM, Sandeep Joshi wrote:
> > Uwe,
> >
> > > As far as I understand you, you are only looking for the path
> > > records->Parquet?
> >
> > Yes. Btw, I am just curious about the Arrow API in parquet-cpp.
> >
> > If I first convert the records to Arrow and then to Parquet, will nested
> > schemas work?
> >
> > While converting from Parquet to records, you need to build an FSM for
> > reassembly to handle the definition level and repetition level vectors.
> > Where does this happen when you convert from Parquet to Arrow to some
> > JSON record?
> > My questions are specific to the cpp versions of Arrow and Parquet.
> >
> > -Sandeep
> >
> > On Thu, Nov 2, 2017 at 11:07 PM, Uwe L. Korn <[email protected]> wrote:
> > >
> > > Hello Sandeep,
> > >
> > > we don't require the same class structure as in parquet-mr. Preferably
> > > they are very similar, but they may differ. Some of parquet-mr's
> > > interfaces are specifically tailored to fit Hadoop, whereas we don't
> > > have this requirement in the C++ implementation. Still, the interfaces
> > > should be suitable for more generic record conversion. Depending on
> > > whether you know the structure of your records at compile time, using
> > > std::tuple<..> might be a good option. If you don't know the structure
> > > beforehand, we need a more dynamic interface. I would be happy to
> > > guide you a bit to implement this API in parquet-cpp.
> > >
> > > As far as I understand you, you are only looking for the path
> > > records->Parquet?
> > >
> > > Uwe
> > >
> > > On Thu, Nov 2, 2017, at 04:44 PM, Sandeep Joshi wrote:
> > > > Hi Wes,
> > > >
> > > > We have a rough implementation which does this conversion from
> > > > (currently) rapidjson to parquet that we could contribute.
> > > > It will need a shepherd/guide to ensure it aligns with the
> > > > parquet-cpp implementation standards.
> > > >
> > > > Does the class structure in parquet-cpp have to be in one-to-one
> > > > correspondence with parquet-mr?
> > > >
> > > > I noticed that the parquet-mr Record Conversion API has abstract
> > > > classes like WriteSupport, ReadSupport, PrimitiveConverter,
> > > > GroupConverter, RecordMaterializer, ParquetInputFormat, and
> > > > ParquetOutputFormat which have to be implemented. I saw that these
> > > > classes are currently implemented by the avro, thrift and protobuf
> > > > converters (e.g.
> > > > https://github.com/apache/parquet-mr/tree/master/parquet-avro/src/main/java/org/apache/parquet/avro
> > > > )
> > > >
> > > > Would the parquet-cpp framework require the exact same structure?
> > > >
> > > > -Sandeep
> > > >
> > > > On Thu, Nov 2, 2017 at 8:27 PM, Wes McKinney <[email protected]> wrote:
> > > > >
> > > > > hi Sandeep,
> > > > >
> > > > > This is more than welcome to be implemented, though I personally
> > > > > have no need for it (I almost exclusively work with columnar data
> > > > > / Arrow). In addition to implementing the decoding to records, we
> > > > > would need to define a suitable record data structure in C++,
> > > > > which is a decent amount of work.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Nov 2, 2017 at 3:38 AM, Sandeep Joshi <[email protected]> wrote:
> > > > > > The parquet-mr version has the Record Conversion API
> > > > > > (RecordMaterializer, RecordConsumer) which can be used to
> > > > > > convert rows/tuples to and from the Parquet columnar format.
> > > > > >
> > > > > > https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api
> > > > > >
> > > > > > Are there any plans to add the same functionality to the
> > > > > > parquet-cpp codebase?
> > > > > >
> > > > > > I checked the JIRA and couldn't find any outstanding issue,
> > > > > > although the github README does say "The 3rd layer would handle
> > > > > > reading/writing records."
> > > > > > https://github.com/apache/parquet-cpp/blob/master/README.md
> > > > > >
> > > > > > -Sandeep
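Below are three short sketches of the paths discussed in this thread. First, the records->Arrow->Parquet route Uwe recommends: accumulate column values with Arrow builders, assemble an arrow::Table, and hand it to parquet::arrow::WriteTable. This is a minimal sketch, not a definitive implementation; the file name and column names are placeholders, and the Open()/builder signatures shown match the parquet-cpp era of this thread (later Arrow releases moved some of these calls to Result-returning forms).

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>
    #include <parquet/exception.h>

    #include <memory>

    void WriteRecordsAsParquet() {
      // One builder per column; the values would come from your record
      // source (e.g. the rapidjson documents mentioned above).
      arrow::Int64Builder id_builder;
      arrow::StringBuilder name_builder;
      PARQUET_THROW_NOT_OK(id_builder.Append(1));
      PARQUET_THROW_NOT_OK(name_builder.Append("alice"));

      std::shared_ptr<arrow::Array> ids, names;
      PARQUET_THROW_NOT_OK(id_builder.Finish(&ids));
      PARQUET_THROW_NOT_OK(name_builder.Finish(&names));

      auto schema = arrow::schema({arrow::field("id", arrow::int64()),
                                   arrow::field("name", arrow::utf8())});
      auto table = arrow::Table::Make(schema, {ids, names});

      // WriteTable drives the definition/repetition-level state machine
      // in parquet/arrow/writer.cc, so nested Arrow types (lists,
      // structs) are handled on this path as well.
      std::shared_ptr<arrow::io::FileOutputStream> outfile;
      PARQUET_THROW_NOT_OK(
          arrow::io::FileOutputStream::Open("records.parquet", &outfile));
      PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
          *table, arrow::default_memory_pool(), outfile, /*chunk_size=*/1024));
    }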
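Second, the definition/repetition levels Sandeep asks about. When writing through the Arrow layer they are produced by the state machine Uwe links; when using the low-level column writer you supply them yourself. A worked example for a nullable list-of-int64 column, assuming a parquet::Int64Writer obtained from a row-group writer (file and row-group setup elided to keep the sketch short):

    #include <parquet/api/writer.h>

    // Schema assumed for this column:
    //   message schema {
    //     optional group values (LIST) {
    //       repeated group list { optional int64 element; }
    //     }
    //   }
    // Max definition level 3, max repetition level 1. The three records
    //   [1, 2]    []    null
    // flatten to the vectors below. Only defined leaves appear in
    // `values`; the levels encode the nesting:
    //   def 3 = element present, def 2 = element null (unused here),
    //   def 1 = list present but empty, def 0 = list itself is null;
    //   rep 0 = starts a new record, rep 1 = continues the current list.
    void WriteThreeRecords(parquet::Int64Writer* writer) {
      int64_t values[] = {1, 2};
      int16_t def_levels[] = {3, 3, 1, 0};
      int16_t rep_levels[] = {0, 1, 0, 0};
      // The first argument counts level entries, not leaf values.
      writer->WriteBatch(4, def_levels, rep_levels, values);
    }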
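Finally, a sketch of the compile-time direction Uwe mentions for records whose structure is known at compile time. ArrowBuilderFor and TupleRecordWriter are invented here for illustration; nothing with these names exists in Arrow or parquet-cpp, and default-constructed builders assume a recent Arrow where they pick up the default memory pool. The point is only that std::tuple lets the set of column builders be derived from the record type:

    #include <arrow/api.h>

    #include <cstdint>
    #include <initializer_list>
    #include <string>
    #include <tuple>
    #include <utility>

    // Hypothetical mapping from C++ field types to Arrow builders.
    template <typename T> struct ArrowBuilderFor;
    template <> struct ArrowBuilderFor<int64_t>     { using type = arrow::Int64Builder;  };
    template <> struct ArrowBuilderFor<double>      { using type = arrow::DoubleBuilder; };
    template <> struct ArrowBuilderFor<std::string> { using type = arrow::StringBuilder; };

    // Holds one builder per tuple element; Append() fans a record's
    // fields out to the matching builders.
    template <typename... Fields>
    class TupleRecordWriter {
     public:
      void Append(const std::tuple<Fields...>& record) {
        AppendImpl(record, std::index_sequence_for<Fields...>{});
      }

     private:
      template <std::size_t... I>
      void AppendImpl(const std::tuple<Fields...>& record,
                      std::index_sequence<I...>) {
        // Expand over all field indices; Append() statuses are discarded
        // to keep the sketch short.
        (void)std::initializer_list<int>{
            ((void)std::get<I>(builders_).Append(std::get<I>(record)), 0)...};
      }

      std::tuple<typename ArrowBuilderFor<Fields>::type...> builders_;
    };

    int main() {
      TupleRecordWriter<int64_t, std::string> writer;
      writer.Append({1, "alice"});
      writer.Append({2, "bob"});
      // Finishing each builder into an arrow::Array and assembling an
      // arrow::Table would complete the path shown in the first sketch.
    }

A dynamic interface for records whose schema is only known at runtime (the rapidjson case) could not use this trick and would instead need converter-style classes along the lines of the parquet-mr API discussed above.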
