You can see some sample usages in the Python (Cython) wrappers for this code:

https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx
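
For reference, here is a rough, untested sketch of the FileWriter::Open
approach (the exact Open overload and include paths may differ between
parquet-cpp versions, so double-check writer.h; error handling is abbreviated):

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Write an arrow::Table to a Parquet file. FileWriter::Open converts the
// Arrow schema to a Parquet schema internally, so there is no need to call
// ToParquetSchema or to manage a GroupNode yourself.
::arrow::Status WriteArrowTable(const std::shared_ptr<::arrow::Table>& table,
                                const std::string& path) {
  ::arrow::MemoryPool* pool = ::arrow::default_memory_pool();

  std::shared_ptr<::arrow::io::FileOutputStream> sink;
  ::arrow::Status st = ::arrow::io::FileOutputStream::Open(path, &sink);
  if (!st.ok()) return st;

  std::unique_ptr<parquet::arrow::FileWriter> writer;
  st = parquet::arrow::FileWriter::Open(*table->schema(), pool, sink,
                                        ::parquet::default_writer_properties(),
                                        &writer);
  if (!st.ok()) return st;

  st = writer->WriteTable(*table, 65536 /* rows per row group */);
  if (!st.ok()) return st;
  return writer->Close();
}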

On Mon, Nov 27, 2017 at 6:58 AM, Sandeep Joshi <[email protected]> wrote:
> thanks!  Is there sample code on how to use these APIs to learn best
> practices?
>
> I am looking at
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/python
> but that only covers Arrow itself
>
> -Sandeep
>
> On Sun, Nov 26, 2017 at 9:57 PM, Wes McKinney <[email protected]> wrote:
>
>> I think you want to use parquet::arrow::FileWriter::Open
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L112
>>
>> The implementation is here:
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L992
>>
>> - Wes
>>
>> On Sun, Nov 26, 2017 at 8:25 AM, Sandeep Joshi <[email protected]>
>> wrote:
>> > This might seem like a dumb question, but I am not yet familiar enough
>> > with the API to figure out how to get around this problem.
>> >
>> > I have a pre-defined Arrow Schema which I convert to Parquet Schema using
>> > the "ToParquetSchema" function.  This returns a SchemaDescriptor object.
>> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/schema.h#L80
>> >
>> > ParquetFileWriter, on the other hand, expects a shared_ptr<GroupNode>:
>> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/writer.h#L126
>> >
>> > SchemaDescriptor can return a raw pointer to the GroupNode, but to pass it
>> > to ParquetFileWriter I need a shared_ptr. This introduces memory
>> > management complications, and I'd rather not create a copy of the GroupNode
>> > just to pass it to ParquetFileWriter.
>> >
>> >   // convert arrow schema to parquet schema
>> >   std::shared_ptr<SchemaDescriptor> parquet_schema;
>> >   std::shared_ptr<::parquet::WriterProperties> properties =
>> >       ::parquet::default_writer_properties();
>> >   ToParquetSchema(arrow_sch.get(), *properties.get(), &parquet_schema);
>> >
>> >   // write arrow table to parquet
>> >   parquet::schema::GroupNode* g =
>> >       (parquet::schema::GroupNode*)parquet_schema->group_node();
>> >   grp_node.reset(g);  // Don't want to do this!
>> >   std::shared_ptr<::arrow::io::FileOutputStream> sink;
>> >   ::arrow::io::FileOutputStream::Open(path, &sink);
>> >   std::unique_ptr<FileWriter> arrow_writer(
>> >       new FileWriter(pool, ParquetFileWriter::Open(sink, grp_node)));
>> >
>> >   arrow_writer->WriteTable(*new_table_ptr.get(), 65536);
>> >
>> > Is this an API limitation that no one has hit before, or am I missing a
>> > better way of writing Parquet files given a pre-defined Arrow schema?
>> >
>> > -Sandeep
>>
