Gawain BOLTON created PARQUET-1678:
--------------------------------------

             Summary: [C++] Provide classes for reading/writing using 
input/output operators
                 Key: PARQUET-1678
                 URL: https://issues.apache.org/jira/browse/PARQUET-1678
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Gawain BOLTON


The current Parquet APIs allow for reading/writing data using either:
 # A high-level API whereby all the data for each column is given to an 
arrow::*Builder class.
 # Or a low-level API using parquet::*Writer classes which allows for a column 
to be selected and data items added to the column as needed.

Using the low-level approach gives great flexibility but makes for cumbersome 
code and requires casting each column to the required type.

I propose offering StreamReader and StreamWriter classes with C++ input/output 
operators allowing for data to be written like this:
{code:java}
// N.B. schema has 3 columns of type std::string, std::int32_t and float.
auto file_writer{ parquet::ParquetFileWriter::Open(...) };
StreamWriter sw{ file_writer };
// Write to output file using output operator.
sw << "A string" << 3 << 4.5f;
sw.nextRow();
...{code}
 

Similarly, reading would be done as follows:
{code:java}
auto file_reader{ parquet::ParquetFileReader::Open(...) };
StreamReader sr{ file_reader };
std::string s; std::int32_t i; float f;
sr >> s >> i >> f;
sr.nextRow();{code}
I have written such classes and an example file which shows how to use them.
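
To make the idea concrete, here is a rough, self-contained sketch of how such an output operator could check types against the schema. It is only an illustration under stated assumptions: it does not use Parquet at all, and `ToyStreamWriter`, `ColumnType` and `writeExampleRows()` are hypothetical names invented for this sketch rather than the proposed classes. Values are kept in memory as text to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy types: none of these names come from parquet-cpp.
enum class ColumnType { kString, kInt32, kFloat };

class ToyStreamWriter {
 public:
  explicit ToyStreamWriter(const std::vector<ColumnType>& schema)
      : schema_(schema) {}

  // One overload per supported type. Each one checks that the value's type
  // matches the schema column at the current position, so callers never cast.
  ToyStreamWriter& operator<<(const char* v) {
    return put(ColumnType::kString, std::string(v));
  }
  ToyStreamWriter& operator<<(const std::string& v) {
    return put(ColumnType::kString, v);
  }
  ToyStreamWriter& operator<<(std::int32_t v) {
    return put(ColumnType::kInt32, std::to_string(v));
  }
  ToyStreamWriter& operator<<(float v) {
    return put(ColumnType::kFloat, std::to_string(v));
  }

  // Finishes the current row; every column must have been written.
  void nextRow() {
    if (current_.size() != schema_.size()) {
      throw std::logic_error("incomplete row");
    }
    rows_.push_back(current_);
    current_.clear();
  }

  const std::vector<std::vector<std::string>>& rows() const { return rows_; }

 private:
  ToyStreamWriter& put(ColumnType type, const std::string& text) {
    if (current_.size() >= schema_.size()) {
      throw std::logic_error("too many values in row");
    }
    if (schema_[current_.size()] != type) {
      throw std::logic_error("column type mismatch");
    }
    current_.push_back(text);  // stored as text only to keep the sketch short
    return *this;
  }

  std::vector<ColumnType> schema_;
  std::vector<std::string> current_;
  std::vector<std::vector<std::string>> rows_;
};

// Usage mirroring the example above: a string, an int32 and a float per row.
inline ToyStreamWriter writeExampleRows() {
  const std::vector<ColumnType> schema{
      ColumnType::kString, ColumnType::kInt32, ColumnType::kFloat};
  ToyStreamWriter sw(schema);
  sw << "A string" << 3 << 4.5f;
  sw.nextRow();
  sw << std::string("Another") << 7 << 0.5f;
  sw.nextRow();
  assert(sw.rows().size() == 2);
  return sw;
}
```

In this sketch, writing a value of the wrong type for the current column (for example an integer where the schema expects a string) throws `std::logic_error` instead of silently corrupting the row. A real implementation would forward each value to the matching column writer instead of buffering text.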

I think that they allow for a simpler and more natural API since:
 * No casting is needed.
 * Code is simple and easy to read.
 * User-defined types are easily accommodated by having the user provide the 
input/output operator for the type.
 * Row groups can be created "automatically" when a given amount of user data 
has been written, or explicitly by a StreamWriter method such as 
"createNewRowGroup()"

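The last two points in the list above can be illustrated with another toy sketch. Again this is an assumption-laden illustration, not the proposed implementation: `ToyRowGroupWriter`, `Trade` and `writeTrades()` are hypothetical names, no Parquet file is written, and the "amount of user data" threshold is simplified to a row count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a stream writer; it only buffers the current row
// and counts rows and row groups.
class ToyRowGroupWriter {
 public:
  explicit ToyRowGroupWriter(std::size_t max_rows_per_group)
      : max_rows_per_group_(max_rows_per_group) {}

  ToyRowGroupWriter& operator<<(const std::string& v) {
    current_row_.push_back(v);
    return *this;
  }
  ToyRowGroupWriter& operator<<(std::int32_t v) {
    current_row_.push_back(std::to_string(v));
    return *this;
  }

  // Ends the row; a new row group is started "automatically" once the
  // configured number of rows has been reached.
  void nextRow() {
    current_row_.clear();
    if (++rows_in_group_ == max_rows_per_group_) {
      createNewRowGroup();
    }
  }

  // Explicit variant, for callers who want to control row group boundaries.
  void createNewRowGroup() {
    if (rows_in_group_ == 0) return;  // nothing buffered, nothing to close
    ++closed_groups_;
    rows_in_group_ = 0;
  }

  std::size_t closed_groups() const { return closed_groups_; }
  std::size_t rows_in_group() const { return rows_in_group_; }

 private:
  std::size_t max_rows_per_group_;
  std::size_t rows_in_group_ = 0;
  std::size_t closed_groups_ = 0;
  std::vector<std::string> current_row_;
};

// A user-defined type (hypothetical) with two fields, one per column.
struct Trade {
  std::string symbol;
  std::int32_t quantity;
};

// The output operator is supplied by the user, not by the library.
inline ToyRowGroupWriter& operator<<(ToyRowGroupWriter& w, const Trade& t) {
  return w << t.symbol << t.quantity;
}

// Writes three rows with a limit of two rows per group, then closes the
// final, partially filled group explicitly. Returns the number of groups.
inline std::size_t writeTrades() {
  ToyRowGroupWriter w(2);
  const std::vector<Trade> trades = {{"AAA", 10}, {"BBB", 20}, {"CCC", 30}};
  for (const Trade& t : trades) {
    w << t;
    w.nextRow();
  }
  w.createNewRowGroup();
  assert(w.rows_in_group() == 0);
  return w.closed_groups();
}
```

Here the first two rows fill one row group automatically and the explicit call closes the second, so `writeTrades()` reports two row groups. A size-based threshold (bytes written rather than rows) would follow the same pattern.
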
I have created this ticket because where I work (www.cfm.fr) we are very 
interested in using Parquet, but our users have requested a stream-like API. 
We think others might also be interested in this functionality.



