Xianjin YE created PARQUET-1166:
-----------------------------------

             Summary: [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
                 Key: PARQUET-1166
                 URL: https://issues.apache.org/jira/browse/PARQUET-1166
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-cpp
            Reporter: Xianjin YE


Hi, I'd like to propose a new API to better support splittable reading of 
Parquet files.

The intent of this API is to allow reading a selected set of RowGroups (normally 
contiguous, but they can be arbitrary as long as the row_group_indices are sorted 
and unique, [1, 3, 5] for example).
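
To make the "sorted and unique" requirement concrete, here is a minimal sketch of how a caller might check a selection before passing it in. IsValidRowGroupSelection and its num_row_groups parameter are hypothetical names for illustration only; they are not part of parquet-cpp, and the bounds check is an assumption beyond what the proposal states.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): returns true when `indices`
// is strictly increasing (i.e. sorted and unique) and every entry lies in
// [0, num_row_groups). The range check is an added assumption.
bool IsValidRowGroupSelection(const std::vector<int>& indices,
                              int num_row_groups) {
  for (std::size_t i = 0; i < indices.size(); ++i) {
    // Reject indices outside the file's row group range.
    if (indices[i] < 0 || indices[i] >= num_row_groups) return false;
    // Strictly increasing implies both sorted and free of duplicates.
    if (i > 0 && indices[i] <= indices[i - 1]) return false;
  }
  return true;
}
```

With this check, [1, 3, 5] is accepted for a file with at least six RowGroups, while [3, 1, 5] and [1, 1] are rejected.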

The proposed API would be something like this:

{code:java}
::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);

::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                     const std::vector<int>& column_indices,
                                     std::shared_ptr<::arrow::RecordBatchReader>* out);
{code}

With the new API, a Parquet file can be split by RowGroup and the splits can be 
processed by multiple tasks (possibly on different hosts, like Map tasks in MapReduce).
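
As a rough illustration of the splitting step, the sketch below assigns RowGroup indices to tasks in contiguous chunks; each chunk could then be passed as row_group_indices to the proposed GetRecordBatchReader. SplitRowGroups is a hypothetical helper, not an existing parquet-cpp function, and the even contiguous split is just one possible policy.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): assigns row groups
// 0 .. num_row_groups-1 to num_tasks contiguous chunks whose sizes differ
// by at most one. Each chunk is sorted and unique by construction.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups,
                                             int num_tasks) {
  assert(num_row_groups >= 0 && num_tasks > 0);
  std::vector<std::vector<int>> splits(num_tasks);
  const int base = num_row_groups / num_tasks;   // minimum chunk size
  const int extra = num_row_groups % num_tasks;  // first `extra` chunks get one more
  int next = 0;
  for (int t = 0; t < num_tasks; ++t) {
    const int count = base + (t < extra ? 1 : 0);
    for (int i = 0; i < count; ++i) splits[t].push_back(next++);
  }
  return splits;
}
```

For example, seven RowGroups over three tasks would give the chunks [0, 1, 2], [3, 4] and [5, 6]. A real scheduler might prefer to balance by RowGroup byte size rather than by count.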



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
