Ted-Jiang commented on PR #3616:
URL: 
https://github.com/apache/arrow-datafusion/pull/3616#issuecomment-1258890941

   > AsyncFileReader
   
   @thinkharderdev Thanks! If I understand correctly, we should make the read-index API `async`, so I will file a ticket in arrow-rs to add an async version of the code below and make the reads concurrent, as you mentioned.
   ```
   // TODO: add an async version in arrow-rs to avoid reading the whole file.
   let bytes = store.get_range(&meta.location, 0..meta.size).await?;
   let mut location_vec = vec![];
   let mut index_vec = vec![];
   for rg in result_meta.row_groups() {
       location_vec.push(index_reader::read_pages_locations(&bytes, rg.columns())?);
       index_vec.push(index_reader::read_columns_indexes(&bytes, rg.columns())?);
   }
   ```
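   For illustration, a minimal sketch of what the async version could look like: fetch only the byte range covering each row group's column/offset indexes and issue the `get_range` calls concurrently. `decode_indexes` is a hypothetical helper standing in for the parsing `index_reader` does today, and the range arithmetic assumes the index offsets/lengths are present in the footer metadata.
   ```
   // A sketch only, not the final arrow-rs API. `store`, `meta` and
   // `result_meta` are as in the snippet above.
   use futures::future::try_join_all;

   let store = &store;
   let location = &meta.location;
   let fetches = result_meta.row_groups().iter().map(|rg| {
       // Smallest byte range covering this row group's column index and
       // offset index, taken from the footer metadata.
       let start = rg
           .columns()
           .iter()
           .filter_map(|c| c.column_index_offset())
           .min()
           .unwrap_or(0) as usize;
       let end = rg
           .columns()
           .iter()
           .filter_map(|c| c.offset_index_offset().zip(c.offset_index_length()))
           .map(|(o, l)| (o + l as i64) as usize)
           .max()
           .unwrap_or(start);
       async move {
           // Only the index bytes are fetched, not the whole file.
           let bytes = store.get_range(location, start..end).await?;
           // `decode_indexes` is hypothetical: it would parse the column and
           // offset indexes for `rg` out of the fetched bytes.
           decode_indexes(bytes, start, rg.columns())
       }
   });
   // All row groups are fetched concurrently instead of one big read.
   let per_row_group = try_join_all(fetches).await?;
   ```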
   I prefer to keep reading the page index in `ParquetFileReader::get_metadata` and save it in the `ParquetMetaData` struct already defined in arrow-rs:
   ```
   /// Global Parquet metadata.
   #[derive(Debug, Clone)]
   pub struct ParquetMetaData {
       file_metadata: FileMetaData,
       row_groups: Vec<RowGroupMetaData>,
       /// Page index for all pages in each column chunk
       page_indexes: Option<ParquetColumnIndex>,
       /// Offset index for all pages in each column chunk
       offset_indexes: Option<ParquetOffsetIndex>,
   }
   ```
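   Roughly, the idea is something like the following sketch, assuming the reader holds the object store handle and object metadata, and assuming arrow-rs's `ParquetMetaData::new_with_page_index` constructor; `load_footer_metadata` and `fetch_page_indexes` are hypothetical helpers, and the other trait methods are elided:
   ```
   use std::sync::Arc;
   use futures::future::BoxFuture;
   use parquet::arrow::async_reader::AsyncFileReader;
   use parquet::file::metadata::ParquetMetaData;

   impl AsyncFileReader for ParquetFileReader {
       // (get_bytes / get_byte_ranges elided in this sketch)
       fn get_metadata(
           &mut self,
       ) -> BoxFuture<'_, parquet::errors::Result<Arc<ParquetMetaData>>> {
           Box::pin(async move {
               // Read the footer as today (hypothetical helper).
               let metadata = self.load_footer_metadata().await?;
               // Fetch and decode the page/offset indexes (hypothetical helper).
               let (page_indexes, offset_indexes) =
                   fetch_page_indexes(&self.store, &self.meta, &metadata).await?;
               // Attach them so downstream code finds them on ParquetMetaData.
               Ok(Arc::new(ParquetMetaData::new_with_page_index(
                   metadata.file_metadata().clone(),
                   metadata.row_groups().to_vec(),
                   Some(page_indexes),
                   Some(offset_indexes),
               )))
           })
       }
   }
   ```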
   because it reduces the amount of code change.
   Secondly, when opening a parquet file the first thing we do is read the file metadata; the later steps `build_row_filter` and `build_selection_base_on_index` (todo) should both depend on it, as in the sketch below.🤔
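   In other words, the intended ordering in the open path would be roughly:
   ```
   // Illustrative ordering only; signatures are not the PR's actual API.
   let metadata = reader.get_metadata().await?;               // footer + page index
   let row_filter = build_row_filter(&predicate, &metadata)?; // existing step
   let selection = build_selection_base_on_index(&metadata)?; // todo in this PR
   ```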
   

