[GitHub] [arrow-datafusion] Ted-Jiang commented on pull request #3616: [feat] Support using offset index in ParquetRecordBatchStream when pu…

GitBox Tue, 27 Sep 2022 06:25:43 -0700


Ted-Jiang commented on PR #3616:
URL: 
https://github.com/apache/arrow-datafusion/pull/3616#issuecomment-1259506416


   > > > AsyncFileReader
   > > 
   > > 
   > > @thinkharderdev Thanks! If I am right, we should make the read-index API
   > > `async`, so I will file a ticket in arrow-rs to replace the code below with
   > > an async version based on `AsyncFileReader::get_ranges`, so that the reads
   > > run concurrently, as you mentioned.
   > > ```
   > >         // TODO: add an async version in arrow-rs to avoid reading the whole file.
   > >         let bytes = store.get_range(&meta.location, 0..meta.size).await?;
   > >         let mut location_vec = vec![];
   > >         let mut index_vec = vec![];
   > >         for rg in result_meta.row_groups() {
   > >             location_vec.push(index_reader::read_pages_locations(&bytes, rg.columns())?);
   > >             index_vec.push(index_reader::read_columns_indexes(&bytes, rg.columns())?);
   > >         }
   > > ```
   > > 
   > > I prefer to keep reading the page index in `ParquetFileReader::get_metadata`
   > > and to save it in `ParquetMetaData`, which is already defined in arrow-rs
   > > ```
   > > /// Global Parquet metadata.
   > > #[derive(Debug, Clone)]
   > > pub struct ParquetMetaData {
   > >     file_metadata: FileMetaData,
   > >     row_groups: Vec<RowGroupMetaData>,
   > >     /// Page index for all pages in each column chunk
   > >     page_indexes: Option<ParquetColumnIndex>,
   > >     /// Offset index for all pages in each column chunk
   > >     offset_indexes: Option<ParquetOffsetIndex>,
   > > }
   > > ```
   > > 
   > > because it reduces the code change. Secondly, when opening a Parquet file,
   > > the first thing we should do is read the file metadata; the steps that
   > > follow, `build_row_filter` and `build_selection_base_on_index` (todo),
   > > should depend on it. 🤔
   > 
   > I think we may be talking about different things :).
   > 
   > I'm saying the code to fetch the indexes already exists in arrow-rs, so we
   > don't need to duplicate it in datafusion. You can just construct
   > `ArrowReaderOptions` with the page index enabled, and
   > `ParquetRecordBatchStreamBuilder` will fetch the indexes (and do so
   > concurrently); see
   > https://github.com/apache/arrow-rs/blob/a7cf274765945af4111fddaeec26d672715de9d0/parquet/src/arrow/async_reader.rs#L225
   > 
   > ```
   > // Start from the defaults; the page index is only loaded when requested.
   > let mut options = ArrowReaderOptions::new();
   > 
   > if enable_page_index {
   >    options = options.with_page_index(true);
   > }
   > 
   > let builder =
   >    ParquetRecordBatchStreamBuilder::new_with_options(async_reader, options)
   >       .await?;
   > ```
   
   Oh! I missed this part 😂
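
For reference, the TODO quoted above ("avoid reading the whole file") comes down to requesting only the byte ranges that hold the column and offset indexes, instead of `0..meta.size`. Below is a minimal, std-only sketch of that range computation. It is an illustration under assumptions, not the arrow-rs implementation: `ChunkIndexLocation`, `index_ranges` and `coalesce` are hypothetical names, and the struct fields are only loosely modelled on the index offset/length fields of the Parquet column chunk metadata, with types simplified to unsigned integers.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the index-location fields of one column chunk.
/// Illustrative only; this is not a type from arrow-rs or datafusion.
#[derive(Debug, Clone, Copy, Default)]
pub struct ChunkIndexLocation {
    pub column_index_offset: Option<u64>,
    pub column_index_length: Option<u32>,
    pub offset_index_offset: Option<u64>,
    pub offset_index_length: Option<u32>,
}

/// Turn an optional (offset, length) pair into a byte range,
/// skipping indexes that are absent or empty.
fn to_range(offset: Option<u64>, length: Option<u32>) -> Option<Range<u64>> {
    match (offset, length) {
        (Some(o), Some(l)) if l > 0 => Some(o..o + u64::from(l)),
        _ => None,
    }
}

/// Collect the byte ranges of every column index and offset index
/// in one row group, in column order.
pub fn index_ranges(chunks: &[ChunkIndexLocation]) -> Vec<Range<u64>> {
    chunks
        .iter()
        .flat_map(|c| {
            to_range(c.column_index_offset, c.column_index_length)
                .into_iter()
                .chain(to_range(c.offset_index_offset, c.offset_index_length))
        })
        .collect()
}

/// Merge overlapping or adjacent ranges so that fewer requests are issued.
pub fn coalesce(mut ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end {
                // Extend the previous range instead of starting a new one.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}
```

The coalesced ranges could then be handed to whatever ranged-read call the reader exposes and fetched concurrently. As the quoted reply points out, `ParquetRecordBatchStreamBuilder` already does the index fetching when the page index is enabled, so a sketch like this is only useful for understanding the idea, not something datafusion would need to duplicate.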


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

