Ted-Jiang commented on PR #1977:
URL: https://github.com/apache/arrow-rs/pull/1977#issuecomment-1172476897

   > I've had a quick review, unfortunately I think this is missing a key 
detail. In particular the arrow writer must read the same records from each of 
its columns. As written this simply skips reading pruned pages from columns. 
There is no relationship between the page boundaries across columns within a 
parquet, and therefore this will return different rows for each of the columns.
   
   Thanks @tustvold, you are right. Maybe I made the title confusing 😭. As you 
mentioned in [#1791 
(review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857):
   
   >Pass row selection down to RecordReader
   >Add a skip_next_page to PageReader
   >Add a skip_values to ColumnValueDecoder
   
   This PR is only about the `skip_next_page` part: the iterator will return 
only the metadata of the pages that are needed. As for making each column read 
the same records (row alignment), I would prefer to support that in the next 
PR. I'd rather keep them separate to avoid a huge PR and merge conflicts. If 
you prefer to combine them, I will mark this PR as in progress and keep 
developing.
   
   > As described in [#1791 
(review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857),
 you will need to extract the row selection in addition to the page selection, 
and push this into RecordReader and ColumnValueDecoder. This will also make the 
API clearer, as we aren't going behind their back and skipping pages at the 
block-level
   
   As above, the `row_ranges` will need to be passed to ColumnValueReader in a 
future PR.
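
   To illustrate the alignment problem, here is a minimal sketch. It is not the 
arrow-rs API: the function names and the modelling of a page as a plain range 
of row indices are assumptions made only for this example. It compares 
skipping whole pages with skipping pages and then also applying the row 
selection inside the pages that remain.

```rust
use std::ops::Range;

/// Rows a column yields if pages are only kept or skipped as a whole.
/// Each page is modelled (hypothetically) as the half-open range of row
/// indices it covers; `selection` is the half-open range of wanted rows.
fn rows_after_page_skip(pages: &[Range<usize>], selection: &Range<usize>) -> Vec<usize> {
    pages
        .iter()
        // Keep a page if it overlaps the selection at all.
        .filter(|p| p.start < selection.end && selection.start < p.end)
        // A kept page contributes every row it contains.
        .flat_map(|p| p.clone())
        .collect()
}

/// Rows a column yields if, after page skipping, the row selection is also
/// applied within the surviving pages.
fn rows_after_row_skip(pages: &[Range<usize>], selection: &Range<usize>) -> Vec<usize> {
    rows_after_page_skip(pages, selection)
        .into_iter()
        .filter(|row| selection.contains(row))
        .collect()
}
```

   For example, take 12 rows where column A has pages of 4 rows and column B 
has pages of 6 rows, and select rows 5..7. With page skipping alone, A yields 
rows 4 to 7 while B yields all 12 rows, so the columns disagree. With the row 
selection applied as well, both yield rows 5 and 6.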
   
   
   
   

