[jira] [Comment Edited] (ARROW-11016) [Rust] Parquet ArrayReader should allow reading a subset of row groups

Chao Sun (Jira) Mon, 28 Dec 2020 17:45:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255765#comment-17255765
 ]


Chao Sun edited comment on ARROW-11016 at 12/29/20, 1:44 AM:
-------------------------------------------------------------

Sorry for the late reply. Yes, I think it should be possible. On the file reader 
side we can pass in a (start, end) in addition to the file handle, to indicate 
that we only want to read a segment of the file. Then, after parsing the file 
metadata, we can check all the row groups in the file, determine which row 
group(s) overlap with the segment, and select only those.
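
A minimal sketch of that selection step, assuming (start, end) are byte offsets 
forming a half-open range and that each row group's starting offset and 
compressed size are available from the parsed file metadata. The 
`RowGroupRange` type and both function names are hypothetical and are not part 
of the parquet crate. The second function is a variant that picks a row group 
by its midpoint, which is one way to avoid selecting the same row group for two 
adjacent segments.

```rust
/// Hypothetical, simplified view of where a row group lives in the file.
/// This is an illustration only, not a type from the parquet crate.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupRange {
    /// Byte offset at which the row group starts.
    pub start: u64,
    /// Total compressed size of the row group in bytes.
    pub len: u64,
}

/// Returns the indices of row groups whose byte range
/// [start, start + len) overlaps the half-open segment [seg_start, seg_end).
pub fn overlapping_row_groups(
    groups: &[RowGroupRange],
    seg_start: u64,
    seg_end: u64,
) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.len > 0 && rg.start < seg_end && seg_start < rg.start + rg.len)
        .map(|(i, _)| i)
        .collect()
}

/// Variant: a row group is selected only if its midpoint falls inside the
/// segment, so non-overlapping segments never select the same row group.
pub fn row_groups_by_midpoint(
    groups: &[RowGroupRange],
    seg_start: u64,
    seg_end: u64,
) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let mid = rg.start + rg.len / 2;
            seg_start <= mid && mid < seg_end
        })
        .map(|(i, _)| i)
        .collect()
}
```

With plain overlap, a row group that straddles a segment boundary is returned 
for both neighbouring segments, so the midpoint variant is probably the better 
fit when the segments are handed to separate threads.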

You can probably check relevant code in 
[Spark|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L105]
 and 
[Parquet|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1223]
 for reference.

I'm not sure about the file handle sharing issue [~nevi_me] mentioned, though. 
I thought we used to clone the file handle so that it can be shared, but I 
haven't looked at the code base for some time :(


> [Rust] Parquet ArrayReader should allow reading a subset of row groups
> ----------------------------------------------------------------------
>
>                 Key: ARROW-11016
>                 URL: https://issues.apache.org/jira/browse/ARROW-11016
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: Andy Grove
>            Priority: Major
>
> Parquet ArrayReader currently only supports reading an entire file from start 
> to finish and does not allow selectively reading a subset of row groups. This 
> prevents us from parallelizing work across threads when processing a single 
> parquet file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
