[
https://issues.apache.org/jira/browse/PARQUET-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633934#comment-17633934
]
ASF GitHub Bot commented on PARQUET-2213:
-----------------------------------------
sunchao commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1021810987
##########
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##########
@@ -41,4 +41,16 @@ public interface InputFile {
*/
SeekableInputStream newStream() throws IOException;
+ /**
+ * Open a new {@link SeekableInputStream} for the underlying data file,
+ * in the range of '[offset, offset + length)'
+ *
+ * @param offset the offset in the file to read from
+ * @param length the total number of bytes to read
+ * @return a new {@link SeekableInputStream} to read the file
+ * @throws IOException if the stream cannot be opened
+ */
+ default SeekableInputStream newStream(long offset, long length) throws IOException {
Review Comment:
> If we need to read multiple parts of a Parquet file (e.g. different row
groups, the page index, the footer, etc.), should we call it multiple times,
once for each individual part?
Yes, that is the use case we are looking at: ideally we can do parallel
downloading of different `Chunk`s in `ParquetFileReader`, although this PR
doesn't cover that part of the changes yet.
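To make the intent concrete, here is a rough sketch (not part of this PR) of how a caller could fan range reads out to a thread pool using the proposed `newStream(offset, length)`; the offset/length arrays and the pool size are just placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

class RangeReadSketch {
  // Reads each [offset, offset + length) range on its own thread and returns
  // the downloaded bytes in the same order as the input ranges.
  static List<byte[]> readRanges(InputFile file, long[] offsets, int[] lengths)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary
    try {
      List<Future<byte[]>> futures = new ArrayList<>();
      for (int i = 0; i < offsets.length; i++) {
        final long offset = offsets[i];
        final int length = lengths[i];
        futures.add(pool.submit(() -> {
          // Each task opens its own range-limited stream and reads it fully.
          try (SeekableInputStream in = file.newStream(offset, length)) {
            byte[] buf = new byte[length];
            in.readFully(buf);
            return buf;
          }
        }));
      }
      List<byte[]> results = new ArrayList<>();
      for (Future<byte[]> f : futures) {
        results.add(f.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```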
For S3 in particular, we can use this API to download only a range of the
file (with `fs.s3a.experimental.input.fadvise` set to `random`), which could
give better performance.
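For reference, enabling random-access reads on S3A is just a configuration setting; the wrapper class below is only a sketch, and the property name is the one mentioned above:

```java
import org.apache.hadoop.conf.Configuration;

class S3ARandomReadConfig {
  static Configuration randomReadConf() {
    Configuration conf = new Configuration();
    // Ask the S3A connector to optimize for random (range) reads instead of
    // sequential whole-object reads, so a ranged stream doesn't pull the
    // entire file.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    return conf;
  }
}
```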
> The problem with `length` here: is it the split length or the file length?
With splittable text formats, different tasks may get their own split and are
allowed to read past it. This is why the FS builder API has two separate
options for file length and split start/end, and we can't use the split end as
the value for the file length.
This is more like a "range length". The idea is to break up a row group into
smaller ranges (think of column chunks, or consecutive pages in the case of the
column index) that can be processed in parallel.
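Just to illustrate the semantics (this is an assumption, not necessarily what the PR's default implementation does): a naive default could reuse the existing `newStream()` and seek to `offset`, treating `length` only as a hint:

```java
import java.io.IOException;
import org.apache.parquet.io.SeekableInputStream;

// Hypothetical mirror of InputFile, only to illustrate the "range length"
// semantics discussed above.
interface RangedInputFile {
  SeekableInputStream newStream() throws IOException;

  default SeekableInputStream newStream(long offset, long length) throws IOException {
    // Open a full-file stream and position it at the start of the range.
    // `length` is only a hint for how many bytes the caller intends to read;
    // this sketch does not enforce the upper bound.
    SeekableInputStream stream = newStream();
    stream.seek(offset);
    return stream;
  }
}
```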
> Add an alternative InputFile.newStream that allow an input range
> ----------------------------------------------------------------
>
> Key: PARQUET-2213
> URL: https://issues.apache.org/jira/browse/PARQUET-2213
> Project: Parquet
> Issue Type: Improvement
> Reporter: Chao Sun
> Priority: Minor
>