[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #1010: PARQUET-2213: add InputFile.newStream with a read range

2022-11-15 Thread GitBox


steveloughran commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1022697746


##
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##
@@ -41,4 +41,16 @@ public interface InputFile {
*/
   SeekableInputStream newStream() throws IOException;
 
+  /**
+   * Open a new {@link SeekableInputStream} for the underlying data file,
+   * in the range of '[offset, offset + length)'
+   *
+   * @param offset the offset in the file to read from
+   * @param length the total number of bytes to read
+   * @return a new {@link SeekableInputStream} to read the file
+   * @throws IOException if the stream cannot be opened
+   */
+  default SeekableInputStream newStream(long offset, long length) throws 
IOException {

Review Comment:
   you should go with the hadoop 
https://issues.apache.org/jira/browse/HADOOP-16202 options; s3a fs now reads 
them and it lines up abfs/gcs for the same. you can declare split start/end as 
well as file length so that
   * length => client can skip existence probes, they know the file limit
   * spilt range: they know to not prefetch past the end of the split, if they 
prefetch
   * read policy: standard set of policies and a parse policy of "csv list of 
policies -pick the first one you recognise". again, can be used by all the 
stores



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #1010: PARQUET-2213: add InputFile.newStream with a read range

2022-11-14 Thread GitBox


steveloughran commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1021474458


##
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##
@@ -41,4 +41,16 @@ public interface InputFile {
*/
   SeekableInputStream newStream() throws IOException;
 
+  /**
+   * Open a new {@link SeekableInputStream} for the underlying data file,
+   * in the range of '[offset, offset + length)'
+   *
+   * @param offset the offset in the file to read from
+   * @param length the total number of bytes to read
+   * @return a new {@link SeekableInputStream} to read the file
+   * @throws IOException if the stream cannot be opened
+   */
+  default SeekableInputStream newStream(long offset, long length) throws 
IOException {

Review Comment:
   problem with length here is it *split length* or *file length*? as with 
splittable text formats different tasks may get their own split and are allowed 
to read past it. this is why the fs builder api has two different options for 
file len and split start/end and we can't use split end as the value for file 
length.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] steveloughran commented on a diff in pull request #1010: PARQUET-2213: add InputFile.newStream with a read range

2022-11-14 Thread GitBox


steveloughran commented on code in PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#discussion_r1021472465


##
parquet-common/src/main/java/org/apache/parquet/io/InputFile.java:
##
@@ -41,4 +41,16 @@ public interface InputFile {
*/
   SeekableInputStream newStream() throws IOException;
 
+  /**
+   * Open a new {@link SeekableInputStream} for the underlying data file,
+   * in the range of '[offset, offset + length)'
+   *
+   * @param offset the offset in the file to read from
+   * @param length the total number of bytes to read
+   * @return a new {@link SeekableInputStream} to read the file
+   * @throws IOException if the stream cannot be opened
+   */
+  default SeekableInputStream newStream(long offset, long length) throws 
IOException {

Review Comment:
   this is what the vectored IO API of hadoop 3.3.5, an API for which we intend 
to provide in a compatibility library for apps running against hadoop 3.2.x+. 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md#default-void-readvectoredlist-extends-filerange-ranges-intfunctionbytebuffer-allocate



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org