[GitHub] [parquet-mr] steveloughran commented on pull request #1010: PARQUET-2213: add InputFile.newStream with a read range

GitBox Mon, 14 Nov 2022 03:48:51 -0800


steveloughran commented on PR #1010:
URL: https://github.com/apache/parquet-mr/pull/1010#issuecomment-1313557921


   I would prefer Parquet to use the same opt(key, value) builder pattern that 
we use in the new Hadoop FS API calls:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/fsdatainputstreambuilder.html
   
   This allows new options to be added in the future. The reader could then 
take them and, where appropriate, map them to the Hadoop openFile() options in 
org.apache.hadoop.fs.Options.OpenFileOptions#FS_OPTION_OPENFILE_STANDARD_OPTIONS,
 which can then get picked up by the connector.
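
   A minimal sketch of that idea, using only the JDK. The ReadOptionsBuilder 
and HadoopOptionMapper classes and the "parquet.read.*" keys are hypothetical 
names invented for illustration, not anything in parquet-mr or Hadoop. The 
"fs.option.openfile.*" strings are the standard openFile() option keys as I 
understand the Hadoop documentation; check them against OpenFileOptions 
before relying on them.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder: collects string options via opt(key, value).
final class ReadOptionsBuilder {
    private final Map<String, String> options = new LinkedHashMap<>();

    // Record an option; a later call with the same key overwrites the value.
    ReadOptionsBuilder opt(String key, String value) {
        options.put(key, value);
        return this;
    }

    // Convenience overload for numeric options such as lengths and offsets.
    ReadOptionsBuilder opt(String key, long value) {
        return opt(key, Long.toString(value));
    }

    // Return an unmodifiable snapshot of the options set so far.
    Map<String, String> build() {
        return Collections.unmodifiableMap(new LinkedHashMap<>(options));
    }
}

// Hypothetical mapper from Parquet-side keys to Hadoop openFile() keys.
final class HadoopOptionMapper {
    // Assumed Parquet-side key names, invented for this sketch.
    static final String PARQUET_FILE_LENGTH = "parquet.read.file.length";
    static final String PARQUET_SPLIT_START = "parquet.read.split.start";
    static final String PARQUET_SPLIT_END = "parquet.read.split.end";

    private static final Map<String, String> TRANSLATION = new LinkedHashMap<>();
    static {
        TRANSLATION.put(PARQUET_FILE_LENGTH, "fs.option.openfile.length");
        TRANSLATION.put(PARQUET_SPLIT_START, "fs.option.openfile.split.start");
        TRANSLATION.put(PARQUET_SPLIT_END, "fs.option.openfile.split.end");
    }

    // Translate known keys; options with no Hadoop equivalent are dropped.
    static Map<String, String> toHadoopOptions(Map<String, String> parquetOptions) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parquetOptions.entrySet()) {
            String hadoopKey = TRANSLATION.get(e.getKey());
            if (hadoopKey != null) {
                mapped.put(hadoopKey, e.getValue());
            }
        }
        return mapped;
    }
}
```

   In a Hadoop-backed reader the mapped entries would then be passed one by 
one to the opt() calls of the builder returned by FileSystem.openFile(path). 
Because unknown keys are simply not translated, new Parquet options can be 
added later without breaking older readers.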
   
   Passing in the split start/end and the file length is good:
   file length: used by s3a to skip the HEAD request when opening; abfs and gcs 
could do the same. abfs will use a FileStatus passed in through 
withFileStatus().
   split start: where to begin the read.
   split end: should be used by prefetchers to know where to stop prefetching.
   
   Parquet should set the read policy itself. I'd go for "random, adaptive" as 
the ordered list, with "vectored" in front of that when vectored IO is to be 
used.
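
   As a rough illustration of that ordering, the value could be assembled like 
this. ReadPolicies and readPolicy() are hypothetical names, and the policy 
strings are copied from the suggestion above; the exact names a given Hadoop 
release recognises should be checked against its openFile() documentation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the ordered, comma-separated read policy list.
final class ReadPolicies {
    // Policies are listed in order of preference, most preferred first.
    static String readPolicy(boolean vectoredIo) {
        List<String> policies = new ArrayList<>();
        if (vectoredIo) {
            policies.add("vectored");  // name as written in the comment above
        }
        policies.add("random");
        policies.add("adaptive");
        return String.join(", ", policies);
    }
}
```

   The resulting string would be the value set for the read policy option when 
opening the file.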
   


