[ 
https://issues.apache.org/jira/browse/AVRO-3594?focusedWorklogId=798337&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-798337
 ]

ASF GitHub Bot logged work on AVRO-3594:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Aug/22 09:23
            Start Date: 05/Aug/22 09:23
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on code in PR #1807:
URL: https://github.com/apache/avro/pull/1807#discussion_r938627811


##########
lang/java/mapred/src/main/java/org/apache/avro/mapred/FsInput.java:
##########
@@ -41,7 +43,15 @@ public FsInput(Path path, Configuration conf) throws IOException {
   /** Construct given a path and a {@code FileSystem}. */
   public FsInput(Path path, FileSystem fileSystem) throws IOException {
     this.len = fileSystem.getFileStatus(path).getLen();
-    this.stream = fileSystem.open(path);
+    // use the hadoop 3.3.0 openFile API and specify length
+    // and read policy. object stores can use these to
+    // optimize read performance.
+    // the read policy "adaptive" means "start sequential but
+    // go to random IO after backwards seeks"
+    // Filesystems which don't recognize the options will ignore them
+
+    this.stream = awaitFuture(fileSystem.openFile(path).opt("fs.option.openfile.read.policy", "adaptive")
+        .opt("fs.option.openfile.length", Long.toString(len)).build());

Review Comment:
   We only added constants for those options in branch-3.3; code using them 
wouldn't compile/link against 3.3.0, hence the string literals. Unfortunately it 
also means you don't get that speedup until 3.3.5 ships, but the code will be 
ready for that release.
   3.3.0 did add the `withFileStatus(FileStatus)` parameter, which was my 
original design for passing in all file status info, including etag and maybe 
version, so you can go straight from listing to opening.
   
   That went into s3a first, with abfs added in 3.3.5, but it is too brittle: 
the path check requires status.getPath() to equal the path being opened, and 
Hive with its wrapper fs doesn't always do that.
   Passing in the file length is guaranteed to be either ignored or actually 
used, so there is no brittleness. It also suits Hive, where workers know the 
length of the file but don't have a status.
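
   A minimal sketch of why the status check is brittle while the length option 
is not. It is plain Java, not Hadoop code: `java.net.URI` stands in for `Path`, 
the `wrapperfs://` scheme and both method names are hypothetical, and the 
strict equality test only models the check described above.

```java
import java.net.URI;

/** Illustration only; none of this is Hadoop API. */
final class StatusPathCheckSketch {

  /** Models the strict check: the status path must equal the opened path. */
  static boolean statusAccepted(URI statusPath, URI openedPath) {
    return statusPath.equals(openedPath);
  }

  /** A length travels as an option string; there is no path to mismatch. */
  static String lengthOption(long len) {
    return Long.toString(len);
  }

  public static void main(String[] args) {
    // Hypothetical paths: the same object seen directly and through a wrapper fs.
    URI listed = URI.create("s3a://bucket/data/part-0000.avro");
    URI viaWrapper = URI.create("wrapperfs://bucket/data/part-0000.avro");

    System.out.println(statusAccepted(listed, listed));
    System.out.println(statusAccepted(listed, viaWrapper));
    System.out.println(lengthOption(1024L));
  }
}
```

   The length carries no path, so a wrapper that rewrites paths cannot 
invalidate it.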
   
   One thing I can add with immediate benefit in 3.3.0 is the original 
fs.s3a.experimental.fadvise option, which again can mandate that the policy be 
adaptive, even on Hive clusters where they explicitly set the read policy to 
random (which some do for max ORC/Parquet performance). The new option 
fs.option.openfile.read.policy is an evolution of that: you can now specify a 
list of policies, and the first one recognised is used. If someone ever 
implemented explicit "parquet", "orc", and "avro" policies, for example, you 
could say the read policy is "orc, random, adaptive" and get the first one known.
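
   The "first recognised policy wins" rule can be sketched in plain Java. This 
is an illustration of the semantics described above, not Hadoop's 
implementation: the class, the method `firstRecognised` and the `KNOWN` set are 
all hypothetical, and a real connector decides for itself which names it knows.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Illustration only; not Hadoop code. */
final class ReadPolicySketch {

  /** Hypothetical set of policy names a connector understands. */
  static final Set<String> KNOWN =
      new LinkedHashSet<>(Arrays.asList("adaptive", "random", "sequential"));

  /** Returns the first entry of the comma-separated list found in known. */
  static Optional<String> firstRecognised(String policyList, Set<String> known) {
    for (String candidate : policyList.split(",")) {
      String policy = candidate.trim().toLowerCase(Locale.ROOT);
      if (known.contains(policy)) {
        return Optional.of(policy);
      }
    }
    // Nothing recognised: the caller falls back to its own default.
    return Optional.empty();
  }

  public static void main(String[] args) {
    // "orc" is not in KNOWN, so the next entry, "random", is chosen.
    System.out.println(firstRecognised("orc, random, adaptive", KNOWN).orElse("(default)"));
  }
}
```

   With this ordering a caller can list the most specific policy first and 
still get a sensible fallback on connectors that don't know it.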





Issue Time Tracking
-------------------

    Worklog Id:     (was: 798337)
    Time Spent: 40m  (was: 0.5h)

> FsInput to use openFile() API for cloud storage read performance
> ----------------------------------------------------------------
>
>                 Key: AVRO-3594
>                 URL: https://issues.apache.org/jira/browse/AVRO-3594
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.11.2
>            Reporter: Steve Loughran
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Avro can now use the FileSystem.openFile() API to open a file on a Hadoop 
> filesystem connector (HADOOP-15229).
> By setting the file length and fadvise policy through opt() calls, 
> clients can
> * skip a HEAD request when opening a file
> * optimise the ranges of GET requests for sequential access, even in clusters 
> where s3a has been configured to use random IO (which some Hive clusters do)
> Filesystems/releases which don't recognise the options added in HADOOP-16202 
> will ignore them; the API will fall back to the classic open(path) call if 
> the connector doesn't have a custom implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
