[jira] [Commented] (HADOOP-19199) Include FileStatus when opening a file from FileSystem

ASF GitHub Bot (Jira) Tue, 10 Dec 2024 06:38:40 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-19199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17904526#comment-17904526
 ]


ASF GitHub Bot commented on HADOOP-19199:
-----------------------------------------

steveloughran commented on PR #6877:
URL: https://github.com/apache/hadoop/pull/6877#issuecomment-2531814999

   FYI, Parquet trunk now uses openFile() with a file status and a declared 
read policy of "parquet, vector, random", so all Hadoop releases >= 3.3.0 will 
at least use random S3 IO; 3.4.0/3.4.1 use vector IO, and 3.4.2 may use 
Parquet-specific code paths.
   
   This will come in parquet 15.1, leaving Avro and ORC as the next targets.
   
   Please grab and test that Parquet beta release to make sure it does what you 
expect; with both S3 and Azure it should save one HEAD request per file.




> Include FileStatus when opening a file from FileSystem
> ------------------------------------------------------
>
>                 Key: HADOOP-19199
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19199
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 3.4.0
>            Reporter: Oliver Caballero Alvarez
>            Priority: Major
>              Labels: pull-request-available
>
> The FileSystem abstract class gives callers who already hold the FileStatus 
> of a file no way to use it when opening that file. Implementations of the 
> open method therefore have to request the FileStatus of the same file again, 
> making unnecessary requests.
> A clear example is the current latest version of the parquet-hadoop 
> implementation:
> https://github.com/apache/parquet-java/blob/apache-parquet-1.14.0/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java
> Although the file had to be queried for its FileStatus in order to create 
> the HadoopInputFile, only the path is passed when the file is opened, because 
> that is all the FileSystem API accepts. The implementation's open function 
> will then most likely check that the file exists, or look up its metadata, 
> and so perform the same operation again to collect the FileStatus.
>  
> This could be resolved simply by taking the current latest version:
>  
> [https://github.com/apache/hadoop/blob/release-3.4.0-RC3/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java]
> and adding the following method:
>  
>   public FSDataInputStream open(FileStatus f) throws IOException {
>       return this.open(f.getPath(),
>           this.getConf().getInt("io.file.buffer.size", 4096));
>   }
>  
> This would be backward compatible with all current FileSystem 
> implementations, and because the method is part of FileSystem itself it could 
> be used whenever the FileStatus is already known.
>  
>  
>  
>  
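The cost described in the issue above can be illustrated with a small toy model. This is only a sketch under stated assumptions: FakeStore, Status, and both helper methods are hypothetical names invented for illustration, and nothing here is the Hadoop FileSystem API or its actual S3/Azure behaviour. It counts metadata lookups when a caller that already holds a status opens a file by path, compared with handing the known status to open.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every class and method name below is hypothetical and
// none of this is the Hadoop FileSystem API.
public class OpenWithStatusSketch {

    // Stand-in for a FileStatus: just a path and a length.
    static final class Status {
        final String path;
        final long length;

        Status(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Fake object store that counts metadata (HEAD-like) requests.
    static final class FakeStore {
        private final Map<String, Long> objects = new HashMap<>();
        private int headRequests = 0;

        void put(String path, long length) {
            objects.put(path, length);
        }

        // Each status lookup costs one metadata request.
        Status getStatus(String path) {
            headRequests++;
            Long length = objects.get(path);
            if (length == null) {
                throw new IllegalArgumentException("No such object: " + path);
            }
            return new Status(path, length);
        }

        // Path-only open: the store has to look the status up again.
        // Returns the length the stream would be opened with.
        long open(String path) {
            return getStatus(path).length;
        }

        // Status-aware open: trusts the caller's status, no extra lookup.
        long open(Status known) {
            return known.length;
        }

        int headRequests() {
            return headRequests;
        }
    }

    // Caller looks up the status, then opens by path.
    public static int headsWhenOpeningByPath() {
        FakeStore store = new FakeStore();
        store.put("data.parquet", 1024L);
        Status status = store.getStatus("data.parquet"); // caller's own lookup
        store.open(status.path);                         // repeated lookup
        return store.headRequests();
    }

    // Caller looks up the status, then passes it to open.
    public static int headsWhenOpeningByStatus() {
        FakeStore store = new FakeStore();
        store.put("data.parquet", 1024L);
        Status status = store.getStatus("data.parquet"); // caller's own lookup
        store.open(status);                              // no further lookup
        return store.headRequests();
    }

    public static void main(String[] args) {
        System.out.println("open(path):   " + headsWhenOpeningByPath()
                + " metadata requests");
        System.out.println("open(status): " + headsWhenOpeningByStatus()
                + " metadata requests");
    }
}
```

In this model the path-only route costs two metadata requests per file and the status-aware route costs one, which is the kind of per-file saving the comment above asks testers to confirm.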



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
