[ 
https://issues.apache.org/jira/browse/HUDI-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244297#comment-17244297
 ] 

Udit Mehrotra edited comment on HUDI-1401 at 12/4/20, 9:47 PM:
---------------------------------------------------------------

At a high level, to support metadata-based file listing with Presto, we will 
have to inject another implementation for listing the files within each 
partition 
[here|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HadoopDirectoryLister.java#L45].
 Presto needs *LocatedFileStatus* here instead of the regular *FileStatus*. 
LocatedFileStatus also stores *blockLocations*, and it seems that Presto uses 
them; if they are not available, it leads to extra RPC calls to the NameNode, 
which is what we tried to avoid with the [PathFilter 
approach|https://prestodb.io/blog/2020/08/04/prestodb-and-hudi#moving-away-from-inputformatgetsplits].
 Given this, I believe that to get this optimization we will also have to store 
block locations in the metadata table and keep them updated. I don't see this 
causing any issues for S3, but for HDFS it might be something to consider. I 
would like to get some thoughts on whether this is a blocker for HDFS.
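
To make the idea concrete, here is a minimal sketch of a listing that is served 
entirely from stored metadata, block locations included. All types below 
(Block, LocatedFile, MetadataTable) are hypothetical stand-ins written for 
illustration; they are not the Hadoop, Hudi, or Presto classes, and the real 
metadata table layout may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these are stand-in types, not Hadoop, Hudi, or Presto classes.
public class MetadataListingSketch {

    /** Stand-in for a block location: one block of a file and the hosts that store it. */
    public static final class Block {
        public final long offset;
        public final long length;
        public final List<String> hosts;

        public Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Stand-in for LocatedFileStatus: file attributes plus block locations. */
    public static final class LocatedFile {
        public final String path;
        public final long length;
        public final List<Block> blocks;

        public LocatedFile(String path, long length, List<Block> blocks) {
            this.path = path;
            this.length = length;
            this.blocks = blocks;
        }
    }

    /** Stand-in for the metadata table, keyed by partition path. */
    public static final class MetadataTable {
        private final Map<String, List<LocatedFile>> filesByPartition = new HashMap<>();

        /** Stores a file together with its block locations under its partition. */
        public void recordFile(String partition, LocatedFile file) {
            filesByPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(file);
        }

        /** Lists a partition from stored metadata only; no file system call is made. */
        public List<LocatedFile> listPartition(String partition) {
            return Collections.unmodifiableList(
                filesByPartition.getOrDefault(partition, Collections.emptyList()));
        }
    }

    public static void main(String[] args) {
        MetadataTable table = new MetadataTable();
        table.recordFile("2020/12/04", new LocatedFile(
            "2020/12/04/file1.parquet", 128L,
            Arrays.asList(new Block(0L, 128L, Arrays.asList("host-a", "host-b")))));

        // The listing already carries block hosts, so no follow-up lookup is needed.
        for (LocatedFile f : table.listPartition("2020/12/04")) {
            System.out.println(f.path + " -> " + f.blocks.get(0).hosts);
        }
    }
}
```

The sketch only records locations at write time. On HDFS, block locations can 
change after a file is written (for example through re-replication), which is 
the part that would need extra care to keep updated.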

Also, based on a recent discussion, we are planning to get rid of the 
PathFilter for Presto and instead use the FileSystemView directly within the 
Presto code to filter for the latest commit files. The reason is that the 
PathFilter is applied to each and every file, and Presto does all of this only 
at the Presto coordinator, which is a major bottleneck based on a recent 
investigation we did for an EMR customer. Ultimately we disabled the PathFilter 
and were able to obtain better performance through the InputFormat getSplits 
approach.
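
As a rough illustration of filtering a whole partition listing in one pass, 
rather than invoking a filter once per file, the sketch below keeps only the 
newest file per file group. It is not the FileSystemView API: the class and 
method names are hypothetical, and it assumes a simplified base file name of 
the form <fileId>_<writeToken>_<instantTime>.parquet with fixed-width instant 
times. It ignores log files and the commit timeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Hudi code: picks the latest file per file group in one pass.
public class LatestFileSketch {

    /** File group id, assuming names look like <fileId>_<writeToken>_<instantTime>.parquet. */
    public static String fileId(String name) {
        return name.substring(0, name.indexOf('_'));
    }

    /** Instant time: the text between the last underscore and the file extension. */
    public static String instantTime(String name) {
        return name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
    }

    /** Returns one file per file group: the one with the greatest instant time. */
    public static List<String> latestFiles(List<String> names) {
        Map<String, String> latestByFileId = new TreeMap<>();
        for (String name : names) {
            // Fixed-width timestamps compare correctly as plain strings.
            latestByFileId.merge(fileId(name), name, (current, candidate) ->
                instantTime(candidate).compareTo(instantTime(current)) > 0 ? candidate : current);
        }
        return new ArrayList<>(latestByFileId.values());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "f1_0-1-1_20201201000000.parquet",
            "f1_0-2-2_20201204000000.parquet",
            "f2_0-1-1_20201201000000.parquet");
        // prints [f1_0-2-2_20201204000000.parquet, f2_0-1-1_20201201000000.parquet]
        System.out.println(latestFiles(listing));
    }
}
```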

cc [~vinoth] [~bhasudha] [~pwason]


> Presto use of Metadata Table for file listings
> ----------------------------------------------
>
>                 Key: HUDI-1401
>                 URL: https://issues.apache.org/jira/browse/HUDI-1401
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Presto Integration
>            Reporter: Vinoth Chandar
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>             Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
