[
https://issues.apache.org/jira/browse/HUDI-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244297#comment-17244297
]
Udit Mehrotra edited comment on HUDI-1401 at 12/4/20, 9:47 PM:
---------------------------------------------------------------
At a high level, to support metadata-based file listing in Presto, we will
have to inject another implementation for listing the files within each
partition
[here|https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/HadoopDirectoryLister.java#L45].
At that point Presto needs *LocatedFileStatus* rather than the regular
*FileStatus*. LocatedFileStatus also stores *blockLocations*, which Presto
appears to use; when they are not available, Presto makes extra RPC calls to
the NameNode, which is what we tried to avoid with the [PathFilter
approach|https://prestodb.io/blog/2020/08/04/prestodb-and-hudi#moving-away-from-inputformatgetsplits].
Given this, I believe that to keep this optimization we will have to store
block locations in the metadata table as well, and keep them up to date. I
don't see this causing any issues for S3, but for HDFS it might be something to
consider. I would like to get some thoughts on whether this is a blocker for HDFS.
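To make the block-location point concrete, here is a rough sketch of a lister
that answers from metadata instead of the file system. It is not Hudi or Presto
code: BlockInfo, MetadataEntry, LocatedFile and MetadataBackedLister are
hypothetical stand-ins (for Hadoop's BlockLocation and LocatedFileStatus and
for a metadata table record), and it assumes block locations were persisted
alongside each file entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Hadoop's BlockLocation (offset, length, hosts).
class BlockInfo {
    final long offset;
    final long length;
    final List<String> hosts;

    BlockInfo(long offset, long length, List<String> hosts) {
        this.offset = offset;
        this.length = length;
        this.hosts = hosts;
    }
}

// Hypothetical metadata table record: file size plus the block locations
// that would have to be stored and kept up to date.
class MetadataEntry {
    final long length;
    final List<BlockInfo> blocks;

    MetadataEntry(long length, List<BlockInfo> blocks) {
        this.length = length;
        this.blocks = blocks;
    }
}

// Hypothetical stand-in for Hadoop's LocatedFileStatus.
class LocatedFile {
    final String path;
    final long length;
    final List<BlockInfo> blocks;

    LocatedFile(String path, long length, List<BlockInfo> blocks) {
        this.path = path;
        this.length = length;
        this.blocks = blocks;
    }
}

class MetadataBackedLister {
    // partition path -> (file name -> metadata entry)
    private final Map<String, Map<String, MetadataEntry>> filesByPartition;

    MetadataBackedLister(Map<String, Map<String, MetadataEntry>> filesByPartition) {
        this.filesByPartition = filesByPartition;
    }

    // Builds located entries purely from stored metadata, so no per-file
    // lookup against the NameNode is needed in this sketch.
    List<LocatedFile> list(String partition) {
        Map<String, MetadataEntry> files = filesByPartition.getOrDefault(
                partition, Collections.<String, MetadataEntry>emptyMap());
        List<LocatedFile> result = new ArrayList<>();
        for (Map.Entry<String, MetadataEntry> e : files.entrySet()) {
            MetadataEntry m = e.getValue();
            result.add(new LocatedFile(partition + "/" + e.getKey(), m.length, m.blocks));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, MetadataEntry> files = new HashMap<>();
        files.put("f1.parquet", new MetadataEntry(128L,
                Arrays.asList(new BlockInfo(0L, 128L, Arrays.asList("host-a", "host-b")))));
        Map<String, Map<String, MetadataEntry>> table = new HashMap<>();
        table.put("2020/12/04", files);

        for (LocatedFile f : new MetadataBackedLister(table).list("2020/12/04")) {
            System.out.println(f.path + " " + f.blocks.get(0).hosts);
        }
    }
}
```

The open question for HDFS is visible here: the stored hosts are only as good
as the last update, so anything that moves blocks would leave them stale.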
Also, based on a recent discussion, we are planning to get rid of the
PathFilter for Presto and instead use the FileSystemView directly within Presto
code to filter for the latest commit files. The reason is that the PathFilter
is applied to each and every file, and Presto does all of this only on the
Presto coordinator, which turned out to be a major bottleneck in a recent
investigation we did for an EMR customer. Ultimately we disabled the PathFilter
and were able to obtain better performance through the InputFormat getSplits
approach.
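As a rough illustration of filtering once per partition rather than once per
file, the sketch below picks the newest base file per file group in a single
pass over a listing. It is a simplification and not Hudi's FileSystemView API:
LatestFileSelector is a hypothetical helper, it assumes base file names of the
form <fileId>_<writeToken>_<instantTime>.parquet with fixed-width instant
times, and it ignores log files and the commit timeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LatestFileSelector {
    // Assumed naming: <fileId>_<writeToken>_<instantTime>.parquet
    static String fileId(String name) {
        return name.substring(0, name.indexOf('_'));
    }

    static String instantTime(String name) {
        return name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
    }

    // One pass over the partition listing: keep the file with the greatest
    // instant time for each file id. Fixed-width timestamps compare
    // correctly as strings.
    static List<String> latestPerFileGroup(List<String> fileNames) {
        Map<String, String> latest = new HashMap<>();
        for (String name : fileNames) {
            String id = fileId(name);
            String current = latest.get(id);
            if (current == null || instantTime(name).compareTo(instantTime(current)) > 0) {
                latest.put(id, name);
            }
        }
        List<String> result = new ArrayList<>(latest.values());
        Collections.sort(result); // deterministic order for callers
        return result;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "f1_0-1-1_20201204100000.parquet",
                "f1_0-2-2_20201204110000.parquet",
                "f2_0-1-1_20201204100000.parquet");
        for (String name : latestPerFileGroup(listing)) {
            System.out.println(name);
        }
    }
}
```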
cc [~vinoth] [~bhasudha] [~pwason]
> Presto use of Metadata Table for file listings
> ----------------------------------------------
>
> Key: HUDI-1401
> URL: https://issues.apache.org/jira/browse/HUDI-1401
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Presto Integration
> Reporter: Vinoth Chandar
> Assignee: Udit Mehrotra
> Priority: Blocker
> Fix For: 0.7.0
>
>