[
https://issues.apache.org/jira/browse/HUDI-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246169#comment-17246169
]
Udit Mehrotra commented on HUDI-1401:
-------------------------------------
[~vinoth] I agree that this is not something that would be feasible to keep
track of and keep updated, and we should explore the path of not setting block
locations when using Hudi's metadata listing feature.
In the case of S3, EmrFS does this:
{code:java}
BlockLocation[] locations = getFileBlockLocations(status, 0, status.getLen());
{code}
This would reuse the default FileSystem implementation at
[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L866]
and create a single block that spans the entire file.
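To make the single-block behaviour concrete, here is a minimal sketch of that fallback logic. It is not the Hadoop source: {{Block}} and {{SingleBlockFallback}} are hypothetical stand-ins (for {{BlockLocation}} and the {{FileSystem}} method) so the snippet compiles without Hadoop on the classpath.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for org.apache.hadoop.fs.BlockLocation.
final class Block {
    final long offset;
    final long length;

    Block(long offset, long length) {
        this.offset = offset;
        this.length = length;
    }
}

// Hypothetical helper illustrating the fallback described above.
final class SingleBlockFallback {
    // Reports the file as one block covering [0, fileLength), regardless of
    // how large the requested range is.
    static List<Block> blocksFor(long fileLength, long start, long len) {
        if (start < 0 || len < 0) {
            throw new IllegalArgumentException("Invalid start or len parameter");
        }
        if (fileLength <= start) {
            // The requested range starts at or past the end of the file.
            return Collections.emptyList();
        }
        return Collections.singletonList(new Block(0, fileLength));
    }
}
```

Calling {{SingleBlockFallback.blocksFor(len, 0, len)}} has the same shape as the EmrFS call above. Under this logic a zero-length file yields no blocks at all, which may matter for callers that require at least one block.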
I did an initial version of the Presto implementation that uses metadata file
listing. Not setting any blocks throws an error at
[https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/util/InternalHiveSplitFactory.java#L218].
Setting that single block works for S3. I need to do more scale testing for
S3, and will also test HDFS with the same approach.
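For illustration, the failure with no blocks is what one would expect from a split factory that validates block coverage before creating splits. The sketch below is a hypothetical check of that kind, not Presto's actual code; {{Range}}, {{BlockCoverage}} and {{check}} are invented names.

```java
import java.util.List;

// Hypothetical byte range standing in for a file block; not a Presto type.
final class Range {
    final long start;
    final long end; // exclusive

    Range(long start, long end) {
        this.start = start;
        this.end = end;
    }
}

final class BlockCoverage {
    // Rejects block lists that cannot back a split over [start, start + length).
    static void check(List<Range> blocks, long start, long length) {
        if (blocks.isEmpty()) {
            throw new IllegalArgumentException("no blocks for split");
        }
        if (blocks.get(0).start != start) {
            throw new IllegalArgumentException("first block does not begin at split start");
        }
        if (blocks.get(blocks.size() - 1).end != start + length) {
            throw new IllegalArgumentException("last block does not end at split end");
        }
        for (int i = 1; i < blocks.size(); i++) {
            if (blocks.get(i).start != blocks.get(i - 1).end) {
                throw new IllegalArgumentException("blocks are not contiguous");
            }
        }
    }
}
```

With a check like this, an empty block list is rejected, while a single block covering the whole file passes for a split over the full file.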
> Presto use of Metadata Table for file listings
> ----------------------------------------------
>
> Key: HUDI-1401
> URL: https://issues.apache.org/jira/browse/HUDI-1401
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Presto Integration
> Reporter: Vinoth Chandar
> Assignee: Udit Mehrotra
> Priority: Blocker
> Fix For: 0.7.0
>
>