[
https://issues.apache.org/jira/browse/HIVE-14165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423120#comment-15423120
]
Abdullah Yousufi commented on HIVE-14165:
-----------------------------------------
It calls FileSystem.java#listStatus(Path p, PathFilter filter). And that's
correct: it verifies that there is at least one FileStatus under the current
path, at which point it begins the logic of determining splits, primarily by
calling InputFormat#getSplits(JobConf job, int numSplits). But
FileInputFormat#getSplits(JobContext job) is going to call listStatus() anyway.
When I remove this listing, I get a 2x speed increase on a 500-partition S3
table. Could FileInputFormat#getSplits(job) be modified to short-circuit and
throw a FileNotFoundException when the path does not exist or contains 0 files,
so that Hive could catch that and continue?
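To make the proposal concrete, here is a minimal sketch of the control flow
I have in mind. It is plain Java with hypothetical stand-in names
(SplitSketch, SplitComputer, inMemory, computeSplits) and a toy in-memory
map in place of S3; it does not use the real Hadoop or Hive classes, and
the actual change would have to live in FileInputFormat and in Hive's
split generation.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these names exist in Hadoop or Hive.
class SplitSketch {

    /** Stand-in for an input format whose single listing also yields the splits. */
    interface SplitComputer {
        List<String> getSplits(String path) throws IOException;
    }

    /** Toy "file system": maps a partition path to the files listed under it. */
    static SplitComputer inMemory(Map<String, List<String>> files) {
        return path -> {
            List<String> listed = files.get(path);
            // Proposed behavior: report a missing or empty path from the one
            // listing that split computation performs anyway.
            if (listed == null || listed.isEmpty()) {
                throw new FileNotFoundException("No input files under " + path);
            }
            List<String> splits = new ArrayList<>();
            for (String file : listed) {
                splits.add(path + "/" + file); // one split per file, for simplicity
            }
            return splits;
        };
    }

    /** Caller side: no up-front listing; skip partitions that report no files. */
    static List<String> computeSplits(List<String> partitionPaths, SplitComputer computer) {
        List<String> all = new ArrayList<>();
        for (String path : partitionPaths) {
            try {
                all.addAll(computer.getSplits(path));
            } catch (FileNotFoundException e) {
                // Missing or empty partition: nothing to read, continue with the next one.
            } catch (IOException e) {
                // Any other I/O failure is still an error.
                throw new UncheckedIOException(e);
            }
        }
        return all;
    }
}
```

With this shape each partition is listed once instead of twice, which is
where I would expect the saving on S3 to come from.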
> Enable faster S3 Split Computation
> ----------------------------------
>
> Key: HIVE-14165
> URL: https://issues.apache.org/jira/browse/HIVE-14165
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Abdullah Yousufi
> Assignee: Abdullah Yousufi
>
> Split size computation may be improved by the optimizations for listFiles()
> in HADOOP-13208