[ https://issues.apache.org/jira/browse/HIVE-22964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051924#comment-17051924 ]
Peter Vary commented on HIVE-22964:
-----------------------------------
Hi [~aditya-shah],
There are multiple places where similar parallelization happens. See, for
example, HIVE-22832.
What do you think about reusing the HIVE_MOVE_FILES_THREAD_COUNT configuration
value for this as well? I know this is not ideal, but I see this config reused
multiple times where we want to parallelize the file access/checks.
Also, if there is an error when accessing one of the files, the original
solution stops immediately, while the new solution will try to access all of
the files. This could be problematic for tables on S3 with a great number of
files. (HIVE-22832 solves this as well.)
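To make the fail-fast point concrete, here is a minimal sketch of parallel file
checks that stop at the first error. It is not the code from HIVE-22832 or from
the attached patch. The class FailFastChecks, the PathCheck interface, and the
checkAll method are hypothetical names, and threadCount stands in for whatever
value HIVE_MOVE_FILES_THREAD_COUNT resolves to in the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailFastChecks {

  /** Hypothetical per-file check; in Hive this would wrap a FileSystem call. */
  public interface PathCheck {
    void check(String path) throws Exception;
  }

  /**
   * Runs the check on every path using a fixed-size pool. On the first
   * failure, the remaining tasks are cancelled and the original exception
   * is rethrown, so we do not keep touching every file after an error.
   */
  public static void checkAll(List<String> paths, int threadCount, PathCheck check)
      throws Exception {
    if (paths.isEmpty()) {
      return;
    }
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threadCount));
    CompletionService<Void> completed = new ExecutorCompletionService<>(pool);
    List<Future<Void>> futures = new ArrayList<>();
    try {
      for (String path : paths) {
        futures.add(completed.submit(() -> {
          check.check(path);
          return null;
        }));
      }
      // Consume results in completion order so a failure is seen as soon
      // as it happens, not only when its turn comes in submission order.
      for (int i = 0; i < futures.size(); i++) {
        try {
          completed.take().get();
        } catch (ExecutionException e) {
          for (Future<Void> f : futures) {
            f.cancel(true); // best effort: queued tasks never start
          }
          Throwable cause = e.getCause();
          throw cause instanceof Exception ? (Exception) cause : e;
        }
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Cancellation is best effort: checks that are already running may still finish,
but tasks still waiting in the queue are never started, which is what matters
when the table has a great number of files.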
Thanks,
Peter
> MM table split computation is very slow
> ---------------------------------------
>
> Key: HIVE-22964
> URL: https://issues.apache.org/jira/browse/HIVE-22964
> Project: Hive
> Issue Type: Improvement
> Reporter: Aditya Shah
> Assignee: Aditya Shah
> Priority: Major
> Attachments: HIVE-22964.patch
>
>
> Since for MM tables we process the paths prior to inputFormat.getSplits(), we
> end up listing the whole table at once. This could be optimized.