[ https://issues.apache.org/jira/browse/HIVE-9153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15763302#comment-15763302 ]
Rui Li commented on HIVE-9153:
------------------------------
I guess no configuration is suitable for all cases :) If I remember correctly, a
smaller "mapreduce.input.fileinputformat.split.maxsize" means more map tasks, which
hurts performance when the data size is relatively large. So increasing it should
help in most cases. Of course, users should adjust it according to their cluster
deployment, executor resources, etc.
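
As a rough illustration of the trade-off, the sketch below estimates how many map tasks a given split.maxsize would produce. It is only back-of-the-envelope arithmetic: the 100 GB input size and the helper name estimated_map_tasks are made up for the example, and the real split count also depends on file layout, block locality and the other split-size settings.

```python
def estimated_map_tasks(total_input_bytes: int, split_max_bytes: int) -> int:
    """Rough estimate: one map task per split of at most split_max_bytes.

    Assumption: the input can be packed into full-size splits. Real split
    computation also considers file boundaries and data locality.
    """
    if split_max_bytes <= 0:
        raise ValueError("split_max_bytes must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-total_input_bytes // split_max_bytes)


MB = 1024 * 1024
GB = 1024 * MB

if __name__ == "__main__":
    total = 100 * GB  # hypothetical input size, not taken from any test run
    for split_mb in (64, 256, 1024):
        tasks = estimated_map_tasks(total, split_mb * MB)
        print(f"split.maxsize={split_mb}MB -> ~{tasks} map tasks")
```

Under these assumptions, going from 64 MB to 1 GB cuts the estimated task count from about 1600 to about 100, which is why a larger value tends to help on big inputs while possibly leaving executors idle on small ones.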
I'm not sure what you mean by performance test JIRAs. We have quite a few JIRAs
to improve performance, and I think each such JIRA involves some simple
performance test to verify the improvement. But I don't remember all of them.
> Perf enhancement on CombineHiveInputFormat and HiveInputFormat
> --------------------------------------------------------------
>
> Key: HIVE-9153
> URL: https://issues.apache.org/jira/browse/HIVE-9153
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Brock Noland
> Assignee: Rui Li
> Fix For: 1.1.0
>
> Attachments: HIVE-9153.1-spark.patch, HIVE-9153.1-spark.patch,
> HIVE-9153.2.patch, HIVE-9153.3.patch, screenshot.PNG
>
>
> The default InputFormat is {{CombineHiveInputFormat}} and thus HOS uses this.
> However, Tez uses {{HiveInputFormat}}. Since tasks are relatively cheap in
> Spark, it might make sense for us to use {{HiveInputFormat}} as well. We
> should evaluate this on a query which has many input splits such as {{select
> count(\*) from store_sales where something is not null}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)