[
https://issues.apache.org/jira/browse/HIVE-17677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Sherman reassigned HIVE-17677:
-------------------------------------
> Investigate using hive statistics information to optimize HoS parallel order
> by
> -------------------------------------------------------------------------------
>
> Key: HIVE-17677
> URL: https://issues.apache.org/jira/browse/HIVE-17677
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 3.0.0
> Reporter: Andrew Sherman
> Assignee: Andrew Sherman
>
> I think Spark's native parallel order by works in a similar way to what we do
> for Hive-on-MR. That is, it scans the RDD once and samples the data to
> determine what ranges the data should be partitioned into, and then scans the
> RDD again to do the actual order by (with multiple reducers).
> One optimization suggested by [~stakiar] is that if we have column stats
> about the col we are ordering by, then the first scan on the RDD is not
> necessary. If we have histogram data about the RDD, we already know what the
> ranges of the order by should be. This should work when running parallel
> order by on simple tables, but it will be harder (although not impossible)
> when we run it on derived datasets.
> To do this we would have to understand more about the internals of
> JavaPairRDD.
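
The idea above can be illustrated with a minimal sketch: if a histogram of the
order-by column is already known, the range boundaries can be computed directly
and the sampling scan skipped. Everything here is an assumption made for
illustration: the (lower, upper, row_count) bucket format, the function names,
and the uniform-within-bucket interpolation are hypothetical, and this is not
Hive's or Spark's actual API.

```python
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical histogram format (an assumption, not Hive's stats schema):
# sorted, contiguous buckets of (lower_bound, upper_bound, row_count).
Bucket = Tuple[float, float, int]


def range_bounds_from_histogram(buckets: List[Bucket],
                                num_partitions: int) -> List[float]:
    """Pick num_partitions - 1 split points so that each range holds
    roughly the same number of rows.

    Assumes values are spread uniformly inside each bucket, so a split
    that falls inside a bucket is linearly interpolated.
    """
    total = sum(count for _, _, count in buckets)
    bounds: List[float] = []
    if total == 0 or num_partitions < 2:
        return bounds

    target = total / num_partitions  # rows wanted per partition
    next_cut = target                # cumulative row count of the next split
    seen = 0.0                       # rows in buckets already passed

    for lower, upper, count in buckets:
        # One bucket may contain several split points if it is large.
        while (count > 0
               and len(bounds) < num_partitions - 1
               and seen + count >= next_cut):
            fraction = (next_cut - seen) / count
            bounds.append(lower + fraction * (upper - lower))
            next_cut += target
        seen += count
    return bounds


def partition_for(value: float, bounds: List[float]) -> int:
    """Return the index of the range partition that a value falls into."""
    return bisect_left(bounds, value)
```

With such boundaries in hand, only the second scan (the actual partitioned
sort) would be needed. For a derived dataset the histogram of the base table
may no longer describe the data, which is why that case is harder.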
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)