[
https://issues.apache.org/jira/browse/HIVE-23031?focusedWorklogId=429073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-429073
]
ASF GitHub Bot logged work on HIVE-23031:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 30/Apr/20 15:07
Start Date: 30/Apr/20 15:07
Worklog Time Spent: 10m
Work Description: jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418081401
##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
"If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
"before executing the main query block. -1 will disable this feature."),
+ HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+ "Enables query rewrites based on approximate functions(sketches)."),
+
+ HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+ true,
+ "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+ HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+ "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+ new StringSet("hll", "cpc", "theta"),
Review comment:
I understand that it will work for a single algorithm. However, consider the
following scenario:
- A user enables BI mode and algorithm `hll`.
- The user creates a MV with count distinct. The MV has stored the count
distinct field using `hll`. The SQL statement still has count distinct.
- We change default algorithm to `cpc` and restart HS2. Thus, when the MV is
loaded by HS2, the count distinct is transformed to `cpc`.
- The user runs a query with count distinct, which transforms to `cpc`,
matches the MV... but fails at deserialization time because the sketch stored
for the MV is `hll`.
That is why I suggested limiting the algorithm options until we have proper
support. The risk of not doing that now is that, if anyone creates MVs under
different default algorithms, we will no longer have any way to tell them
apart.
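To make the scenario above concrete, here is a minimal, self-contained sketch of the failure mode. It does not use any Hive API: `MaterializedView`, `rewrite`, `planMatches`, `canDeserialize`, and the `<algo>_estimate(x)` function name are all hypothetical placeholders chosen for illustration. The only assumption taken from the scenario is that the rewrite depends on the currently configured algorithm while the stored bytes depend on the algorithm in effect when the MV was built.

```java
public class SketchMismatchScenario {

    /** Hypothetical MV record: the SQL text keeps COUNT(DISTINCT); the stored bytes are algorithm-specific. */
    static final class MaterializedView {
        final String definitionSql;
        final String storedSketchAlgorithm;

        MaterializedView(String definitionSql, String storedSketchAlgorithm) {
            this.definitionSql = definitionSql;
            this.storedSketchAlgorithm = storedSketchAlgorithm;
        }
    }

    /** Placeholder rewrite: it depends only on the algorithm configured right now. */
    static String rewrite(String sql, String configuredAlgorithm) {
        return sql.replace("count(distinct x)", configuredAlgorithm + "_estimate(x)");
    }

    /** The MV definition and the query are both rewritten with the current setting, so they match. */
    static boolean planMatches(MaterializedView mv, String querySql, String configuredAlgorithm) {
        return rewrite(mv.definitionSql, configuredAlgorithm)
                .equals(rewrite(querySql, configuredAlgorithm));
    }

    /** The stored sketch is readable only by the algorithm that wrote it. */
    static boolean canDeserialize(MaterializedView mv, String configuredAlgorithm) {
        return mv.storedSketchAlgorithm.equals(configuredAlgorithm);
    }

    public static void main(String[] args) {
        String sql = "select count(distinct x) from t";
        MaterializedView mv = new MaterializedView(sql, "hll"); // MV built while the default was hll
        String afterRestart = "cpc";                            // default changed, HS2 restarted

        System.out.println("plan matches MV: " + planMatches(mv, sql, afterRestart));
        System.out.println("stored sketch readable: " + canDeserialize(mv, afterRestart));
    }
}
```

In this toy model the plan matches while the stored sketch is not readable, which is the combination described in the last step of the scenario. Note that the check in `canDeserialize` is only possible here because the toy MV records its algorithm, which is exactly the information that would be missing.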
Of the two choices that you mention above, I was suggesting the second
option, since the main goal of the whole effort is to be able to use these
algorithms seamlessly with MVs. I agree it can be outside the scope of
this change, but let's limit the algorithm choices until then?
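As a rough illustration of what "limiting the algorithm choices" could look like, the sketch below validates a requested value against a single selectable algorithm. It is standalone Java, not the `HiveConf` validator: `SketchChoiceGuard`, `SELECTABLE`, and `validate` are hypothetical names, and keeping only `hll` is an assumption based on it being the default in the diff above.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

public class SketchChoiceGuard {

    // Assumption: only the default stays selectable until MVs can record their sketch type.
    static final Set<String> SELECTABLE = Collections.singleton("hll");

    /** Returns the normalized value, or throws if the value is not currently selectable. */
    static String validate(String requested) {
        String normalized = requested == null ? "" : requested.trim().toLowerCase(Locale.ROOT);
        if (!SELECTABLE.contains(normalized)) {
            throw new IllegalArgumentException(
                "Unsupported sketch '" + requested + "'; selectable values: " + SELECTABLE);
        }
        return normalized;
    }
}
```

With a single selectable value, every MV created in the meantime is known to hold the same kind of sketch, so widening the set later does not leave ambiguous data behind.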
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 429073)
Time Spent: 3h (was: 2h 50m)
> Add option to enable transparent rewrite of count(distinct) into sketch
> functions
> ---------------------------------------------------------------------------------
>
> Key: HIVE-23031
> URL: https://issues.apache.org/jira/browse/HIVE-23031
> Project: Hive
> Issue Type: Sub-task
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Priority: Major
> Attachments: HIVE-23031.01.patch, HIVE-23031.02.patch,
> HIVE-23031.03.patch, HIVE-23031.03.patch, HIVE-23031.03.patch,
> HIVE-23031.04.patch, HIVE-23031.04.patch
>
> Time Spent: 3h
> Remaining Estimate: 0h
>