[jira] [Created] (HIVE-15785) Add S3 support for druid storage handler
slim bouguerra created HIVE-15785: - Summary: Add S3 support for druid storage handler Key: HIVE-15785 URL: https://issues.apache.org/jira/browse/HIVE-15785 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Fix For: 2.2.0 Add S3 support for druid storage handler -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15809) Typo in the PostgreSQL database name for druid service
slim bouguerra created HIVE-15809: - Summary: Typo in the PostgreSQL database name for druid service Key: HIVE-15809 URL: https://issues.apache.org/jira/browse/HIVE-15809 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra Priority: Trivial Fix For: 2.2.0
[jira] [Created] (HIVE-15727) Add pre insert work to give storage handler the possibility to perform pre insert checking
slim bouguerra created HIVE-15727: - Summary: Add pre insert work to give storage handler the possibility to perform pre insert checking Key: HIVE-15727 URL: https://issues.apache.org/jira/browse/HIVE-15727 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 2.2.0 Add a pre-insert work stage to give the storage handler the possibility to perform pre-insert checking. For instance, for the druid storage handler this will block the INSERT INTO statement.
[jira] [Created] (HIVE-15951) Make sure base persist directory is unique and deleted
slim bouguerra created HIVE-15951: - Summary: Make sure base persist directory is unique and deleted Key: HIVE-15951 URL: https://issues.apache.org/jira/browse/HIVE-15951 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Priority: Critical Fix For: 2.2.0 In some cases the base persist directory will contain old data, or will be shared between reducers on the same physical VM. That will lead to job failures until the directory is cleaned.
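A minimal sketch of the intended fix, assuming the handler can derive a per-task-attempt scratch location (the method names and the `attempt_0` id here are illustrative, not the actual Hive API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class PersistDirs {
    // Create a base persist directory that is unique per task attempt, so two
    // reducers on the same physical VM can never collide or see stale data.
    static Path uniquePersistDir(Path base, String taskAttemptId) throws IOException {
        return Files.createTempDirectory(
            Files.createDirectories(base), "persist-" + taskAttemptId + "-");
    }

    // Delete the directory and its contents once the segment has been pushed.
    static void cleanup(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Paths.get(System.getProperty("java.io.tmpdir"), "druid-persist");
        Path a = uniquePersistDir(base, "attempt_0");
        Path b = uniquePersistDir(base, "attempt_0");
        System.out.println(!a.equals(b)); // prints true: two attempts never share a directory
        cleanup(a);
        cleanup(b);
    }
}
```

`Files.createTempDirectory` guarantees a fresh name under the base even for concurrent callers, which is the property the bug report asks for.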
[jira] [Created] (HIVE-16025) Where IN clause throws exception
slim bouguerra created HIVE-16025: - Summary: Where IN clause throws exception Key: HIVE-16025 URL: https://issues.apache.org/jira/browse/HIVE-16025 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Priority: Critical
{code}
select * from login_druid where userid IN ("user1", "user2");
Exception in thread "main" java.lang.AssertionError: cannot translate filter: IN($1, _UTF-16LE'user1', _UTF-16LE'user2')
	at org.apache.calcite.adapter.druid.DruidQuery$Translator.translateFilter(DruidQuery.java:886)
	at org.apache.calcite.adapter.druid.DruidQuery$Translator.access$000(DruidQuery.java:786)
	at org.apache.calcite.adapter.druid.DruidQuery.getQuery(DruidQuery.java:424)
	at org.apache.calcite.adapter.druid.DruidQuery.deriveQuerySpec(DruidQuery.java:402)
	at org.apache.calcite.adapter.druid.DruidQuery.getQuerySpec(DruidQuery.java:351)
	at org.apache.calcite.adapter.druid.DruidQuery.deriveRowType(DruidQuery.java:271)
	at org.apache.calcite.rel.AbstractRelNode.getRowType(AbstractRelNode.java:219)
	at org.apache.calcite.plan.RelOptUtil.verifyTypeEquivalence(RelOptUtil.java:343)
	at org.apache.calcite.plan.hep.HepRuleCall.transformTo(HepRuleCall.java:57)
	at org.apache.calcite.plan.RelOptRuleCall.transformTo(RelOptRuleCall.java:225)
	at org.apache.calcite.adapter.druid.DruidRules$DruidFilterRule.onMatch(DruidRules.java:142)
	at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:314)
	at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:502)
	at org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:381)
	at org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:247)
	at org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:125)
	at org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:206)
	at org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:193)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.hepPlan(CalcitePlanner.java:1775)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1504)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1260)
	at org.apache.calcite.tools.Frameworks$1.apply(Frameworks.java:113)
	at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:997)
	at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:149)
	at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:106)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1068)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.getOptimizedAST(CalcitePlanner.java:1084)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:363)
	at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11026)
	at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:285)
	at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:511)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
{code}
[jira] [Created] (HIVE-16026) Generated query will timeout and/or kill the druid cluster.
slim bouguerra created HIVE-16026: - Summary: Generated query will timeout and/or kill the druid cluster. Key: HIVE-16026 URL: https://issues.apache.org/jira/browse/HIVE-16026 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Grouping by `__time` and another dimension generates a query with granularity NONE and an interval from 1900 to 3000. This will kill the druid cluster, because the druid group-by strategy will create a cursor for every millisecond, and there are a lot of milliseconds between 1900 and 3000. Hence such a query can be turned into a select, with the group by then done within hive. This should only happen when we don't know the `__time` granularity.
{code}
explain select `__time`, userid from login_druid group by `__time`, userid;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_1]
      Output:["_col0","_col1"]
      TableScan [TS_0]
        Output:["__time","userid"],properties:{"druid.query.json":"{\"queryType\":\"groupBy\",\"dataSource\":\"druid_user_login\",\"granularity\":\"NONE\",\"dimensions\":[\"userid\"],\"limitSpec\":{\"type\":\"default\"},\"aggregations\":[{\"type\":\"longSum\",\"name\":\"dummy_agg\",\"fieldName\":\"dummy_agg\"}],\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"]}","druid.query.type":"groupBy"}
{code}
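To see why granularity NONE over that interval is fatal, a quick back-of-the-envelope count of the milliseconds Druid would have to create cursors for, using the 1900/3000 interval shown in the plan above:

```java
import java.time.Duration;
import java.time.Instant;

public class CursorCount {
    // Milliseconds spanned by the default interval emitted when the time
    // granularity is unknown (see the "intervals" field in the plan above).
    static long defaultIntervalMillis() {
        Instant start = Instant.parse("1900-01-01T00:00:00.000Z");
        Instant end = Instant.parse("3000-01-01T00:00:00.000Z");
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // With granularity NONE, the group-by strategy creates one cursor per
        // millisecond: roughly 3.5e13 cursors for this interval.
        System.out.println(defaultIntervalMillis());
    }
}
```

Tens of trillions of cursors is why the rewrite to a select-plus-Hive-side group-by is proposed whenever the `__time` granularity is unknown.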
[jira] [Created] (HIVE-15877) Upload dependency jars for druid storage handler
slim bouguerra created HIVE-15877: - Summary: Upload dependency jars for druid storage handler Key: HIVE-15877 URL: https://issues.apache.org/jira/browse/HIVE-15877 Project: Hive Issue Type: Bug Reporter: slim bouguerra Upload dependency jars for druid storage handler
[jira] [Created] (HIVE-15277) Teach Hive how to create/delete Druid segments
slim bouguerra created HIVE-15277: - Summary: Teach Hive how to create/delete Druid segments Key: HIVE-15277 URL: https://issues.apache.org/jira/browse/HIVE-15277 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra We want to extend the DruidStorageHandler to support CTAS queries. In this implementation Hive will generate the druid segment files and insert the metadata to signal the handoff to druid. The syntax will be as follows: CREATE TABLE druid_table_1 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "datasourcename") AS ; This statement stores the results of the query in a Druid datasource named 'datasourcename'. One of the columns of the query needs to be the time dimension, which is mandatory in Druid. In particular, we use the same convention that is used for Druid: there needs to be a column named '__time' in the result of the executed query, which will act as the time dimension column in Druid. Currently, the time dimension column needs to be of 'timestamp' type. Metrics can be of type long, double, or float, while dimensions are strings. Keep in mind that druid has a clear separation between dimensions and metrics; therefore, if you have a column in hive that contains numbers and needs to be presented as a dimension, use the cast operator to cast it as string. This initial implementation interacts with the Druid metadata storage to add/remove the table in druid; the user needs to supply the metadata config as --hiveconf hive.druid.metadata.password=XXX --hiveconf hive.druid.metadata.username=druid --hiveconf hive.druid.metadata.uri=jdbc:mysql://host/druid
[jira] [Created] (HIVE-15273) Http Client not configured correctly
slim bouguerra created HIVE-15273: - Summary: Http Client not configured correctly Key: HIVE-15273 URL: https://issues.apache.org/jira/browse/HIVE-15273 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Priority: Minor The http client currently used by the druid-hive record reader is constructed with default values. The defaults for numConnection and ReadTimeout are very small, which can lead to the following exception: "ERROR [2ee34a2b-c8a5-4748-ab91-db3621d2aa5c main] CliDriver: Failed with exception java.io.IOException:java.io.IOException: java.io.IOException: org.apache.hive.druid.org.jboss.netty.channel.ChannelException: Channel disconnected" The full stack can be found here: https://gist.github.com/b-slim/384ca6a96698f5b51ad9b171cff556a2
[jira] [Created] (HIVE-15274) wrong results on the column __time
slim bouguerra created HIVE-15274: - Summary: wrong results on the column __time Key: HIVE-15274 URL: https://issues.apache.org/jira/browse/HIVE-15274 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez Priority: Minor Issuing select * from table will return wrong values for the time column.
Expected results:
__time                                  | dimension1 | metric1
Wed Dec 31 2014 16:00:00 GMT-0800 (PST) | value1     | 1
Wed Dec 31 2014 16:00:00 GMT-0800 (PST) | value1.1   | 1
Sun May 31 2015 19:00:00 GMT-0700 (PDT) | value2     | 20.5
Sun May 31 2015 19:00:00 GMT-0700 (PDT) | value2.1   | 32
Returned result:
2014-12-31 19:00:00 | value1   | 1.0
2014-12-31 19:00:00 | value1.1 | 1.0
2014-12-31 19:00:00 | value2   | 20.5
2014-12-31 19:00:00 | value2.1 | 32.0
[jira] [Created] (HIVE-15393) Update Guava version
slim bouguerra created HIVE-15393: - Summary: Update Guava version Key: HIVE-15393 URL: https://issues.apache.org/jira/browse/HIVE-15393 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Priority: Blocker The Druid code base uses a newer version of Guava (16.0.1) that is not compatible with the version currently used by Hive. FYI, the Hadoop project is moving to Guava 18; not sure if it is better to move to Guava 18 or even 19. https://issues.apache.org/jira/browse/HADOOP-10101
[jira] [Created] (HIVE-15439) Support INSERT OVERWRITE for internal druid datasources.
slim bouguerra created HIVE-15439: - Summary: Support INSERT OVERWRITE for internal druid datasources. Key: HIVE-15439 URL: https://issues.apache.org/jira/browse/HIVE-15439 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Assignee: slim bouguerra Add support for the SQL statement INSERT OVERWRITE TABLE druid_internal_table. In order to add this support we will need to add a new post-insert hook to update the druid metadata. Creation of the segments will be the same as for CTAS.
[jira] [Created] (HIVE-15571) Support Insert into for druid storage handler
slim bouguerra created HIVE-15571: - Summary: Support Insert into for druid storage handler Key: HIVE-15571 URL: https://issues.apache.org/jira/browse/HIVE-15571 Project: Hive Issue Type: New Feature Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra
[jira] [Created] (HIVE-15586) Make Insert and Create statement Transactional
slim bouguerra created HIVE-15586: - Summary: Make Insert and Create statement Transactional Key: HIVE-15586 URL: https://issues.apache.org/jira/browse/HIVE-15586 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Currently insert/create returns the handle to the user without waiting for the data to be loaded by the druid cluster. In order to avoid that, we will add a passive wait until the segments are loaded by the historicals, in case the coordinator is UP.
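The "passive wait" can be sketched as a simple poll-with-timeout against the coordinator's segment load status. The predicate, timeout, and poll interval below are illustrative stand-ins, not the actual Hive configuration or coordinator API:

```java
import java.util.function.BooleanSupplier;

public class PassiveWait {
    // Poll until the coordinator reports the segments as loaded by the
    // historicals, or give up after timeoutMillis. Returns true if loaded.
    static boolean waitForSegments(BooleanSupplier segmentsLoaded,
                                   long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (segmentsLoaded.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        return segmentsLoaded.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a call to the coordinator's load-status endpoint:
        // here the "segments" become available on the third poll.
        final int[] calls = {0};
        boolean loaded = waitForSegments(() -> ++calls[0] >= 3, 5_000, 10);
        System.out.println(loaded); // prints true
    }
}
```

The wait is "passive" in that the handler only sleeps and re-checks; if the coordinator is down the timeout expires and the statement still returns, as described above.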
[jira] [Created] (HIVE-16210) Use jvm temporary tmp dir by default
slim bouguerra created HIVE-16210: - Summary: Use jvm temporary tmp dir by default Key: HIVE-16210 URL: https://issues.apache.org/jira/browse/HIVE-16210 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Instead of using "/tmp" by default, it makes more sense to use the jvm default tmp dir. This can have dramatic consequences if the indexed files are huge. For instance, applications run in containers can be provisioned with a dedicated tmp dir.
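The change amounts to resolving the scratch location from the JVM's `java.io.tmpdir` property instead of a hard-coded "/tmp", so a container-provisioned tmp dir (set via `-Djava.io.tmpdir=...`) is picked up automatically. A minimal sketch, with an illustrative subdirectory name:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TmpDir {
    // Resolve the working directory under the JVM default tmp dir rather than
    // a hard-coded "/tmp"; containers can redirect it with -Djava.io.tmpdir.
    static Path defaultWorkingDir() {
        return Paths.get(System.getProperty("java.io.tmpdir"), "druid-indexing");
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createDirectories(defaultWorkingDir());
        System.out.println(dir.startsWith(System.getProperty("java.io.tmpdir"))); // prints true
    }
}
```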
[jira] [Created] (HIVE-16371) Add bitmap selection strategy for druid storage handler
slim bouguerra created HIVE-16371: - Summary: Add bitmap selection strategy for druid storage handler Key: HIVE-16371 URL: https://issues.apache.org/jira/browse/HIVE-16371 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Currently only the Concise bitmap strategy is supported. This PR is to make Roaring bitmap encoding the default and keep Concise optional if needed.
[jira] [Created] (HIVE-16404) Renaming of public classes in Calcite 12 breaking druid integration
slim bouguerra created HIVE-16404: - Summary: Renaming of public classes in Calcite 12 breaking druid integration Key: HIVE-16404 URL: https://issues.apache.org/jira/browse/HIVE-16404 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 2.2.0 Reporter: slim bouguerra Fix For: 3.0.0 Changes to names in the druid rules are backward incompatible with the current implementation. https://github.com/apache/calcite/commit/a89c62cd6d6cc181c90881afa0bf099746739a91
[jira] [Created] (HIVE-16482) Druid Ser/De needs to use dimension output name in order to work with extraction functions
slim bouguerra created HIVE-16482: - Summary: Druid Ser/De needs to use dimension output name in order to work with extraction functions Key: HIVE-16482 URL: https://issues.apache.org/jira/browse/HIVE-16482 Project: Hive Issue Type: Bug Reporter: slim bouguerra The Druid Ser/De needs to use the dimension output name in order to work with extraction functions. Some parts of the Ser/De code use the method {code}DimensionSpec.getDimension(){code} but when extraction functions are in play, the name of the dimension is defined by {code}DimensionSpec.getOutputName(){code}
[jira] [Created] (HIVE-16149) Druid query path fails when using LLAP mode
slim bouguerra created HIVE-16149: - Summary: Druid query path fails when using LLAP mode Key: HIVE-16149 URL: https://issues.apache.org/jira/browse/HIVE-16149 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: Ashutosh Chauhan
{code}
hive> select i_item_desc, i_category, i_class, i_current_price, i_item_id,
      sum(ss_ext_sales_price) as itemrevenue,
      sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over (partition by i_class) as revenueratio
      from tpcds_store_sales_sold_time_1000_day_all
      where (i_category = 'Jewelry' or i_category = 'Sports' or i_category = 'Books')
        and `__time` >= cast('2001-01-12' as date) and `__time` <= cast('2001-02-11' as date)
      group by i_item_id, i_item_desc, i_category, i_class, i_current_price
      order by i_category, i_class, i_item_id, i_item_desc, revenueratio limit 10;
Query ID = sbouguerra_20170308131436_225330b7-1142-4e4e-a05a-46ef544c8ee8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1488231257387_1862)

VERTICES   MODE  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1      llap  INITED      1          0        0        1       0       0
Reducer 2  llap  INITED      2          0        0        2       0       0
Reducer 3  llap  INITED      1          0        0        1       0       0
VERTICES: 00/03 [>>--] 0% ELAPSED TIME: 59.68 s

Status: Failed
Dag received [DAG_TERMINATE, SERVICE_PLUGIN_ERROR] in RUNNING state.
Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are running
Vertex killed, vertexName=Reducer 3, vertexId=vertex_1488231257387_1862_3_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1488231257387_1862_3_02 [Reducer 3] killed/failed due to:DAG_TERMINATED]
Vertex killed, vertexName=Reducer 2, vertexId=vertex_1488231257387_1862_3_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:2, Vertex vertex_1488231257387_1862_3_01 [Reducer 2] killed/failed due to:DAG_TERMINATED]
Vertex killed, vertexName=Map 1, vertexId=vertex_1488231257387_1862_3_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1488231257387_1862_3_00 [Map 1] killed/failed due to:DAG_TERMINATED]
DAG did not succeed due to SERVICE_PLUGIN_ERROR. failedVertices:0 killedVertices:3
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
{code}
[jira] [Created] (HIVE-16124) Drop the segments data as soon as it is pushed to HDFS
slim bouguerra created HIVE-16124: - Summary: Drop the segments data as soon as it is pushed to HDFS Key: HIVE-16124 URL: https://issues.apache.org/jira/browse/HIVE-16124 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Drop the pushed segments from the indexer as soon as the HDFS push is done.
[jira] [Created] (HIVE-16123) Let user choose the granularity of bucketing.
slim bouguerra created HIVE-16123: - Summary: Let user choose the granularity of bucketing. Key: HIVE-16123 URL: https://issues.apache.org/jira/browse/HIVE-16123 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Currently we index the data with granularity NONE, which puts a lot of pressure on the indexer.
[jira] [Created] (HIVE-16122) NPE Hive Druid split introduced by HIVE-15928
slim bouguerra created HIVE-16122: - Summary: NPE Hive Druid split introduced by HIVE-15928 Key: HIVE-16122 URL: https://issues.apache.org/jira/browse/HIVE-16122 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra
[jira] [Created] (HIVE-16126) push all the time extraction to druid
slim bouguerra created HIVE-16126: - Summary: push all the time extraction to druid Key: HIVE-16126 URL: https://issues.apache.org/jira/browse/HIVE-16126 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Currently we don't push most of the time extractions to druid, which leads to selecting all the data.
[jira] [Created] (HIVE-16125) Split work between reducers.
slim bouguerra created HIVE-16125: - Summary: Split work between reducers. Key: HIVE-16125 URL: https://issues.apache.org/jira/browse/HIVE-16125 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Split work between reducers. Currently we have one reducer per segment granularity, even if the interval will be partitioned over multiple partitions.
[jira] [Created] (HIVE-16095) Filter generation is not taking into account the column type.
slim bouguerra created HIVE-16095: - Summary: Filter generation is not taking into account the column type. Key: HIVE-16095 URL: https://issues.apache.org/jira/browse/HIVE-16095 Project: Hive Issue Type: Bug Reporter: slim bouguerra We are supposed to get an alphanumeric comparison when we have a cast to a numeric type; note that both plans below emit the same filter with "alphaNumeric":false. This looks to be a calcite issue.
{code}
hive> explain select * from login_druid where userid < 2;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_1]
      Output:["_col0","_col1","_col2"]
      TableScan [TS_0]
        Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}
Time taken: 1.548 seconds, Fetched: 10 row(s)

hive> explain select * from login_druid where cast (userid as int) < 2;
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_1]
      Output:["_col0","_col1","_col2"]
      TableScan [TS_0]
        Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"filter\":{\"type\":\"bound\",\"dimension\":\"userid\",\"upper\":\"2\",\"upperStrict\":true,\"alphaNumeric\":false},\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}
Time taken: 0.27 seconds, Fetched: 10 row(s)
{code}
[jira] [Created] (HIVE-16096) Predicate `__time` In ("date", "date") is not pushed
slim bouguerra created HIVE-16096: - Summary: Predicate `__time` In ("date", "date") is not pushed Key: HIVE-16096 URL: https://issues.apache.org/jira/browse/HIVE-16096 Project: Hive Issue Type: Bug Reporter: slim bouguerra
{code}
explain select * from login_druid where `__time` in ("2003-1-1", "2004-1-1");
OK
Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1","_col2"]
      Filter Operator [FIL_4]
        predicate:(__time) IN ('2003-1-1', '2004-1-1')
        TableScan [TS_0]
          Output:["__time","userid","num_l"],properties:{"druid.query.json":"{\"queryType\":\"select\",\"dataSource\":\"druid_user_login\",\"descending\":false,\"intervals\":[\"1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z\"],\"dimensions\":[\"userid\"],\"metrics\":[\"num_l\"],\"granularity\":\"all\",\"pagingSpec\":{\"threshold\":16384},\"context\":{\"druid.query.fetch\":false}}","druid.query.type":"select"}
{code}
[jira] [Created] (HIVE-16519) Fix exception thrown by checkOutputSpecs
slim bouguerra created HIVE-16519: - Summary: Fix exception thrown by checkOutputSpecs Key: HIVE-16519 URL: https://issues.apache.org/jira/browse/HIVE-16519 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Do not throw an exception from checkOutputSpecs.
[jira] [Created] (HIVE-17302) ReduceRecordSource should not add batch string to Exception message
slim bouguerra created HIVE-17302: - Summary: ReduceRecordSource should not add batch string to Exception message Key: HIVE-17302 URL: https://issues.apache.org/jira/browse/HIVE-17302 Project: Hive Issue Type: Bug Reporter: slim bouguerra ReduceRecordSource is adding the batch data as a string to the exception message; this can lead to an OOM of the query AM when the query fails due to some other issue.
[jira] [Created] (HIVE-17303) Mismatch between roaring bitmap library used by druid and the one coming from tez
slim bouguerra created HIVE-17303: - Summary: Mismatch between roaring bitmap library used by druid and the one coming from tez Key: HIVE-17303 URL: https://issues.apache.org/jira/browse/HIVE-17303 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra
{code}
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.roaringbitmap.buffer.MutableRoaringBitmap.runOptimize()Z
	at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
	at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
	at org.apache.hive.druid.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
	at org.apache.hadoop.hive.druid.io.DruidRecordWriter.pushSegments(DruidRecordWriter.java:165)
	... 25 more
Caused by: java.lang.NoSuchMethodError: org.roaringbitmap.buffer.MutableRoaringBitmap.runOptimize()Z
	at org.apache.hive.druid.com.metamx.collections.bitmap.WrappedRoaringBitmap.toImmutableBitmap(WrappedRoaringBitmap.java:65)
	at org.apache.hive.druid.com.metamx.collections.bitmap.RoaringBitmapFactory.makeImmutableBitmap(RoaringBitmapFactory.java:88)
	at org.apache.hive.druid.io.druid.segment.StringDimensionMergerV9.writeIndexes(StringDimensionMergerV9.java:348)
	at org.apache.hive.druid.io.druid.segment.IndexMergerV9.makeIndexFiles(IndexMergerV9.java:218)
	at org.apache.hive.druid.io.druid.segment.IndexMerger.merge(IndexMerger.java:438)
	at org.apache.hive.druid.io.druid.segment.IndexMerger.persist(IndexMerger.java:186)
	at org.apache.hive.druid.io.druid.segment.IndexMerger.persist(IndexMerger.java:152)
	at org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl.persistHydrant(AppenderatorImpl.java:996)
	at org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl.access$200(AppenderatorImpl.java:93)
	at org.apache.hive.druid.io.druid.segment.realtime.appenderator.AppenderatorImpl$2.doCall(AppenderatorImpl.java:385)
	at org.apache.hive.druid.io.druid.common.guava.ThreadRenamingCallable.call(ThreadRenamingCallable.java:44)
	... 4 more
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:89, Vertex vertex_1502470020457_0005_12_05 [Reducer 2] killed/failed due to:OWN_TASK_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
{code}
[jira] [Created] (HIVE-17160) Adding kerberos Authorization to the Druid hive integration
slim bouguerra created HIVE-17160: - Summary: Adding kerberos Authorization to the Druid hive integration Key: HIVE-17160 URL: https://issues.apache.org/jira/browse/HIVE-17160 Project: Hive Issue Type: New Feature Components: Druid integration Reporter: slim bouguerra The goal of this feature is to allow hive to query a secured druid cluster using kerberos credentials.
[jira] [Created] (HIVE-16522) Hive query timer is not keeping track of the fetch task execution
slim bouguerra created HIVE-16522: - Summary: Hive query timer is not keeping track of the fetch task execution Key: HIVE-16522 URL: https://issues.apache.org/jira/browse/HIVE-16522 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Currently the Hive CLI query execution time does not include the fetch task execution time.
[jira] [Created] (HIVE-17372) update druid dependency to druid 0.10.1
slim bouguerra created HIVE-17372: - Summary: update druid dependency to druid 0.10.1 Key: HIVE-17372 URL: https://issues.apache.org/jira/browse/HIVE-17372 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Update to the most recent druid version, to be released August 23.
[jira] [Created] (HIVE-16816) Chained Group by support for druid.
slim bouguerra created HIVE-16816: - Summary: Chained Group by support for druid. Key: HIVE-16816 URL: https://issues.apache.org/jira/browse/HIVE-16816 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra This is more likely to be a calcite enhancement, but am logging it here to track it anyway. Currently a query like {code} select count (distinct dim) from table {code} is pushed partially to druid as a group by on dim, followed by a count executed by the hive QE. This can be enhanced by using a nested (eg chained execution) group by query, such that the first (inner) GB query does the group by on the key and the second (outer) does the count.
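The proposed rewrite is the classic count-distinct decomposition: an inner group-by on the key, then an outer count over the resulting groups. A small Java sketch (illustrative data, not Hive code) showing the two forms are equivalent:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChainedGroupBy {
    // Inner group-by on the key, then an outer count over the groups --
    // the shape of the nested Druid groupBy query proposed above.
    static long countDistinctChained(List<String> dim) {
        return dim.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
            .size();
    }

    public static void main(String[] args) {
        List<String> dim = List.of("a", "b", "a", "c", "b");
        long direct = dim.stream().distinct().count(); // count(distinct dim)
        System.out.println(direct == countDistinctChained(dim)); // prints true (both 3)
    }
}
```

Pushing both stages to Druid as a chained groupBy avoids shipping every distinct key back to Hive just to count them.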
[jira] [Created] (HIVE-16588) Resource leak by druid http client
slim bouguerra created HIVE-16588: - Summary: Resource leak by druid http client Key: HIVE-16588 URL: https://issues.apache.org/jira/browse/HIVE-16588 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Fix For: 3.0.0 The current implementation of the Druid storage handler leaks some resources if the creation of the HTTP client fails due to a too-many-open-files exception. The reason for the leak is that the cleaning hook is registered after the client starts. To fix this, we will extract the creation of the HTTP client and make it static and reusable, instead of creating a client per query. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
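The ordering fix described above (register the cleanup hook before starting the client, and share one client across queries) can be sketched as follows. The client class and helper names are illustrative, not Hive's actual code:

```python
# Sketch of the fix: one process-wide client, with the shutdown hook
# registered *before* the client starts, so a failed start (e.g. "too many
# open files") is still cleaned up. All names here are hypothetical.
import atexit

class FakeHttpClient:
    def __init__(self):
        self.started = False
        self.closed = False
    def start(self):
        self.started = True
    def close(self):
        self.closed = True

_shared_client = None

def get_http_client():
    global _shared_client
    if _shared_client is None:
        client = FakeHttpClient()
        atexit.register(client.close)  # cleanup hook registered first...
        client.start()                 # ...so nothing leaks if start() throws
        _shared_client = client        # reused across queries, not per-query
    return _shared_client

assert get_http_client() is get_http_client()  # one shared instance
```

The design choice mirrors the ticket: making the client static removes the per-query create/destroy cycle that made the hook ordering matter in the first place.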
[jira] [Created] (HIVE-17581) Replace some calcite dependencies with native ones
slim bouguerra created HIVE-17581: - Summary: Replace some calcite dependencies with native ones Key: HIVE-17581 URL: https://issues.apache.org/jira/browse/HIVE-17581 Project: Hive Issue Type: Sub-task Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: slim bouguerra This is a followup of HIVE-17468. This patch excludes some unwanted druid-calcite dependencies. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17582) Followup of HIVE-15708
slim bouguerra created HIVE-17582: - Summary: Followup of HIVE-15708 Key: HIVE-17582 URL: https://issues.apache.org/jira/browse/HIVE-17582 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: slim bouguerra The HIVE-15708 commit be59e024420ed5ca970e87a6dec402fecee21f06 introduced a bug: it replaced the following code at org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat#169 {code} builder.intervals(Arrays.asList(DruidTable.DEFAULT_INTERVAL)); {code} with {code} final List intervals = Arrays.asList(); builder.intervals(intervals); {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17468) Shade and package appropriate jackson version for druid storage handler
slim bouguerra created HIVE-17468: - Summary: Shade and package appropriate jackson version for druid storage handler Key: HIVE-17468 URL: https://issues.apache.org/jira/browse/HIVE-17468 Project: Hive Issue Type: Bug Reporter: slim bouguerra Fix For: 3.0.0 Currently we exclude all the jackson core dependencies coming from druid. This is wrong in my opinion, since it leads to packaging an unwanted jackson library from other projects. As you can see in the file hive-druid-deps.txt, jackson core currently comes from calcite and its version is 2.6.3, which is very different from the 2.4.6 used by druid. This patch excludes the unwanted jars and makes sure to bring in the jackson dependency from druid itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17523) Insert into druid table hangs HiveServer2 in an infinite loop
slim bouguerra created HIVE-17523: - Summary: Insert into druid table hangs HiveServer2 in an infinite loop Key: HIVE-17523 URL: https://issues.apache.org/jira/browse/HIVE-17523 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Inserting data via INSERT INTO a table backed by Druid can lead to a Hive server hang. This is due to a bug in the naming of Druid segment partitions. To reproduce the issue: {code}
drop table login_hive;
create table login_hive(`timecolumn` timestamp, `userid` string, `num_l` double);
insert into login_hive values ('2015-01-01 00:00:00', 'user1', 5);
insert into login_hive values ('2015-01-01 01:00:00', 'user2', 4);
insert into login_hive values ('2015-01-01 02:00:00', 'user3', 2);
insert into login_hive values ('2015-01-02 00:00:00', 'user1', 1);
insert into login_hive values ('2015-01-02 01:00:00', 'user2', 2);
insert into login_hive values ('2015-01-02 02:00:00', 'user3', 8);
insert into login_hive values ('2015-01-03 00:00:00', 'user1', 5);
insert into login_hive values ('2015-01-03 01:00:00', 'user2', 9);
insert into login_hive values ('2015-01-03 04:00:00', 'user3', 2);
insert into login_hive values ('2015-03-09 00:00:00', 'user3', 5);
insert into login_hive values ('2015-03-09 01:00:00', 'user1', 0);
insert into login_hive values ('2015-03-09 05:00:00', 'user2', 0);
drop table login_druid;
CREATE TABLE login_druid STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.datasource" = "druid_login_test_tmp", "druid.segment.granularity" = "DAY", "druid.query.granularity" = "HOUR") AS select `timecolumn` as `__time`, `userid`, `num_l` FROM login_hive;
select * FROM login_druid;
insert into login_druid values ('2015-03-09 05:00:00', 'user4', 0);
{code} This patch unifies the pushing and segment-naming logic by using the Druid data segment pusher as much as possible. It also includes some minor code refactoring and test enhancements.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17653) Druid storage handler CTAS with boolean type columns fails.
slim bouguerra created HIVE-17653: - Summary: Druid storage handler CTAS with boolean type columns fails. Key: HIVE-17653 URL: https://issues.apache.org/jira/browse/HIVE-17653 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Fix For: 3.0.0 A Druid storage handler CTAS fails with the exception below when a Boolean column is included. A simple workaround is to add a cast to string over the boolean column; this will index the column as a Druid dimension with value `true` or `false`. {code} ERROR : Status: Failed ERROR : Vertex failed, vertexName=Reducer 3, vertexId=vertex_1506230948023_0005_9_02, diagnostics=[Task failed, taskId=task_1506230948023_0005_9_02_03, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1506230948023_0005_9_02_03_0:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:218) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:172) at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:110) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:406) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:248) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:319) at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:189) ... 15 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing vector batch (tag=0) (vectorizedVertexNum 2) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:492) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecordVector(ReduceRecordSource.java:397) ... 
18 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Dimension bo does not have STRING type: BOOLEAN at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:564) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:664) at org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:101) at org.apache.hadoop.hive.ql.exec.Operator.baseForward(Operator.java:955) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:903) at org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:145) at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:479) ... 19 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Dimension bo does not have STRING type: BOOLEAN at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:272) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:609) at
[jira] [Created] (HIVE-17623) Fix Select query, fix Double column serde, and some refactoring
slim bouguerra created HIVE-17623: - Summary: Fix Select query, fix Double column serde, and some refactoring Key: HIVE-17623 URL: https://issues.apache.org/jira/browse/HIVE-17623 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra This PR has two fixes. First, it fixes the limit on results returned by the Select query, which used to be capped at 16K rows. Second, it fixes the type inference for the double type newly added to Druid, using Jackson polymorphism to infer types and parse results from Druid nodes. It also removes duplicate code from the RecordReaders. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17627) Use druid scan query instead of the select query.
slim bouguerra created HIVE-17627: - Summary: Use druid scan query instead of the select query. Key: HIVE-17627 URL: https://issues.apache.org/jira/browse/HIVE-17627 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra The biggest difference between the select query and the scan query is that the scan query doesn't retain all rows in memory before rows can be returned to the client. The select query causes memory pressure when too many rows are required; the scan query doesn't have this issue. The scan query can also return all rows without issuing another pagination query, which is extremely useful when querying a historical or realtime node directly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
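The contrast above can be made concrete with the rough JSON shapes of the two query types. The field names follow Druid's documented query API, but treat the exact values as illustrative:

```python
# Rough JSON shapes of the two Druid query types discussed above
# (abbreviated; see Druid's query documentation for the full schema).

select_query = {
    "queryType": "select",
    "dataSource": "druid_table",
    "intervals": ["1900-01-01/3000-01-01"],
    # select pages through results and materializes each page in memory:
    "pagingSpec": {"pagingIdentifiers": {}, "threshold": 16384},
}

scan_query = {
    "queryType": "scan",
    "dataSource": "druid_table",
    "intervals": ["1900-01-01/3000-01-01"],
    # scan streams rows in batches, so no pagination round-trips are needed:
    "batchSize": 20480,
    "resultFormat": "compactedList",
}

# The operational difference the ticket leans on: only select needs paging.
assert "pagingSpec" in select_query and "pagingSpec" not in scan_query
```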
[jira] [Created] (HIVE-18156) Provide smooth migration path for CTAS when time column is not with timezone
slim bouguerra created HIVE-18156: - Summary: Provide smooth migration path for CTAS when time column is not with timezone Key: HIVE-18156 URL: https://issues.apache.org/jira/browse/HIVE-18156 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez Currently, the default recommended CTAS and most legacy documentation do not specify that the __time column needs to be of a type with a timezone. Thus the CTAS will fail with {code} 2017-11-27T17:13:10,241 ERROR [e5f708c8-df4e-41a4-b8a1-d18ac13123d2 main] ql.Driver: FAILED: SemanticException No column with timestamp with local time-zone type on query result; one column should be of timestamp with local time-zone type org.apache.hadoop.hive.ql.parse.SemanticException: No column with timestamp with local time-zone type on query result; one column should be of timestamp with local time-zone type at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer$SortedDynamicPartitionProc.getGranularitySelOp(SortedDynPartitionTimeGranularityOptimizer.java:242) at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer$SortedDynamicPartitionProc.process(SortedDynPartitionTimeGranularityOptimizer.java:163) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:158) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.optimizer.SortedDynPartitionTimeGranularityOptimizer.transform(SortedDynPartitionTimeGranularityOptimizer.java:103) at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:250) at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11683) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:298) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:592) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1457) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1589) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1356) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1346) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:187) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:342) at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1300) at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1274) at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:173) at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) at org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver(TestMiniDruidCliDriver.java:59) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at 
org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:92) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at
[jira] [Created] (HIVE-18196) Druid Mini Cluster to run Qtest integration tests.
slim bouguerra created HIVE-18196: - Summary: Druid Mini Cluster to run Qtest integration tests. Key: HIVE-18196 URL: https://issues.apache.org/jira/browse/HIVE-18196 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: Ashutosh Chauhan The overall goal of this is to add a new module that can fork a Druid cluster to run integration testing as part of the Mini Clusters Qtest suite. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-18197) Fix issue with wrong segments identifier usage.
slim bouguerra created HIVE-18197: - Summary: Fix issue with wrong segments identifier usage. Key: HIVE-18197 URL: https://issues.apache.org/jira/browse/HIVE-18197 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra We have two different issues that can make the checking of load status fail for Druid segments. Both are due to usage of the wrong segment identifier at a couple of locations. # We are constructing the segment identifier with the UTC timezone, which can be wrong if the segments were built in a different timezone. The way to fix this is to use the segment identifier itself instead of re-making it on the client side. # We are using outdated segment identifiers for the INSERT INTO case. The way to fix this is to use the segment metadata produced by the metadata commit phase. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
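The first issue above (re-making the identifier in UTC on the client side) can be sketched like this. The identifier format is simplified for illustration and is not Druid's exact scheme:

```python
# Why re-making the segment identifier client-side can mismatch: segment ids
# embed interval timestamps, so the same instant rendered in UTC vs. the
# cluster's timezone produces a different id string. Simplified id format.
from datetime import datetime, timezone, timedelta

def segment_id(datasource, start, version):
    # Hypothetical id scheme: "<datasource>_<interval start>_<version>"
    return f"{datasource}_{start.isoformat()}_{version}"

tz_pst = timezone(timedelta(hours=-8))          # assumed cluster timezone
start_local = datetime(2017, 12, 1, 0, 0, tzinfo=tz_pst)

id_built_by_druid = segment_id("ds", start_local, "v1")
id_rebuilt_in_utc = segment_id("ds", start_local.astimezone(timezone.utc), "v1")

# Same instant, different identifier strings -- hence the fix: use the
# identifier Druid produced instead of reconstructing it.
assert id_built_by_druid != id_rebuilt_in_utc
```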
[jira] [Created] (HIVE-18226) handle UDF to double/int over aggregate
slim bouguerra created HIVE-18226: - Summary: handle UDF to double/int over aggregate Key: HIVE-18226 URL: https://issues.apache.org/jira/browse/HIVE-18226 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra In cases like the following query, the Hive planner adds an extra UDFToDouble over integer columns. This kind of UDF can be pushed to Druid as a doubleSum instead of a longSum, and vice versa. {code} PREHOOK: query: EXPLAIN SELECT floor_year(`__time`), SUM(ctinyint)/ count(*) FROM druid_table GROUP BY floor_year(`__time`) PREHOOK: type: QUERY POSTHOOK: query: EXPLAIN SELECT floor_year(`__time`), SUM(ctinyint)/ count(*) FROM druid_table GROUP BY floor_year(`__time`) POSTHOOK: type: QUERY STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: druid_table properties: druid.query.json {"queryType":"timeseries","dataSource":"default.druid_table","descending":false,"granularity":"year","aggregations":[{"type":"longSum","name":"$f1","fieldName":"ctinyint"},{"type":"count","name":"$f2"}],"intervals":["1900-01-01T00:00:00.000/3000-01-01T00:00:00.000"],"context":{"skipEmptyBuckets":true}} druid.query.type timeseries Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: __time (type: timestamp with local time zone), (UDFToDouble($f1) / UDFToDouble($f2)) (type: double) outputColumnNames: _col0, _col1 Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 9173 Data size: 0 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} -- This message was sent by 
Atlassian JIRA (v6.4.14#64029)
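The aggregator swap suggested above can be sketched as a small rewrite over the aggregator JSON shown in the plan. The rewrite function itself is hypothetical, not Hive's actual code:

```python
# Sketch of the proposed push-down: when the plan computes
# UDFToDouble(longSum(col)), emit a doubleSum aggregator instead, so the
# cast happens inside Druid. The aggregator dict mirrors the plan's JSON;
# the rewrite helper is illustrative only.

def push_cast_into_aggregator(agg, cast_to):
    swap = {
        ("longSum", "double"): "doubleSum",
        ("doubleSum", "long"): "longSum",
    }
    new_type = swap.get((agg["type"], cast_to))
    if new_type is None:
        return agg  # not a cast we can push down; leave the plan as-is
    return {**agg, "type": new_type}

agg = {"type": "longSum", "name": "$f1", "fieldName": "ctinyint"}
assert push_cast_into_aggregator(agg, "double")["type"] == "doubleSum"
```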
[jira] [Created] (HIVE-18254) Use proper AVG Calcite primitive instead of Other_FUNCTION
slim bouguerra created HIVE-18254: - Summary: Use proper AVG Calcite primitive instead of Other_FUNCTION Key: HIVE-18254 URL: https://issues.apache.org/jira/browse/HIVE-18254 Project: Hive Issue Type: Bug Reporter: slim bouguerra Currently the Hive-Calcite operator tree treats the AVG function as an unknown function with a Calcite SqlKind of OTHER_FUNCTION. This is an issue that can get in the way of rules like {{org.apache.calcite.rel.rules.AggregateReduceFunctionsRule}}. This patch adds the avg function to the list of known aggregate functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17871) Add non nullability flag to druid time column
slim bouguerra created HIVE-17871: - Summary: Add non nullability flag to druid time column Key: HIVE-17871 URL: https://issues.apache.org/jira/browse/HIVE-17871 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra The Druid time column is never null. Adding the non-nullability flag will enable extra Calcite goodness, like transforming {code} select count(`__time`) from table {code} to {code} select count(*) from table {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-19443) Issue with Druid timestamp with timezone handling
slim bouguerra created HIVE-19443: - Summary: Issue with Druid timestamp with timezone handling Key: HIVE-19443 URL: https://issues.apache.org/jira/browse/HIVE-19443 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Attachments: test_resutls.out, test_timestamp.q As you can see in the attached file [^test_resutls.out], when the current timezone is switched to UTC, inserting values from a Hive table into a Druid table misses some rows. You can use [^test_timestamp.q] to reproduce it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19474) Decimal type should be cast as part of the CTAS or INSERT Clause.
slim bouguerra created HIVE-19474: - Summary: Decimal type should be cast as part of the CTAS or INSERT Clause. Key: HIVE-19474 URL: https://issues.apache.org/jira/browse/HIVE-19474 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra HIVE-18569 introduced a runtime config variable to allow the indexing of Decimal as Double. This leads to a messy state: the Hive metadata thinks the column is still decimal while it is stored as double. Since the Hive metadata of the column is Decimal, the logical optimizer will not push down aggregates. I tried to fix this by adding some logic to the application, but it makes the code very clumsy with lots of branches. Instead, I propose to revert that patch and let the user introduce an explicit cast. This is better since the metadata then reflects the actual storage type, aggregate push-down kicks in, and no config is needed. cc [~ashutoshc] and [~nishantbangarwa] You can see the difference with the following DDL {code}
create table test_base_table(`timecolumn` timestamp, `interval_marker` string, `num_l` DECIMAL(10,2));
insert into test_base_table values ('2015-03-08 00:00:00', 'i1-start', 4.5);
set hive.druid.approx.result=true;
CREATE TABLE druid_test_table STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.segment.granularity" = "DAY") AS select cast(`timecolumn` as timestamp with local time zone) as `__time`, `interval_marker`, cast(`num_l` as double) FROM test_base_table;
describe druid_test_table;
explain select sum(num_l), min(num_l) FROM druid_test_table;
CREATE TABLE druid_test_table_2 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.segment.granularity" = "DAY") AS select cast(`timecolumn` as timestamp with local time zone) as `__time`, `interval_marker`, `num_l` FROM test_base_table;
describe druid_test_table_2;
explain select sum(num_l), min(num_l) FROM druid_test_table_2;
{code} -- This message was sent by Atlassian JIRA 
(v7.6.3#76005)
[jira] [Created] (HIVE-19490) Locking on Insert into for non native and managed tables.
slim bouguerra created HIVE-19490: - Summary: Locking on Insert into for non native and managed tables. Key: HIVE-19490 URL: https://issues.apache.org/jira/browse/HIVE-19490 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Current state of the art: managed non-native tables, like Druid tables, need to acquire a lock on INSERT INTO or INSERT OVERWRITE. This lock is Exclusive by default for any non-native table, which implies that inserts into a Druid table also block any read query for the duration of the insert. IMO this lock (on INSERT INTO) is not needed, since the insert statement appends data and the loading state is managed partly by the Hive storage handler hook and partly by Druid. What I am proposing is to relax the lock level to Shared for all non-native tables on INSERT INTO operations, and to keep it Exclusive for INSERT OVERWRITE for now. Any feedback is welcome. cc [~ekoifman] / [~ashutoshc] / [~jdere] / [~hagleitn] Also, I am not sure of the best way to unit test this; currently I am using the debugger to check that the locks are what I expect. Please let me know if there is a better way to do this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
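The proposed relaxation can be sketched as a lock-compatibility check. This is a deliberately simplified illustration, not Hive's actual lock manager:

```python
# Sketch of the lock-level change proposed above: INSERT INTO on a
# non-native table takes SHARED instead of EXCLUSIVE, so concurrent reads
# are no longer blocked. Compatibility rule simplified for illustration.
SHARED, EXCLUSIVE = "shared", "exclusive"

def compatible(held, requested):
    # Two SHARED locks coexist; EXCLUSIVE conflicts with everything.
    return held == SHARED and requested == SHARED

# Before the change: INSERT INTO held EXCLUSIVE, blocking readers.
assert not compatible(EXCLUSIVE, SHARED)
# After the change: INSERT INTO holds SHARED; reads proceed concurrently.
assert compatible(SHARED, SHARED)
# INSERT OVERWRITE keeps EXCLUSIVE and still blocks everything.
assert not compatible(SHARED, EXCLUSIVE)
```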
[jira] [Created] (HIVE-19441) Add support for float aggregator and use LLAP test Driver
slim bouguerra created HIVE-19441: - Summary: Add support for float aggregator and use LLAP test Driver Key: HIVE-19441 URL: https://issues.apache.org/jira/browse/HIVE-19441 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Adds support for the float kind aggregator. Also uses LLAP as the test driver, reducing the execution time of the tests from about 2 hours to 15 minutes. Note that this patch unveils an issue with timezones; maybe it is fixed by [~jcamachorodriguez]'s upcoming set of patches. Before {code} [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.6.1:testCompile (default-testCompile) @ hive-it-qfile --- [INFO] Compiling 21 source files to /Users/sbouguerra/Hdev/hive/itests/qtest/target/test-classes [INFO] [INFO] --- maven-surefire-plugin:2.21.0:test (default-test) @ hive-it-qfile --- [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] Running org.apache.hadoop.hive.cli.TestMiniDruidCliDriver [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6,654.117 s - in org.apache.hadoop.hive.cli.TestMiniDruidCliDriver [INFO] [INFO] Results: [INFO] [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:51 h [INFO] Finished at: 2018-05-04T12:43:19-07:00 [INFO] {code} After {code} [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.6.1:testCompile (default-testCompile) @ hive-it-qfile --- [INFO] Compiling 22 source files to /Users/sbouguerra/Hdev/hive/itests/qtest/target/test-classes [INFO] [INFO] --- maven-surefire-plugin:2.21.0:test (default-test) @ hive-it-qfile --- [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] Running org.apache.hadoop.hive.cli.TestMiniDruidCliDriver [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 907.167 s - in org.apache.hadoop.hive.cli.TestMiniDruidCliDriver [INFO] [INFO] Results: [INFO] [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total 
time: 15:31 min [INFO] Finished at: 2018-05-04T13:15:11-07:00 [INFO] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19462) Fix mapping for char_length function to enable pushdown to Druid.
slim bouguerra created HIVE-19462: - Summary: Fix mapping for char_length function to enable pushdown to Druid. Key: HIVE-19462 URL: https://issues.apache.org/jira/browse/HIVE-19462 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Currently char_length is not pushed down to Druid because of a missing mapping from/to Calcite. This patch adds that mapping. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19586) Optimize Count(distinct X) pushdown based on the storage capabilities
slim bouguerra created HIVE-19586: - Summary: Optimize Count(distinct X) pushdown based on the storage capabilities Key: HIVE-19586 URL: https://issues.apache.org/jira/browse/HIVE-19586 Project: Hive Issue Type: Improvement Components: Druid integration, Logical Optimizer Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 h1. Goal Provide a way to rewrite queries that combine COUNT(DISTINCT) with aggregates like SUM as a series of group-bys. This can be useful for pushing down to Druid queries like {code} select count(DISTINCT interval_marker), count (distinct dim), sum(num_l) FROM druid_test_table GROUP BY `__time`, `zone` ; {code} More generally, this is useful in cases where storage handlers cannot perform count(distinct column). h1. How to do it Use the Calcite rule {code} org.apache.calcite.rel.rules.AggregateExpandDistinctAggregatesRule{code}, which breaks a count distinct down into either a single group-by with grouping sets, or a series of group-bys that may be linked with joins when multiple counts are present. FYI, Hive today has a similar rule, {code} org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveExpandDistinctAggregatesRule{code}, but it only provides a rewrite to a grouping-sets-based plan. I am planning to use the actual Calcite rule; [~ashutoshc], any concerns or caveats to be aware of? h2. Concerns/questions We need a way to switch between grouping sets and a simple chained group-by based on plan cost. For instance, for a Druid-based scan it always makes sense (at least today) to push down a series of group-bys and stitch the result sets together in Hive later (as opposed to scanning everything). But this might not be true for other storage handlers that can handle grouping sets; for those it is better to push down the grouping sets as one table scan. I am still unsure how I can lean on the cost optimizer to select the best plan; [~ashutoshc]/[~jcamachorodriguez], any inputs? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
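The join-based rewrite that the Calcite rule produces can be sketched on plain data: count(distinct dim) plus sum(num) grouped by a key becomes two group-bys joined on that key. Illustrative only, not Calcite's actual output:

```python
# Sketch of the AggregateExpandDistinctAggregatesRule rewrite discussed
# above, on plain Python rows of (key, dim, num).
from collections import defaultdict

rows = [("k1", "a", 1), ("k1", "a", 2), ("k1", "b", 3), ("k2", "a", 4)]

# Branch 1: sum(num) GROUP BY key (no distinct involved).
sums = defaultdict(int)
for k, _, num in rows:
    sums[k] += num

# Branch 2: inner group-by on (key, dim) dedupes; outer counts per key.
inner = {(k, dim) for k, dim, _ in rows}
distinct_counts = defaultdict(int)
for k, _ in inner:
    distinct_counts[k] += 1

# Join the two branches back on the key.
result = {k: (distinct_counts[k], sums[k]) for k in sums}
assert result == {"k1": (2, 6), "k2": (1, 4)}
```

The cost question in the ticket is exactly whether these two branches run as separate pushed-down scans stitched in Hive, or as one grouping-sets scan.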
[jira] [Created] (HIVE-19684) Hive stats optimizer wrongly uses stats against non native tables
slim bouguerra created HIVE-19684: - Summary: Hive stats optimizer wrongly uses stats against non native tables Key: HIVE-19684 URL: https://issues.apache.org/jira/browse/HIVE-19684 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Stats of non-native tables are inaccurate, thus queries over non-native tables cannot be correctly optimized by the stats optimizer. Take the example of the query {code} Explain select count(*) from (select `__time` from druid_test_table limit 1) as src ; {code} the plan is wrongly reduced to {code} POSTHOOK: query: explain extended select count(*) from (select `__time` from druid_test_table limit 1) as src POSTHOOK: type: QUERY STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: 1 Processor Tree: ListSink {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19672) Column Names mismatch between native Druid Tables and Hive External map
slim bouguerra created HIVE-19672: - Summary: Column Names mismatch between native Druid Tables and Hive External map Key: HIVE-19672 URL: https://issues.apache.org/jira/browse/HIVE-19672 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Fix For: 4.0.0 Druid column names are case sensitive, while Hive's are case insensitive. This implies that any Druid datasource with upper-case characters in a column name will not return the expected results. One possible fix is to remap the column names before issuing the JSON query to Druid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
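The remapping fix suggested above can be sketched as follows; the helper name is hypothetical:

```python
# Sketch of the proposed fix: remap Hive's lower-cased column names onto
# the Druid datasource's case-sensitive columns before building the JSON
# query. Illustrative helper, not Hive's actual code.

def remap_columns(hive_columns, druid_columns):
    # Index Druid's columns by their lower-cased form; if two Druid columns
    # differ only by case, the last one wins (a real fix would flag that).
    by_lower = {c.lower(): c for c in druid_columns}
    # Fall back to the Hive name when Druid has no case-insensitive match.
    return [by_lower.get(c.lower(), c) for c in hive_columns]

druid_cols = ["__time", "UserId", "num_l"]
assert remap_columns(["userid", "num_l"], druid_cols) == ["UserId", "num_l"]
```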
[jira] [Created] (HIVE-19674) Group by Decimal Constants push down to Druid tables.
slim bouguerra created HIVE-19674: - Summary: Group by Decimal Constants push down to Druid tables. Key: HIVE-19674 URL: https://issues.apache.org/jira/browse/HIVE-19674 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Queries like the following get generated by Tableau: {code} SELECT SUM(`ssb_druid_100`.`lo_revenue`) AS `sum_lo_revenue_ok` FROM `druid_ssb`.`ssb_druid_100` `ssb_druid_100` GROUP BY 1.1001; {code} The GROUP BY key is pushed down to Druid as a constant column, which leads to an exception while parsing back the results, since the Druid input format does not allow decimals. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19675) Cast to timestamps on Druid time column leads to an exception
slim bouguerra created HIVE-19675: - Summary: Cast to timestamps on Druid time column leads to an exception Key: HIVE-19675 URL: https://issues.apache.org/jira/browse/HIVE-19675 Project: Hive Issue Type: Bug Components: Druid integration Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez The following query fails due to a formatting issue. {code} SELECT CAST(`ssb_druid_100`.`__time` AS TIMESTAMP) AS `x_time`, SUM(`ssb_druid_100`.`lo_revenue`) AS `sum_lo_revenue_ok` FROM `druid_ssb`.`ssb_druid_100` `ssb_druid_100` GROUP BY CAST(`ssb_druid_100`.`__time` AS TIMESTAMP); {code} Exception {code} Error: java.io.IOException: java.lang.NumberFormatException: For input string: "1991-12-31 19:00:00" (state=,code=0) {code} [~jcamachorodriguez] maybe this is fixed by your upcoming patches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19695) Year Month Day extraction functions need to add an implicit cast for column that are String types
slim bouguerra created HIVE-19695: - Summary: Year Month Day extraction functions need to add an implicit cast for column that are String types Key: HIVE-19695 URL: https://issues.apache.org/jira/browse/HIVE-19695 Project: Hive Issue Type: Bug Components: Druid integration, Query Planning Affects Versions: 3.0.0 Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.1.0 To avoid surprising/wrong results, the Hive query plan should add an explicit cast over non-date/timestamp column types when the user tries to extract Year/Month/Hour etc. This is an example of misleading results. {code} create table test_base_table(`timecolumn` timestamp, `date_c` string, `timestamp_c` string, `metric_c` double); insert into test_base_table values ('2015-03-08 00:00:00', '2015-03-10', '2015-03-08 00:00:00', 5.0); CREATE TABLE druid_test_table STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ("druid.segment.granularity" = "DAY") AS select cast(`timecolumn` as timestamp with local time zone) as `__time`, `date_c`, `timestamp_c`, `metric_c` FROM test_base_table; select year(date_c), month(date_c),day(date_c), hour(date_c), year(timestamp_c), month(timestamp_c),day(timestamp_c), hour(timestamp_c) from druid_test_table; {code} returns the following wrong results: {code} PREHOOK: query: select year(date_c), month(date_c),day(date_c), hour(date_c), year(timestamp_c), month(timestamp_c),day(timestamp_c), hour(timestamp_c) from druid_test_table PREHOOK: type: QUERY PREHOOK: Input: default@druid_test_table A masked pattern was here POSTHOOK: query: select year(date_c), month(date_c),day(date_c), hour(date_c), year(timestamp_c), month(timestamp_c),day(timestamp_c), hour(timestamp_c) from druid_test_table POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_test_table A masked pattern was here 1969 12 31 16 1969 12 31 16 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
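The intended rewrite can be sketched as simple string rewriting. This is illustrative only; the real fix operates on Hive's expression tree during planning, and the function and parameter names here are made up:

```python
def rewrite_extract(func, column, column_type):
    """Wrap string-typed columns in an explicit cast before applying
    a date-part extraction such as year/month/day/hour."""
    if column_type in ("date", "timestamp"):
        # Already a temporal type: extract directly.
        return f"{func}({column})"
    # For string columns, extracting over the raw value is undefined;
    # force a timestamp interpretation first.
    return f"{func}(CAST({column} AS TIMESTAMP))"
```

Under this rewrite, `year(date_c)` over the string column becomes `year(CAST(date_c AS TIMESTAMP))`, which Druid can evaluate meaningfully.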
[jira] [Created] (HIVE-19607) Pushing Aggregates on Top of Aggregates
slim bouguerra created HIVE-19607: - Summary: Pushing Aggregates on Top of Aggregates Key: HIVE-19607 URL: https://issues.apache.org/jira/browse/HIVE-19607 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra Fix For: 3.1.0 This plan shows an instance where the count aggregates can be pushed to Druid which will eliminate the last stage reducer. {code} +PREHOOK: query: EXPLAIN select count(DISTINCT cstring2), sum(cdouble) FROM druid_table +PREHOOK: type: QUERY +POSTHOOK: query: EXPLAIN select count(DISTINCT cstring2), sum(cdouble) FROM druid_table +POSTHOOK: type: QUERY +STAGE DEPENDENCIES: + Stage-1 is a root stage + Stage-0 depends on stages: Stage-1 + +STAGE PLANS: + Stage: Stage-1 +Tez + A masked pattern was here + Edges: +Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) + A masked pattern was here + Vertices: +Map 1 +Map Operator Tree: +TableScan + alias: druid_table + properties: +druid.fieldNames cstring2,$f1 +druid.fieldTypes string,double +druid.query.json {"queryType":"groupBy","dataSource":"default.druid_table","granularity":"all","dimensions":[{"type":"default","dimension":"cstring2","outputName":"cstring2","outputType":"STRING"}],"limitSpec":{"type":"default"},"aggregations":[{"type":"doubleSum","name":"$f1","fieldName":"cdouble"}],"intervals":["1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"]} +druid.query.type groupBy + Statistics: Num rows: 9173 Data size: 1673472 Basic stats: COMPLETE Column stats: NONE + Select Operator +expressions: cstring2 (type: string), $f1 (type: double) +outputColumnNames: cstring2, $f1 +Statistics: Num rows: 9173 Data size: 1673472 Basic stats: COMPLETE Column stats: NONE +Group By Operator + aggregations: count(cstring2), sum($f1) + mode: hash + outputColumnNames: _col0, _col1 + Statistics: Num rows: 1 Data size: 208 Basic stats: COMPLETE Column stats: NONE + Reduce Output Operator +sort order: +Statistics: Num rows: 1 Data size: 208 Basic stats: COMPLETE Column stats: NONE +value expressions: _col0 (type: bigint), _col1 
(type: double) +Reducer 2 +Reduce Operator Tree: + Group By Operator +aggregations: count(VALUE._col0), sum(VALUE._col1) +mode: mergepartial +outputColumnNames: _col0, _col1 +Statistics: Num rows: 1 Data size: 208 Basic stats: COMPLETE Column stats: NONE +File Output Operator + compressed: false + Statistics: Num rows: 1 Data size: 208 Basic stats: COMPLETE Column stats: NONE + table: + input format: org.apache.hadoop.mapred.SequenceFileInputFormat + output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat + serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe + {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19601) Unsupported Post join function
slim bouguerra created HIVE-19601: - Summary: Unsupported Post join function Key: HIVE-19601 URL: https://issues.apache.org/jira/browse/HIVE-19601 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra While trying to use the Calcite rule {code} org.apache.calcite.rel.rules.AggregateExpandDistinctAggregatesRule#JOIN {code} I got the following Calcite plan {code} 2018-05-17T09:26:02,781 DEBUG [80d6d405-ed78-4f60-bd93-b3e08e424f73 main] translator.PlanModifierForASTConv: Final plan after modifier HiveProject(_c0=[$1], _c1=[$2]) HiveProject(zone=[$0], $f1=[$1], $f2=[$3]) HiveJoin(condition=[IS NOT DISTINCT FROM($0, $2)], joinType=[inner], algorithm=[none], cost=[not available]) HiveProject(zone=[$0], $f1=[$1]) HiveAggregate(group=[{0}], agg#0=[count($1)]) HiveProject(zone=[$0], interval_marker=[$1]) HiveAggregate(group=[{0, 1}]) HiveProject(zone=[$3], interval_marker=[$1]) HiveTableScan(table=[[druid_test_dst.test_base_table]], table:alias=[test_base_table]) HiveProject(zone=[$0], $f1=[$1]) HiveAggregate(group=[{0}], agg#0=[count($1)]) HiveProject(zone=[$0], dim=[$1]) HiveAggregate(group=[{0, 1}]) HiveProject(zone=[$3], dim=[$4]) HiveTableScan(table=[[druid_test_dst.test_base_table]], table:alias=[test_base_table]) {code} I ran into the following issue {code} 2018-05-17T09:26:02,876 ERROR [80d6d405-ed78-4f60-bd93-b3e08e424f73 main] parse.CalcitePlanner: CBO failed, skipping CBO.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Invalid function 'IS NOT DISTINCT FROM' at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:1069) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT] at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1464) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT] at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT] at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19600) Hive and Calcite have different semantics for Grouping sets
slim bouguerra created HIVE-19600: - Summary: Hive and Calcite have different semantics for Grouping sets Key: HIVE-19600 URL: https://issues.apache.org/jira/browse/HIVE-19600 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra Fix For: 3.1.0 h1. Issue: I tried to use the Calcite rule {code} org.apache.calcite.rel.rules.AggregateExpandDistinctAggregatesRule#AggregateExpandDistinctAggregatesRule(java.lang.Class, boolean, org.apache.calcite.tools.RelBuilderFactory) {code} to replace the current rule used by Hive {code} org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveExpandDistinctAggregatesRule#HiveExpandDistinctAggregatesRule {code} but I got an exception when generating the operator tree out of the Calcite plan. This is the Calcite plan {code} HiveProject.HIVE.[](input=rel#50:HiveAggregate.HIVE.[](input=rel#48:HiveProject.HIVE.[](input=rel#44:HiveAggregate.HIVE.[](input=rel#38:HiveProject.HIVE.[](input=rel#0:HiveTableScan.HIVE.[] (table=[druid_test_dst.test_base_table],table:alias=test_base_table)[false],$f0=$3,$f1=$1,$f2=$4),group={0, 1, 2},groups=[{0, 1}, {0, 2}],$g=GROUPING($0, $1, $2)),$f0=$0,$f1=$1,$f2=$2,$g_1==($3, 1),$g_2==($3, 2)),group={0},agg#0=count($1) FILTER $3,agg#1=count($2) FILTER $4),_o__c0=$1,_o__c1=$2) {code} This is the exception stack {code} 2018-05-17T08:46:48,604 ERROR [649a61b0-d8c7-45d8-962d-b1d38397feb4 main] ql.Driver: FAILED: SemanticException Line 0:-1 Argument type mismatch 'zone': The first argument to grouping() must be an int/long. Got: STRING org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Argument type mismatch 'zone': The first argument to grouping() must be an int/long.
Got: STRING at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1467) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:239) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:185) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:12566) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:12521) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4525) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4298) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:10487) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10426) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11339) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11196) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11223) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11209) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:517) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12074) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:330) at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:288) at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:164) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:288) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:643) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1686) at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1633) at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1628) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:214) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) at
[jira] [Created] (HIVE-19615) Proper handling of is null and not is null predicate when pushed to Druid
slim bouguerra created HIVE-19615: - Summary: Proper handling of is null and not is null predicate when pushed to Druid Key: HIVE-19615 URL: https://issues.apache.org/jira/browse/HIVE-19615 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 Recent development in Druid introduced new semantics for null handling [here|https://github.com/b-slim/druid/commit/219e77aeac9b07dc20dd9ab2dd537f3f17498346] Based on those changes, we need to honor push down of expressions with is null / is not null predicates. The proposed fix overrides the mapping of Calcite function to Druid expression to match the correct semantics. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
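The mapping the proposed fix needs can be sketched as follows. This is illustrative only: the real code maps Calcite RexNodes rather than strings, and `isnull`/`notnull` are the unary functions in Druid's expression language (see also HIVE-19923 on using the unary-function form).

```python
def to_druid_expr(predicate, column):
    """Translate a Hive IS NULL / IS NOT NULL predicate on a column
    into the corresponding Druid expression string."""
    if predicate == "is null":
        return f'isnull("{column}")'
    if predicate == "is not null":
        return f'notnull("{column}")'
    raise ValueError(f"unsupported predicate: {predicate}")
```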
[jira] [Created] (HIVE-19680) Push down limit is not applied for Druid storage handler.
slim bouguerra created HIVE-19680: - Summary: Push down limit is not applied for Druid storage handler. Key: HIVE-19680 URL: https://issues.apache.org/jira/browse/HIVE-19680 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 A query like {code} select `__time` from druid_test_table limit 1; {code} returns more than one row. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
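A client-side safety net for the limit can be sketched like this. It is illustrative only; the actual fix belongs in the query generation or record reader of the storage handler, not in user code:

```python
import itertools

def fetch_with_limit(rows, limit):
    """Enforce LIMIT on the Hive side even when the pushed-down Druid
    query ignores it, by truncating the row iterator."""
    return list(itertools.islice(rows, limit))
```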
[jira] [Created] (HIVE-19869) Remove double formatting bug followup of HIVE-19382
slim bouguerra created HIVE-19869: - Summary: Remove double formatting bug followup of HIVE-19382 Key: HIVE-19869 URL: https://issues.apache.org/jira/browse/HIVE-19869 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra HIVE-19382 has a minor bug that happens when users provide a custom format as part of the FROM_UNIXTIME function. Here is an example query {code} SELECT SUM(`ssb_druid_100`.`lo_revenue`) AS `sum_lo_revenue_ok`, CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(`ssb_druid_100`.`__time` AS TIMESTAMP)), '-MM-dd HH:00:00') AS TIMESTAMP) AS `thr___time_ok` FROM `druid_ssb`.`ssb_druid_100` `ssb_druid_100` GROUP BY CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(CAST(`ssb_druid_100`.`__time` AS TIMESTAMP)), '-MM-dd HH:00:00') AS TIMESTAMP); {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19868) Extract support for float aggregator
slim bouguerra created HIVE-19868: - Summary: Extract support for float aggregator Key: HIVE-19868 URL: https://issues.apache.org/jira/browse/HIVE-19868 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19923) Follow up of HIVE-19615, use UnaryFunction instead of prefix
slim bouguerra created HIVE-19923: - Summary: Follow up of HIVE-19615, use UnaryFunction instead of prefix Key: HIVE-19923 URL: https://issues.apache.org/jira/browse/HIVE-19923 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra The correct usage of Druid's isnull function is {code} isnull(exp) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19879) Remove unused calcite sql operator.
slim bouguerra created HIVE-19879: - Summary: Remove unused calcite sql operator. Key: HIVE-19879 URL: https://issues.apache.org/jira/browse/HIVE-19879 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra HIVE-19796 introduced an unused SQL operator by mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19721) Druid Storage handler throws exception when query has a Cast to Date
slim bouguerra created HIVE-19721: - Summary: Druid Storage handler throws exception when query has a Cast to Date Key: HIVE-19721 URL: https://issues.apache.org/jira/browse/HIVE-19721 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.1 {code} SELECT CAST(`ssb_druid_100`.`__time` AS DATE) AS `x_time`, SUM(`ssb_druid_100`.`metric_c`) AS `sum_lo_revenue_ok` FROM `default`.`druid_test_table` `ssb_druid_100` GROUP BY CAST(`ssb_druid_100`.`__time` AS DATE); {code} {code} 2018-05-26T06:54:56,570 DEBUG [HttpClient-Netty-Worker-5] client.NettyHttpClient: [POST http://localhost:8082/druid/v2/] Got chunk: 0B, last=true 2018-05-26T06:54:56,572 ERROR [1917f624-7b94-4990-9e3a-bbfff3656365 main] CliDriver: Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Unknown type: DATE java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: Unknown type: DATE at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:602) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:509) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:145) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2509) at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:335) at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1514) at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1488) at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:177) at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) at 
org.apache.hadoop.hive.cli.TestMiniDruidLocalCliDriver.testCliDriver(TestMiniDruidLocalCliDriver.java:43) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:92) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.junit.runners.Suite.runChild(Suite.java:127) at org.junit.runners.Suite.runChild(Suite.java:26) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at 
org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:73) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at
[jira] [Created] (HIVE-19796) Push Down TRUNC Fn to Druid Storage Handler
slim bouguerra created HIVE-19796: - Summary: Push Down TRUNC Fn to Druid Storage Handler Key: HIVE-19796 URL: https://issues.apache.org/jira/browse/HIVE-19796 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Push down queries with the TRUNC date function, such as {code} SELECT SUM((`ssb_druid_100`.`discounted_price` * `ssb_druid_100`.`net_revenue`)) AS `sum_calculation_4998925219892510720_ok`, CAST(TRUNC(CAST(`ssb_druid_100`.`__time` AS TIMESTAMP),'MM') AS DATE) AS `tmn___time_ok` FROM `druid_ssb`.`ssb_druid_100` `ssb_druid_100` GROUP BY CAST(TRUNC(CAST(`ssb_druid_100`.`__time` AS TIMESTAMP),'MM') AS DATE) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18573) Use proper Calcite operator instead of UDFs
slim bouguerra created HIVE-18573: - Summary: Use proper Calcite operator instead of UDFs Key: HIVE-18573 URL: https://issues.apache.org/jira/browse/HIVE-18573 Project: Hive Issue Type: Bug Components: Hive Reporter: slim bouguerra Currently, Hive mostly uses user-defined black-box SQL operators during query planning. It would be more beneficial to use proper Calcite operators. Also, use a single name for the EXTRACT operator instead of a different name for every unit, and the same for the FLOOR function. This would allow unifying the treatment per operator. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
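The idea of a single operator name dispatching on the unit can be sketched as follows. This is illustrative Python, not Calcite's operator classes; the point is one EXTRACT entry point instead of a separate operator per unit (year/month/day/hour), with the same shape applying to FLOOR.

```python
import datetime

def extract(unit, ts):
    """Single EXTRACT entry point: dispatch on the unit argument
    rather than defining one operator name per unit."""
    getters = {
        "year": lambda t: t.year,
        "month": lambda t: t.month,
        "day": lambda t: t.day,
        "hour": lambda t: t.hour,
    }
    return getters[unit](ts)
```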
[jira] [Created] (HIVE-18595) UNIX_TIMESTAMP UDF fails when type is Timestamp with local timezone
slim bouguerra created HIVE-18595: - Summary: UNIX_TIMESTAMP UDF fails when type is Timestamp with local timezone Key: HIVE-18595 URL: https://issues.apache.org/jira/browse/HIVE-18595 Project: Hive Issue Type: Bug Reporter: slim bouguerra {code} 2018-01-31T12:59:45,464 ERROR [10e97c86-7f90-406b-a8fa-38be5d3529cc main] ql.Driver: FAILED: SemanticException [Error 10014]: Line 3:456 Wrong arguments ''-MM-dd HH:mm:ss'': The function UNIX_TIMESTAMP takes only string/date/timestamp types org.apache.hadoop.hive.ql.parse.SemanticException: Line 3:456 Wrong arguments ''-MM-dd HH:mm:ss'': The function UNIX_TIMESTAMP takes only string/date/timestamp types at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1394) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:235) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:181) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:11847) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:11780) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genGBLogicalPlan(CalcitePlanner.java:3140) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:4330) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1407) at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1354) at org.apache.calcite.tools.Frameworks$1.apply(Frameworks.java:118) at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:1052) at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:154) at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:111) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1159) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.getOptimizedAST(CalcitePlanner.java:1175) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:422) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11393) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:304) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:163) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:639) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1504) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1632) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1395) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1382) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:240) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:343) at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1331) at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1305) at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:173) at 
org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) at org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver(TestMiniDruidCliDriver.java:59) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at
[jira] [Created] (HIVE-18594) DATEDIFF UDF fails when type is timestamp with Local timezone.
slim bouguerra created HIVE-18594: - Summary: DATEDIFF UDF fails when type is timestamp with Local timezone. Key: HIVE-18594 URL: https://issues.apache.org/jira/browse/HIVE-18594 Project: Hive Issue Type: Bug Components: Hive Reporter: slim bouguerra {code} 2018-01-31T12:45:08,488 ERROR [9b5c5020-b1f5-4703-8c2e-bac4aa01a578 main] ql.Driver: FAILED: SemanticException [Error 10014]: Line 3:88 Wrong arguments ''2004-07-04'': DATEDIFF() only takes STRING/TIMESTAMP/DATEWRITABLE types as 1-th argument, got TIMESTAMPLOCALTZ org.apache.hadoop.hive.ql.parse.SemanticException: Line 3:88 Wrong arguments ''2004-07-04'': DATEDIFF() only takes STRING/TIMESTAMP/DATEWRITABLE types as 1-th argument, got TIMESTAMPLOCALTZ at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1394) at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89) at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:76) at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:235) at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:181) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:11847) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:11802) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genSelectLogicalPlan(CalcitePlanner.java:4005) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:4336) at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1407) at org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1354) at org.apache.calcite.tools.Frameworks$1.apply(Frameworks.java:118) at org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:1052) at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:154) at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:111) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1159) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.getOptimizedAST(CalcitePlanner.java:1175) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:422) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:11393) at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:304) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:163) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:268) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:639) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1504) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1632) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1395) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1382) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:240) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:343) at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1331) at 
org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1305) at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:173) at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) at org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver(TestMiniDruidCliDriver.java:59) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at
[jira] [Created] (HIVE-18730) Use LLAP as execution engine for Druid mini Cluster Tests
slim bouguerra created HIVE-18730: - Summary: Use LLAP as execution engine for Druid mini Cluster Tests Key: HIVE-18730 URL: https://issues.apache.org/jira/browse/HIVE-18730 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 Currently, we are using local MR to run the mini cluster tests. It would be better to use an LLAP cluster or Tez. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18731) Add documentation about this feature.
slim bouguerra created HIVE-18731: - Summary: Add documentation about this feature. Key: HIVE-18731 URL: https://issues.apache.org/jira/browse/HIVE-18731 Project: Hive Issue Type: Sub-task Components: Druid integration Reporter: slim bouguerra We need to add basic docs about the new table properties and what they mean in practice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18732) Push order/limit to Druid historical when approximate results are allowed
slim bouguerra created HIVE-18732: - Summary: Push order/limit to Druid historical when approximate results are allowed Key: HIVE-18732 URL: https://issues.apache.org/jira/browse/HIVE-18732 Project: Hive Issue Type: Improvement Reporter: slim bouguerra Druid 0.11 allows forcing push-down of ORDER BY/LIMIT to historicals using the query context flag {code}forcePushDownLimit{code}. As per the docs at http://druid.io/docs/latest/querying/groupbyquery.html, this is a great optimization that can be used when approximate results are allowed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
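As a sketch of what the issue describes, the flag would ride along in the query context of a Druid groupBy query. The data source, columns and interval below are made up for illustration, and the flag name is spelled as cited in this issue (the Druid docs linked above should be checked for the exact spelling):

```json
{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "granularity": "all",
  "dimensions": [{"type": "default", "dimension": "page"}],
  "aggregations": [{"type": "longSum", "name": "added", "fieldName": "added"}],
  "limitSpec": {
    "type": "default",
    "limit": 10,
    "columns": [{"dimension": "added", "direction": "descending"}]
  },
  "intervals": ["2013-01-01/2014-01-01"],
  "context": {"forcePushDownLimit": true}
}
```

With the flag set, each historical applies the order/limit locally before results are merged at the broker, which is where the approximation comes from.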
[jira] [Created] (HIVE-18729) Druid Time column type
slim bouguerra created HIVE-18729: - Summary: Druid Time column type Key: HIVE-18729 URL: https://issues.apache.org/jira/browse/HIVE-18729 Project: Hive Issue Type: Task Components: Druid integration Reporter: slim bouguerra Assignee: Jesus Camacho Rodriguez I talked offline with [~jcamachorodriguez] about this and we agreed that the best way to go is to support both cases, where the Druid time column can be timestamp or timestamp with local time zone. In fact, for Hive-Druid internal tables this makes perfect sense, since we have Hive metadata about the time column during the CTAS statement, so we can handle both cases as we do for other storage formats, e.g. ORC. For Druid external tables, we can have a default type and allow the user to override it via table properties. CC [~ashutoshc] and [~nishantbangarwa]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18780) Improve schema discovery For Druid Storage Handler
slim bouguerra created HIVE-18780: - Summary: Improve schema discovery For Druid Storage Handler Key: HIVE-18780 URL: https://issues.apache.org/jira/browse/HIVE-18780 Project: Hive Issue Type: Improvement Reporter: slim bouguerra Assignee: slim bouguerra Currently, the Druid storage adapter issues a segment metadata query every time the query is of type Select or Scan. Worse, every input split (map task) then does the same, since it uses the same SerDe; this is very expensive and puts a lot of pressure on the Druid cluster. The way to fix this is to derive the schema from the Calcite plan and ship it as part of the Hive query context, instead of serializing only the query itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18331) Renew the Kerberos ticket used by Druid Query runner
slim bouguerra created HIVE-18331: - Summary: Renew the Kerberos ticket used by Druid Query runner Key: HIVE-18331 URL: https://issues.apache.org/jira/browse/HIVE-18331 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra The Druid HTTP client has to renew the current user's Kerberos ticket when it is close to expiring. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
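The renewal policy can be sketched generically: renew whenever the ticket is within a safety window of its expiry, before issuing the next request. The class, method and window below are illustrative, not the Druid client's actual API:

```java
// Hypothetical sketch of a "renew before expiry" check; not Druid's client code.
public class TicketRenewal {
    // Renew when less than one minute of ticket lifetime remains (illustrative value).
    static final long RENEW_WINDOW_MS = 60_000;

    // True when the ticket should be renewed before the next HTTP call.
    public static boolean shouldRenew(long ticketExpiryMs, long nowMs) {
        return ticketExpiryMs - nowMs <= RENEW_WINDOW_MS;
    }

    public static void main(String[] args) {
        // 50 seconds of lifetime left: renew. 100 seconds left: keep using it.
        System.out.println(shouldRenew(1_000_000, 950_000));
        System.out.println(shouldRenew(1_000_000, 900_000));
    }
}
```

The check would run before every request, so a ticket never expires mid-flight between requests.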
[jira] [Created] (HIVE-20375) Json SerDe ignoring the timestamp.formats property
slim bouguerra created HIVE-20375: - Summary: Json SerDe ignoring the timestamp.formats property Key: HIVE-20375 URL: https://issues.apache.org/jira/browse/HIVE-20375 Project: Hive Issue Type: Bug Affects Versions: 4.0.0 Reporter: slim bouguerra The JSON SerDe is supposed to accept the "timestamp.formats" SerDe property to allow different timestamp formats; after a recent refactor this is no longer working. Looking at the code, the SerDe is not using the parser constructed with the added formats https://github.com/apache/hive/blob/1105ef3974d8a324637d3d35881a739af3aeb382/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L82 but is instead using a Converter https://github.com/apache/hive/blob/1105ef3974d8a324637d3d35881a739af3aeb382/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonStructReader.java#L324 The Converter in turn uses org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter.TimestampConverter, which has no knowledge of user formats whatsoever; it uses the static converter org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils#getTimestampFromString -- This message was sent by Atlassian JIRA (v7.6.3#76005)
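The intended behavior of "timestamp.formats" can be sketched with a hypothetical parser (this is not Hive's HiveJsonStructReader): try each user-configured pattern in order, and only fall back to a default format when none matches. The static converter path described above skips the configured patterns entirely, which is the bug:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch, not the Hive implementation: honor user-supplied
// patterns (the behavior "timestamp.formats" is meant to enable) before
// falling back to a single default format.
public class TimestampFormats {
    static final String DEFAULT_FORMAT = "yyyy-MM-dd HH:mm:ss";

    public static LocalDateTime parse(String value, List<String> userFormats) {
        for (String pattern : userFormats) {
            try {
                return LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern));
            } catch (DateTimeParseException ignored) {
                // not this pattern; try the next configured one
            }
        }
        // last resort: the default format
        return LocalDateTime.parse(value, DateTimeFormatter.ofPattern(DEFAULT_FORMAT));
    }

    public static void main(String[] args) {
        // Parses via the user-configured day-first pattern.
        System.out.println(parse("31/08/2013 01:02:33", Arrays.asList("dd/MM/yyyy HH:mm:ss")));
    }
}
```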
[jira] [Created] (HIVE-20376) Timestamp Timezone parser doesn't handle ISO formats "2013-08-31T01:02:33Z"
slim bouguerra created HIVE-20376: - Summary: Timestamp Timezone parser doesn't handle ISO formats "2013-08-31T01:02:33Z" Key: HIVE-20376 URL: https://issues.apache.org/jira/browse/HIVE-20376 Project: Hive Issue Type: Bug Reporter: slim bouguerra It would be nice to add ISO formats to the timezone utils parser, org.apache.hadoop.hive.common.type.TimestampTZUtil#parse(java.lang.String), so it can handle values like "2013-08-31T01:02:33Z". CC [~jcamachorodriguez]/ [~ashutoshc] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20377) Hive Kafka Storage Handler
slim bouguerra created HIVE-20377: - Summary: Hive Kafka Storage Handler Key: HIVE-20377 URL: https://issues.apache.org/jira/browse/HIVE-20377 Project: Hive Issue Type: Bug Affects Versions: 4.0.0 Reporter: slim bouguerra Assignee: slim bouguerra h1. Goal * Read streaming data from a Kafka queue as an external table. * Allow streaming navigation by pushing down filters on the Kafka record partition id, offset and timestamp. * Insert streaming data from Kafka into an actual Hive internal table, using a CTAS statement. h1. Example h2. Create the external table
{code}
CREATE EXTERNAL TABLE kafka_table (
  `timestamp` timestamp, page string, `user` string, language string,
  added int, deleted int, flags string, comment string, namespace string)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "wikipedia",
  "kafka.bootstrap.servers" = "brokeraddress:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.JsonSerDe");
{code}
h2. Kafka Metadata In order to keep track of Kafka records, the storage handler will automatically add the Kafka row metadata, e.g. partition id, record offset and record timestamp.
{code}
DESCRIBE EXTENDED kafka_table
timestamp    timestamp  from deserializer
page         string     from deserializer
user         string     from deserializer
language     string     from deserializer
country      string     from deserializer
continent    string     from deserializer
namespace    string     from deserializer
newpage      boolean    from deserializer
unpatrolled  boolean    from deserializer
anonymous    boolean    from deserializer
robot        boolean    from deserializer
added        int        from deserializer
deleted      int        from deserializer
delta        bigint     from deserializer
__partition  int        from deserializer
__offset     bigint     from deserializer
__timestamp  bigint     from deserializer
{code}
h2. Filter push down. Newer Kafka consumers, 0.11.0 and higher, allow seeking on the stream based on a given offset.
The proposed storage handler will be able to leverage this API by pushing down filters over the metadata columns, namely __partition (int), __offset (long) and __timestamp (long). For instance, a query like
{code}
select `__offset` from kafka_table
where (`__offset` < 10 and `__offset` > 3 and `__partition` = 0)
   or (`__partition` = 0 and `__offset` < 105 and `__offset` > 99)
   or (`__offset` = 109);
{code}
will result in a scan of partition 0 only, reading only records with offsets between 4 and 109. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
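The predicate-to-seek translation can be illustrated with a hypothetical helper (the names are not Hive's): each OR branch of the filter is normalized into a closed offset interval for a partition, and the covering interval of those branches gives the single seek/stop range the consumer needs:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the pushdown idea: collapse the OR branches into one
// covering interval, so the consumer seeks once to the smallest start offset
// and stops after the largest end offset instead of scanning the whole topic.
public class OffsetPushdown {
    // Each element is {startInclusive, endInclusive} for one OR branch.
    public static long[] covering(List<long[]> intervals) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        for (long[] iv : intervals) {
            start = Math.min(start, iv[0]);
            end = Math.max(end, iv[1]);
        }
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        long[] range = covering(Arrays.asList(
                new long[] {4, 9},     // __offset > 3 and __offset < 10
                new long[] {100, 104}, // __offset > 99 and __offset < 105
                new long[] {109, 109}  // __offset = 109
        ));
        System.out.println(range[0] + ".." + range[1]); // 4..109
    }
}
```

For the example query above, the branches on partition 0 normalize to [4, 9], [100, 104] and [109, 109], so the covering range is offsets 4 through 109, matching the described scan.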
[jira] [Created] (HIVE-20426) Upload Druid Test Runner logs from Build Slaves
slim bouguerra created HIVE-20426: - Summary: Upload Druid Test Runner logs from Build Slaves Key: HIVE-20426 URL: https://issues.apache.org/jira/browse/HIVE-20426 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: Vineet Garg Currently only the Hive log is uploaded from "hive/itests/qtest/tmp/log/". It would be very valuable if we could also upload the following Druid logs: * coordinator.log * broker.log * historical.log -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20427) Remove Druid Mock tests from CliDriver
slim bouguerra created HIVE-20427: - Summary: Remove Druid Mock tests from CliDriver Key: HIVE-20427 URL: https://issues.apache.org/jira/browse/HIVE-20427 Project: Hive Issue Type: Improvement Reporter: slim bouguerra Assignee: slim bouguerra As per the comment at https://issues.apache.org/jira/browse/HIVE-20425?focusedCommentId=16586272=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16586272 we do not need to run those mock Druid tests anymore, since org.apache.hadoop.hive.cli.TestMiniDruidCliDriver covers most of these cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20425) Use a custom range of port for embedded Derby used by Druid.
slim bouguerra created HIVE-20425: - Summary: Use a custom range of port for embedded Derby used by Druid. Key: HIVE-20425 URL: https://issues.apache.org/jira/browse/HIVE-20425 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra It seems like a good amount of the flakiness of the Druid tests is due to port collisions between the Derby instance used by Hive and the one used by Druid. The goal of this patch is to use a custom range, 60000 to 65535, and find the first available port to be used by the Druid Derby process. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
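A minimal sketch of the proposed probing (illustrative, not the actual patch): walk the range and return the first port that can actually be bound.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch of the fix: probe the 60000-65535 range and return the first port
// that can be bound, for the embedded Derby instance used by Druid.
public class DerbyPortFinder {
    public static int findFreePort(int from, int to) {
        for (int port = from; port <= to; port++) {
            try (ServerSocket socket = new ServerSocket(port)) {
                return port; // bind succeeded, so the port is free
            } catch (IOException busy) {
                // port already in use; try the next one
            }
        }
        throw new IllegalStateException("no free port in [" + from + ", " + to + "]");
    }

    public static void main(String[] args) {
        System.out.println(findFreePort(60000, 65535));
    }
}
```

Note the probe-then-use pattern still has a small race window (another process can grab the port between probing and Derby's bind), which is acceptable for tests since a collision just retries on the next run.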
[jira] [Created] (HIVE-20481) Add the Kafka Key record as part of the row.
slim bouguerra created HIVE-20481: - Summary: Add the Kafka Key record as part of the row. Key: HIVE-20481 URL: https://issues.apache.org/jira/browse/HIVE-20481 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra Assignee: slim bouguerra Kafka records are keyed; in most cases this key is null or is used to route records to the same partition. This patch adds the key as a binary column {code}__record_key{code}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20485) Test Storage Handler with Secured Kafka Cluster
slim bouguerra created HIVE-20485: - Summary: Test Storage Handler with Secured Kafka Cluster Key: HIVE-20485 URL: https://issues.apache.org/jira/browse/HIVE-20485 Project: Hive Issue Type: Sub-task Reporter: slim bouguerra Assignee: slim bouguerra We need to test this with a secured Kafka cluster: * Kerberos * SSL support -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20094) Update Druid to 0.12.1 version
slim bouguerra created HIVE-20094: - Summary: Update Druid to 0.12.1 version Key: HIVE-20094 URL: https://issues.apache.org/jira/browse/HIVE-20094 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra As per Jira title. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18959) Avoid creating extra pool of threads within LLAP
slim bouguerra created HIVE-18959: - Summary: Avoid creating extra pool of threads within LLAP Key: HIVE-18959 URL: https://issues.apache.org/jira/browse/HIVE-18959 Project: Hive Issue Type: Task Components: Druid integration Environment: Kerberos Cluster Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 The current Druid Kerberos HTTP client uses an external single-threaded pool to handle auth retry calls (e.g. when a cookie expires, or on other transient auth issues). First, this is not buying us anything, since the whole Druid task is executed as one synchronous task. Second, it can cause a major issue if an exception occurs that leads to shutting down the LLAP main thread. Thus, to fix this, we should avoid using an external thread pool and handle retrying synchronously. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
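The proposed fix direction, retrying inline on the calling thread instead of handing failures to a dedicated pool, can be sketched as follows (hypothetical helper, not Druid's client code):

```java
import java.util.concurrent.Callable;

// Sketch of synchronous retry: the action runs on the calling thread and is
// retried up to maxAttempts times; no extra thread pool is ever created, so
// a failure cannot escape onto a pool thread and take down the LLAP daemon.
public class SyncRetry {
    public static <T> T call(Callable<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // transient failure (e.g. expired auth cookie): retry inline
                last = new RuntimeException("attempt " + attempt + " failed", e);
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        int result = call(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("transient auth failure");
            }
            return 42;
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Because every attempt happens on the caller's thread, any terminal failure surfaces as an ordinary exception in the task, where normal error handling applies.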
[jira] [Created] (HIVE-19155) Daylight saving time causes Druid inserts to fail with org.apache.hive.druid.io.druid.java.util.common.UOE: Cannot add overlapping segments
slim bouguerra created HIVE-19155: - Summary: Daylight saving time causes Druid inserts to fail with org.apache.hive.druid.io.druid.java.util.common.UOE: Cannot add overlapping segments Key: HIVE-19155 URL: https://issues.apache.org/jira/browse/HIVE-19155 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra If you try to insert data around the daylight saving time hour, the query fails with the following exception
{code}
2018-04-10T11:24:58,836 ERROR [065fdaa2-85f9-4e49-adaf-3dc14d51be90 main] exec.DDLTask: Failed
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hive.druid.io.druid.java.util.common.UOE: Cannot add overlapping segments [2015-03-08T05:00:00.000Z/2015-03-09T05:00:00.000Z and 2015-03-09T04:00:00.000Z/2015-03-10T04:00:00.000Z] with the same version [2018-04-10T11:24:48.388-07:00]
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:914) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:919) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4831) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:394) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2443) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2114) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1797) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1538) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1532) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:204) [hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) [hive-cli-3.1.0-SNAPSHOT.jar:?]
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) [hive-cli-3.1.0-SNAPSHOT.jar:?]
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402) [hive-cli-3.1.0-SNAPSHOT.jar:?]
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:335) [hive-cli-3.1.0-SNAPSHOT.jar:?]
at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1455) [hive-it-util-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1429) [hive-it-util-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:177) [hive-it-util-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104) [hive-it-util-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
at org.apache.hadoop.hive.cli.TestMiniDruidCliDriver.testCliDriver(TestMiniDruidCliDriver.java:59) [test-classes/:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_92]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_92]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_92]
{code}
You can reproduce this using the following DDL
{code}
create database druid_test;
use druid_test;
create table test_table(`timecolumn` timestamp, `userid` string, `num_l` float);
insert into test_table values ('2015-03-08 00:00:00', 'i1-start', 4);
insert into test_table values ('2015-03-08 23:59:59', 'i1-end', 1);
insert into test_table values ('2015-03-09 00:00:00', 'i2-start', 4);
insert into test_table values ('2015-03-09 23:59:59', 'i2-end', 1);
insert into test_table values ('2015-03-10 00:00:00', 'i3-start', 2);
insert into test_table values ('2015-03-10 23:59:59', 'i3-end', 2);
CREATE TABLE druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.segment.granularity" = "DAY")
AS select cast(`timecolumn` as timestamp with local time zone) as `__time`, `userid`, `num_l` FROM test_table;
{code}
The fix is to always adjust the Druid segment identifiers to UTC.
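Why local-time DAY segments can overlap: around a DST switch, a local calendar day is not 24 hours long, while UTC days always are, so day boundaries computed in a local zone drift relative to UTC. A small java.time check (illustrative, not Hive code) for the date used in the reproduction above:

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Demonstrates the root cause: 2015-03-08 (the US spring-forward date in the
// reproduction) is only 23 hours long in America/Los_Angeles but 24 in UTC,
// so DAY boundaries anchored in local time do not line up with UTC days.
public class DstDayLength {
    public static long dayLengthHours(LocalDate date, ZoneId zone) {
        ZonedDateTime start = date.atStartOfDay(zone);
        ZonedDateTime end = date.plusDays(1).atStartOfDay(zone);
        return Duration.between(start, end).toHours();
    }

    public static void main(String[] args) {
        System.out.println(dayLengthHours(LocalDate.of(2015, 3, 8), ZoneId.of("America/Los_Angeles"))); // 23
        System.out.println(dayLengthHours(LocalDate.of(2015, 3, 8), ZoneId.of("UTC"))); // 24
    }
}
```

That one-hour shift is exactly what the exception shows: the segment boundary moves from a T05:00Z offset to T04:00Z, producing intervals that overlap once mapped to UTC.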
[jira] [Created] (HIVE-19157) Assert that Insert into Druid Table fails.
slim bouguerra created HIVE-19157: - Summary: Assert that Insert into Druid Table fails. Key: HIVE-19157 URL: https://issues.apache.org/jira/browse/HIVE-19157 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra The usual workflow of loading data into Druid relies on the fact that HS2 is able to load the segment metadata from HDFS produced by the LLAP/Tez workers. In cases where HS2 is not able to perform `ls` on the HDFS path, the INSERT INTO query will return success and will not insert any data. This bug was introduced in {code}org.apache.hadoop.hive.druid.DruidStorageHandlerUtils#getCreatedSegments{code} when we added the feature to allow creating empty tables.
{code}
try {
  fss = fs.listStatus(taskDir);
} catch (FileNotFoundException e) {
  // This is a CREATE TABLE statement or query executed for CTAS/INSERT
  // did not produce any result. We do not need to do anything, this is
  // expected behavior.
  return publishedSegmentsBuilder.build();
}
{code}
I am still looking for the best way to fix this. [~jcamachorodriguez]/[~ashutoshc], any idea what is the best way to detect that it is an empty CREATE TABLE statement? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19187) Update Druid Storage Handler to Druid 0.12.0
slim bouguerra created HIVE-19187: - Summary: Update Druid Storage Handler to Druid 0.12.0 Key: HIVE-19187 URL: https://issues.apache.org/jira/browse/HIVE-19187 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.1.0 The currently used Druid version is 0.11.0. This patch updates the Druid version to the most recent release, 0.12.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19239) Check for possible null timestamp fields during SerDe from Druid events
slim bouguerra created HIVE-19239: - Summary: Check for possible null timestamp fields during SerDe from Druid events Key: HIVE-19239 URL: https://issues.apache.org/jira/browse/HIVE-19239 Project: Hive Issue Type: Bug Reporter: slim bouguerra Assignee: slim bouguerra Currently we do not check for possibly null timestamp events, which might lead to an NPE. This patch adds an additional check for that case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19298) Fix operator tree of CTAS for Druid Storage Handler
slim bouguerra created HIVE-19298: - Summary: Fix operator tree of CTAS for Druid Storage Handler Key: HIVE-19298 URL: https://issues.apache.org/jira/browse/HIVE-19298 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.1.0 The current operator plan of CTAS for the Druid storage handler is broken when the user enables the property {code}hive.exec.parallel{code} (sets it to {code}true{code}). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19044) Duplicate field names within Druid Query Generated by Calcite plan
slim bouguerra created HIVE-19044: - Summary: Duplicate field names within Druid Query Generated by Calcite plan Key: HIVE-19044 URL: https://issues.apache.org/jira/browse/HIVE-19044 Project: Hive Issue Type: Bug Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra This is the Query plan as you can see "$f4" is duplicated. {code} PREHOOK: query: EXPLAIN SELECT Calcs.key AS none_key_nk, SUM(Calcs.num0) AS temp_z_stdevp_num0___1723718801__0_, COUNT(Calcs.num0) AS temp_z_stdevp_num0___2730138885__0_, SUM((Calcs.num0 * Calcs.num0)) AS temp_z_stdevp_num0___4071133194__0_, STDDEV_POP(Calcs.num0) AS stp_num0_ok FROM druid_tableau.calcs Calcs GROUP BY Calcs.key PREHOOK: type: QUERY POSTHOOK: query: EXPLAIN SELECT Calcs.key AS none_key_nk, SUM(Calcs.num0) AS temp_z_stdevp_num0___1723718801__0_, COUNT(Calcs.num0) AS temp_z_stdevp_num0___2730138885__0_, SUM((Calcs.num0 * Calcs.num0)) AS temp_z_stdevp_num0___4071133194__0_, STDDEV_POP(Calcs.num0) AS stp_num0_ok FROM druid_tableau.calcs Calcs GROUP BY Calcs.key POSTHOOK: type: QUERY STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: calcs properties: druid.fieldNames key,$f1,$f2,$f3,$f4 druid.fieldTypes string,double,bigint,double,double druid.query.json {"queryType":"groupBy","dataSource":"druid_tableau.calcs","granularity":"all","dimensions":[{"type":"default","dimension":"key","outputName":"key","outputType":"STRING"}],"limitSpec":{"type":"default"},"aggregations":[{"type":"doubleSum","name":"$f1","fieldName":"num0"},{"type":"filtered","filter":{"type":"not","field":{"type":"selector","dimension":"num0","value":null}},"aggregator":{"type":"count","name":"$f2","fieldName":"num0"}},{"type":"doubleSum","name":"$f3","expression":"(\"num0\" * \"num0\")"},{"type":"doubleSum","name":"$f4","expression":"(\"num0\" * \"num0\")"}],"postAggregations":[{"type":"expression","name":"$f4","expression":"pow(((\"$f4\" - ((\"$f1\" * 
\"$f1\") / \"$f2\")) / \"$f2\"),0.5)"}],"intervals":["1900-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"]} druid.query.type groupBy Select Operator expressions: key (type: string), $f1 (type: double), $f2 (type: bigint), $f3 (type: double), $f4 (type: double) outputColumnNames: _col0, _col1, _col2, _col3, _col4 ListSink {code} Table DDL {code} create database druid_tableau; use druid_tableau; drop table if exists calcs; create table calcs STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' TBLPROPERTIES ( "druid.segment.granularity" = "MONTH", "druid.query.granularity" = "DAY") AS SELECT cast(datetime0 as timestamp with local time zone) `__time`, key, str0, str1, str2, str3, date0, date1, date2, date3, time0, time1, datetime1, zzz, cast(bool0 as string) bool0, cast(bool1 as string) bool1, cast(bool2 as string) bool2, cast(bool3 as string) bool3, int0, int1, int2, int3, num0, num1, num2, num3, num4 from default.calcs_orc; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19070) Add More Test To Druid Mini Cluster 200 Tableau kind queries.
slim bouguerra created HIVE-19070: - Summary: Add More Test To Druid Mini Cluster 200 Tableau kind queries. Key: HIVE-19070 URL: https://issues.apache.org/jira/browse/HIVE-19070 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra Fix For: 3.0.0 In this patch I am adding 200 new Tableau-style queries that run over a new dataset called calcs. The dataset is very small. I have also consolidated 3 different tests to run as one test, which will help keep execution time low. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19023) Druid storage Handler still using old select query when the CBO fails
slim bouguerra created HIVE-19023: - Summary: Druid storage Handler still using old select query when the CBO fails Key: HIVE-19023 URL: https://issues.apache.org/jira/browse/HIVE-19023 Project: Hive Issue Type: Improvement Components: Druid integration Reporter: slim bouguerra Assignee: slim bouguerra See the usage of {code}org.apache.hadoop.hive.druid.io.DruidQueryBasedInputFormat#createSelectStarQuery{code}; it can be replaced by the Scan query, which is more efficient. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18993) Use Druid Expressions
slim bouguerra created HIVE-18993: - Summary: Use Druid Expressions Key: HIVE-18993 URL: https://issues.apache.org/jira/browse/HIVE-18993 Project: Hive Issue Type: Task Reporter: slim bouguerra -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-19011) Druid Storage Handler returns conflicting results for Qtest druidmini_dynamic_partition.q
slim bouguerra created HIVE-19011: - Summary: Druid Storage Handler returns conflicting results for Qtest druidmini_dynamic_partition.q Key: HIVE-19011 URL: https://issues.apache.org/jira/browse/HIVE-19011 Project: Hive Issue Type: Bug Reporter: slim bouguerra This git diff shows the conflicting results {code} diff --git a/ql/src/test/results/clientpositive/druid/druidmini_dynamic_partition.q.out b/ql/src/test/results/clientpositive/druid/druidmini_dynamic_partition.q.out index 714778ebfc..cea9b7535c 100644 --- a/ql/src/test/results/clientpositive/druid/druidmini_dynamic_partition.q.out +++ b/ql/src/test/results/clientpositive/druid/druidmini_dynamic_partition.q.out @@ -243,7 +243,7 @@ POSTHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_partitioned_table POSTHOOK: Output: hdfs://### HDFS PATH ### -1408069801800 4139540644 10992545287 165393120 +1408069801800 3272553822 10992545287 -648527473 PREHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM druid_partitioned_table_0 PREHOOK: type: QUERY PREHOOK: Input: default@druid_partitioned_table_0 @@ -429,7 +429,7 @@ POSTHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM d POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_partitioned_table POSTHOOK: Output: hdfs://### HDFS PATH ### -2857395071862 4139540644 -1661313883124 885815256 +2857395071862 3728054572 -1661313883124 71894663 PREHOOK: query: EXPLAIN INSERT OVERWRITE TABLE druid_partitioned_table SELECT cast (`ctimestamp1` as timestamp with local time zone) as `__time`, cstring1, @@ -566,7 +566,7 @@ POSTHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM d POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_partitioned_table POSTHOOK: Output: hdfs://### HDFS PATH ### -1408069801800 7115092987 10992545287 1232243564 +1408069801800 4584782821 10992545287 -1808876374 PREHOOK: query: SELECT sum(cint), max(cbigint), 
sum(cbigint), max(cint) FROM druid_partitioned_table_0 PREHOOK: type: QUERY PREHOOK: Input: default@druid_partitioned_table_0 @@ -659,7 +659,7 @@ POSTHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM d POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_partitioned_table POSTHOOK: Output: hdfs://### HDFS PATH ### -1408069801800 7115092987 10992545287 1232243564 +1408069801800 4584782821 10992545287 -1808876374 PREHOOK: query: EXPLAIN SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM druid_max_size_partition PREHOOK: type: QUERY POSTHOOK: query: EXPLAIN SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM druid_max_size_partition @@ -758,7 +758,7 @@ POSTHOOK: query: SELECT sum(cint), max(cbigint), sum(cbigint), max(cint) FROM d POSTHOOK: type: QUERY POSTHOOK: Input: default@druid_partitioned_table POSTHOOK: Output: hdfs://### HDFS PATH ### -1408069801800 7115092987 10992545287 1232243564 +1408069801800 4584782821 10992545287 -1808876374 PREHOOK: query: DROP TABLE druid_partitioned_table_0 PREHOOK: type: DROPTABLE PREHOOK: Input: default@druid_partitioned_table_0 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-18996) SubString Druid converter assuming that index is always constant literal value
slim bouguerra created HIVE-18996: - Summary: SubString Druid converter assuming that index is always constant literal value Key: HIVE-18996 URL: https://issues.apache.org/jira/browse/HIVE-18996 Project: Hive Issue Type: Bug Reporter: slim bouguerra A query like the following
{code}
SELECT substring(namespace, CAST(deleted AS INT), 4) FROM druid_table_1;
{code}
will fail with
{code}
java.lang.AssertionError: not a literal: $13
at org.apache.calcite.rex.RexLiteral.findValue(RexLiteral.java:963)
at org.apache.calcite.rex.RexLiteral.findValue(RexLiteral.java:955)
at org.apache.calcite.rex.RexLiteral.intValue(RexLiteral.java:938)
at org.apache.calcite.adapter.druid.SubstringOperatorConversion.toDruidExpression(SubstringOperatorConversion.java:46)
at org.apache.calcite.adapter.druid.DruidExpressions.toDruidExpression(DruidExpressions.java:120)
at org.apache.calcite.adapter.druid.DruidQuery.computeProjectAsScan(DruidQuery.java:746)
at org.apache.calcite.adapter.druid.DruidRules$DruidProjectRule.onMatch(DruidRules.java:308)
at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:317)
{code}
because it assumes that the index is always a constant literal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20735) Address some of the review comments.
slim bouguerra created HIVE-20735: - Summary: Address some of the review comments. Key: HIVE-20735 URL: https://issues.apache.org/jira/browse/HIVE-20735 Project: Hive Issue Type: Sub-task Components: kafka integration Reporter: slim bouguerra Assignee: slim bouguerra As part of the review comments we agreed to:
# remove the start and end offsets columns
# remove the best-effort mode
# make 2PC the default protocol for exactly-once semantics (EOS)
Also, this patch will include an additional enhancement to add Kerberos support. -- This message was sent by Atlassian JIRA (v7.6.3#76005)