[ https://issues.apache.org/jira/browse/HUDI-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nishith Agarwal reassigned HUDI-258:
------------------------------------

    Assignee: Nishith Agarwal

> Hive Query engine not supporting join queries between RT and RO tables
> ----------------------------------------------------------------------
>
>                 Key: HUDI-258
>                 URL: https://issues.apache.org/jira/browse/HUDI-258
>             Project: Apache Hudi (incubating)
>          Issue Type: Bug
>          Components: Hive Integration
>            Reporter: Balaji Varadarajan
>            Assignee: Nishith Agarwal
>            Priority: Major
>
> Description:
> [https://github.com/apache/incubator-hudi/issues/789#issuecomment-512740619]
>
> Root Cause: Hive tracks getSplits() calls by dataset basePath and does not
> take the InputFormatClass into account. Hence getSplits() is called only once.
> The RO and RT tables both have the same dataset base path and differ only in
> the InputFormatClass. Because of this, a Hive join query between them returns
> unexpected results.
>
> =============
> The result of the demo is very strange
> (Step 6(a))
>
> {{ select `_hoodie_commit_time`, symbol, ts, volume, open, close from
> stock_ticks_mor_rt where symbol = 'GOOG';
> select `_hoodie_commit_time`, symbol, ts, volume, open, close from
> stock_ticks_mor where symbol = 'GOOG';}}
> return the same results as in the demo.
> But the following join returns no rows:
>
> {{select a.key,a.ts, b.ts from stock_ticks_mor a join stock_ticks_mor_rt b
> on a.key=b.key where a.ts != b.ts
> ...
> +--------+-------+-------+--+
> | a.key  | a.ts  | b.ts  |
> +--------+-------+-------+--+
> +--------+-------+-------+--+}}
>
> {{0: jdbc:hive2://hiveserver:10000> select a.key,a.ts,b.ts from
> stock_ticks_mor_rt a join stock_ticks_mor b on a.key = b.key where a.key=
> 'GOOG_2018-08-31 10';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
> future versions. Consider using a different execution engine (i.e. spark,
> tez) or using Hive 1.X releases.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Execution log at:
> /tmp/root/root_20190718091316_ec40e8f2-be17-4450-bb75-8db9f4390041.log
> 2019-07-18 09:13:20 Starting to launch local task to process map join;
> maximum memory = 477626368
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 2019-07-18 09:13:21 Dump the side-table for tag: 0 with group count: 1 into
> file:
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-16_658_8306103829282410332-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile50--.hashtable
> 2019-07-18 09:13:21 Uploaded 1 File to:
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-16_658_8306103829282410332-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile50--.hashtable
> (317 bytes)
> 2019-07-18 09:13:21 End of local task; Time Taken: 1.688 sec.
> +---------------------+----------------------+----------------------+--+
> |        a.key        |         a.ts         |         b.ts         |
> +---------------------+----------------------+----------------------+--+
> | GOOG_2018-08-31 10  | 2018-08-31 10:29:00  | 2018-08-31 10:29:00  |
> +---------------------+----------------------+----------------------+--+
> 1 row selected (7.207 seconds)
> 0: jdbc:hive2://hiveserver:10000> select a.key,a.ts,b.ts from stock_ticks_mor
> a join stock_ticks_mor_rt b on a.key = b.key where a.key= 'GOOG_2018-08-31
> 10';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
> future versions. Consider using a different execution engine (i.e. spark,
> tez) or using Hive 1.X releases.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Execution log at:
> /tmp/root/root_20190718091348_72a5fc30-fc04-41c1-b2e3-5f943e4d5c08.log
> 2019-07-18 09:13:51 Starting to launch local task to process map join;
> maximum memory = 477626368
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> 2019-07-18 09:13:53 Dump the side-table for tag: 0 with group count: 1 into
> file:
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-48_027_3613368446029280476-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile60--.hashtable
> 2019-07-18 09:13:53 Uploaded 1 File to:
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-48_027_3613368446029280476-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile60--.hashtable
> (317 bytes)
> 2019-07-18 09:13:53 End of local task; Time Taken: 2.36 sec.
> +---------------------+----------------------+----------------------+--+
> |        a.key        |         a.ts         |         b.ts         |
> +---------------------+----------------------+----------------------+--+
> | GOOG_2018-08-31 10  | 2018-08-31 10:59:00  | 2018-08-31 10:59:00  |
> +---------------------+----------------------+----------------------+--+}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
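
The root cause stated in the description (split computation tracked by base path alone) can be illustrated with a small, self-contained sketch. This is not Hive or Hudi code: the class, the method names, the base path, and the input format labels below are all hypothetical and exist only to show why two scans that share a base path collapse into one when the input format is left out of the tracking key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only. None of these names are real Hive or Hudi classes.
class SplitTrackingSketch {

    // A table scan as a planner might see it: alias, dataset base path, input format.
    static final class TableScan {
        final String alias;
        final String basePath;
        final String inputFormat;

        TableScan(String alias, String basePath, String inputFormat) {
            this.alias = alias;
            this.basePath = basePath;
            this.inputFormat = inputFormat;
        }
    }

    // Two scans over the same (placeholder) base path with different (placeholder) formats.
    static List<TableScan> demoScans() {
        String basePath = "/hypothetical/base/path/stock_ticks_mor";
        return Arrays.asList(
            new TableScan("stock_ticks_mor", basePath, "ReadOptimizedFormat"),
            new TableScan("stock_ticks_mor_rt", basePath, "RealtimeFormat"));
    }

    // Tracks by base path only: the second scan is treated as already handled.
    static List<String> planByPathOnly(List<TableScan> scans) {
        Set<String> seen = new HashSet<>();
        List<String> splitsComputedFor = new ArrayList<>();
        for (TableScan scan : scans) {
            if (seen.add(scan.basePath)) {
                splitsComputedFor.add(scan.alias);
            }
        }
        return splitsComputedFor;
    }

    // Tracks by (base path, input format): each scan gets its own split computation.
    static List<String> planByPathAndFormat(List<TableScan> scans) {
        Set<List<String>> seen = new HashSet<>();
        List<String> splitsComputedFor = new ArrayList<>();
        for (TableScan scan : scans) {
            if (seen.add(Arrays.asList(scan.basePath, scan.inputFormat))) {
                splitsComputedFor.add(scan.alias);
            }
        }
        return splitsComputedFor;
    }

    public static void main(String[] args) {
        // Path-only tracking computes splits for the first alias only.
        System.out.println(planByPathOnly(demoScans()));
        // Composite-key tracking computes splits for both aliases.
        System.out.println(planByPathAndFormat(demoScans()));
    }
}
```

With path-only tracking, whichever table appears first in the join determines the single split computation. That is consistent with the two runs quoted above, where a.ts and b.ts are identical within each run and the value changes when the join order is swapped. Whether the actual fix belongs in Hive or in the Hudi input formats is not stated in this message.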