Witold Drabicki created HIVE-25946:
--------------------------------------

             Summary: select from external table pointing to MySQL returns multiple copies of the same row
                 Key: HIVE-25946
                 URL: https://issues.apache.org/jira/browse/HIVE-25946
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.3.7
         Environment: Hive runs on *GCP Dataproc*, image version *1.5.56-debian10* (Hive v {*}2.3.7{*}).
*MySQL* server version is {*}5.7.36{*}. The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar;
(identical behavior when using mysql-connector-java-5.1.49)
{code}
            Reporter: Witold Drabicki


The following table has been created in Hive:
{code:java}
CREATE EXTERNAL TABLE table_with_4_rows (
  col1 varchar(100),
  col2 varchar(15),
  col3 TIMESTAMP,
  col4 TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
  "hive.sql.dbcp.username" = "<name>",
  "hive.sql.dbcp.password" = "<password>",
  "hive.sql.table" = "TABLE_WITH_4_ROWS",
  "hive.sql.schema" = "schema-name",
  "hive.sql.query" = "select col1, col2, col3, col4 from schema-name.TABLE_WITH_4_ROWS",
  "hive.sql.numPartitions" = "1",
  "hive.sql.dbcp.maxActive" = "1"
);
{code}
The table in MySQL has just 4 rows and is defined as:
{code:java}
CREATE TABLE `TABLE_WITH_4_ROWS` (
  `col1` varchar(100) NOT NULL DEFAULT '',
  `col2` varchar(15) DEFAULT NULL,
  `col3` datetime DEFAULT NULL,
  `col4` datetime DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;{code}
When the cluster is *not 100% busy* and has idle containers, running *select col1, col2 from table_with_4_rows* launches a job with 49 mappers and no reducers and returns 187 rows instead of 4 (each original row is duplicated multiple times in the results). Running the same select but with *WHERE col1 = 'specific-value'* also uses 49 mappers and, instead of returning 1 row, also returns duplicated data (46 to 48 rows, depending on the value).
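Side note: the {*}Error during condition build{*} stack trace later in this report originates in *java.beans.XMLDecoder*. The same *ArrayIndexOutOfBoundsException: 0* can be reproduced in isolation by handing XMLDecoder input that is not XML, which appears to be what happens when QueryConditionBuilder gets a plain-text (or missing) serialized filter expression. This is only a minimal sketch against the JDK to illustrate the failure mode, not code taken from Hive:

{code:java}
import java.beans.XMLDecoder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlDecoderRepro {

    /**
     * Tries to decode the first object from the given bytes with XMLDecoder;
     * returns the class of the exception thrown, or null on success.
     */
    static Class<?> decodeFailure(byte[] bytes) {
        // The third argument is an ExceptionListener that swallows the
        // SAXParseException ("Content is not allowed in prolog"); the default
        // listener merely logs it and continues, as seen in the Hive log.
        try (XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(bytes), null, e -> { /* ignore */ })) {
            decoder.readObject();
            return null;
        } catch (RuntimeException e) {
            return e.getClass();
        }
    }

    public static void main(String[] args) {
        // Plain text instead of an XML-serialized expression: parsing yields
        // zero objects, so readObject() indexes into an empty array and
        // throws ArrayIndexOutOfBoundsException: 0.
        byte[] notXml = "col1 = 'specific-value'".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeFailure(notXml));
    }
}
{code}

The filter string used here is just an illustration; the point is that any non-XML input makes XMLDecoder.readObject() fail exactly as in the stack trace below (XMLDecoder.java:250).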
When the cluster is *100% busy* and the job needs to reclaim containers from other jobs, the above queries use just 1 mapper and *return correct data* (4 rows and 1 row, respectively).

Running *ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS* does not change the results; moreover, it also works incorrectly, as it sets +numRows+ in the table's metadata to 187.

An *ArrayIndexOutOfBoundsException* ({*}Error during condition build{*}) is thrown during query execution. Here's the output from the *log* file:
{code:java}
2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
Continuing ...
2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] |Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: Found configuration file null
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitLocationProvider: org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] |exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Starting InputInitializer for Input: table_with_4_rows on vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:40,277 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: GenerateConsistentSplitsInHive=true
2022-02-08 20:43:40,287 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: vertex_1644267138354_0004_1_00 [Map 1] transitioned from NEW to INITIALIZING due to event V_INIT
2022-02-08 20:43:40,288 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 52428800
2022-02-08 20:43:40,289 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Processing alias table_with_4_rows
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Adding 1 inputs; the first input is hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.ids = 0,1
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.names = col1,col2
2022-02-08 20:43:40,300 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: Generating splits for dirs: hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,642 [ERROR] [InputInitializer {Map 1} #0] |jdbc.QueryConditionBuilder|: Error during condition build
java.lang.ArrayIndexOutOfBoundsException: 0
    at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
    at org.apache.hive.storage.jdbc.QueryConditionBuilder.createConditionString(QueryConditionBuilder.java:125)
    at org.apache.hive.storage.jdbc.QueryConditionBuilder.buildCondition(QueryConditionBuilder.java:74)
    at org.apache.hive.storage.jdbc.conf.JdbcStorageConfigManager.getQueryToExecute(JdbcStorageConfigManager.java:84)
    at org.apache.hive.storage.jdbc.dao.GenericJdbcDatabaseAccessor.getTotalNumberOfRecords(GenericJdbcDatabaseAccessor.java:98)
    at org.apache.hive.storage.jdbc.JdbcInputFormat.getSplits(JdbcInputFormat.java:70)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 49
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 49. 29 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,714 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,720 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: # Src groups for split generation: 2
2022-02-08 20:43:40,723 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 49 for bucket 1
2022-02-08 20:43:40,725 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired splits: 49 too small. Desired splitLength: 1972178859 Max splitLength: 1073741824 New desired splits: 90 Total length: 96636764119 Original splits: 49
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Using original number of splits: 49 desired splits: 90
2022-02-08 20:43:40,727 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split count is 49 grouped split count is 49, for bucket: 1
2022-02-08 20:43:40,729 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of split groups: 49
{code}
While testing, I also created another external table pointing to a different MySQL table with 470559 rows. That MySQL table is partitioned. All queries against it use just 1 mapper and I have not seen duplicated rows yet, but *COMPUTE STATISTICS* also works incorrectly - it sets +numRows+ to 11303242.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)