Witold Drabicki created HIVE-25946:
--------------------------------------

             Summary: select from external table pointing to MySQL returns 
multiple copies of the same row
                 Key: HIVE-25946
                 URL: https://issues.apache.org/jira/browse/HIVE-25946
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.3.7
         Environment: Hive runs on *GCP Dataproc*, image version 
*1.5.56-debian10* (Hive v {*}2.3.7{*})

*MySQL* server version is {*}5.7.36{*}.

The following jars are used:
{code:java}
add jar gs://d-test-bucket-1/commons-pool-1.6.jar;
add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar;
add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar;
add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar;
-- identical behavior when using mysql-connector-java-5.1.49
{code}
            Reporter: Witold Drabicki


The following table has been created in Hive:

 
{code:java}
CREATE EXTERNAL TABLE table_with_4_rows
(
  col1 varchar(100),
  col2 varchar(15),
  col3 TIMESTAMP,    
  col4 TIMESTAMP
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
    "hive.sql.database.type" = "MYSQL",
    "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
    "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
    "hive.sql.dbcp.username" = "<name>",
    "hive.sql.dbcp.password" = "<password>",
    "hive.sql.table" = "TABLE_WITH_4_ROWS",
    "hive.sql.schema" = "schema-name",
    "hive.sql.query" = "select col1, col2, col3, col4 from 
schema-name.TABLE_WITH_4_ROWS",
    "hive.sql.numPartitions" = "1",
    "hive.sql.dbcp.maxActive" = "1"
);{code}
 

The table in MySQL has just 4 rows, and is defined as:

 
{code:java}
CREATE TABLE `TABLE_WITH_4_ROWS` (
  `col1` varchar(100) NOT NULL DEFAULT '',
  `col2` varchar(15) DEFAULT NULL,
  `col3` datetime DEFAULT NULL,
  `col4` datetime DEFAULT NULL,
  PRIMARY KEY (`col1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;{code}
 

When the cluster is *not 100% busy* and has idle containers, running *select col1, 
col2 from table_with_4_rows* results in a job that uses 49 mappers and no 
reducers and returns 187 rows instead of 4 (each original row is duplicated 
multiple times in the results).

Running the same select with *WHERE col1 = 'specific-value'* also uses 49 
mappers and, instead of returning 1 row, returns duplicated data (46 to 48 
rows, depending on the value).
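
For reference, the exact statements run were of the following form 
('specific-value' stands in for an actual column value, as described above):
{code:java}
select col1, col2 from table_with_4_rows;                                -- returns 187 rows instead of 4
select col1, col2 from table_with_4_rows where col1 = 'specific-value'; -- returns 46 to 48 rows instead of 1
{code}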

When the cluster is *100% busy* and the job needs to reclaim containers from other 
jobs, the above queries use just 1 mapper and *return correct data* (4 rows and 1 
row, respectively).

Running *ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS* does not change 
the results. Moreover, it also works incorrectly, as it sets +numRows+ in the 
table's metadata to 187.
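
The statistic can be inspected in the table metadata; *DESCRIBE FORMATTED* lists 
the table parameters, including +numRows+ (a minimal sketch of the statements 
involved, assuming default settings):
{code:java}
ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS;
-- "Table Parameters" in the output below include numRows, which ends up as 187 instead of 4
DESCRIBE FORMATTED table_with_4_rows;
{code}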

During query execution, an *ArrayIndexOutOfBoundsException* is thrown and logged 
as *Error during condition build*. Here is the output from the *log* 
file:

 
{code:java}
2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not 
allowed in prolog.
Continuing ...
2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set 
#tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 
[Map 1]
2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] 
|Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is 
deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: 
Found configuration file null
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] 
|tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] 
|tez.HiveSplitGenerator|: SplitLocationProvider: 
org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: 
PLAN PATH = 
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] 
|exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: 
Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0] 
|dag.RootInputInitializerManager|: Starting InputInitializer for Input: 
table_with_4_rows on vertex vertex_1644267138354_0004_1_00 [Map 1]
2022-02-08 20:43:40,277 [INFO] [InputInitializer {Map 1} #0] 
|tez.HiveSplitGenerator|: GenerateConsistentSplitsInHive=true
2022-02-08 20:43:40,287 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
vertex_1644267138354_0004_1_00 [Map 1] transitioned from NEW to INITIALIZING 
due to event V_INIT
2022-02-08 20:43:40,288 [INFO] [InputInitializer {Map 1} #0] 
|tez.HiveSplitGenerator|: The preferred split size is 52428800
2022-02-08 20:43:40,289 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: 
PLAN PATH = 
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: 
Processing alias table_with_4_rows
2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: 
Adding 1 inputs; the first input is 
hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] 
|io.HiveInputFormat|: hive.io.file.readcolumn.ids = 0,1
2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] 
|io.HiveInputFormat|: hive.io.file.readcolumn.names = col1,col2
2022-02-08 20:43:40,300 [INFO] [InputInitializer {Map 1} #0] 
|io.HiveInputFormat|: Generating splits for dirs: 
hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
2022-02-08 20:43:40,642 [ERROR] [InputInitializer {Map 1} #0] 
|jdbc.QueryConditionBuilder|: Error during condition build
java.lang.ArrayIndexOutOfBoundsException: 0
    at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
    at 
org.apache.hive.storage.jdbc.QueryConditionBuilder.createConditionString(QueryConditionBuilder.java:125)
    at 
org.apache.hive.storage.jdbc.QueryConditionBuilder.buildCondition(QueryConditionBuilder.java:74)
    at 
org.apache.hive.storage.jdbc.conf.JdbcStorageConfigManager.getQueryToExecute(JdbcStorageConfigManager.java:84)
    at 
org.apache.hive.storage.jdbc.dao.GenericJdbcDatabaseAccessor.getTotalNumberOfRecords(GenericJdbcDatabaseAccessor.java:98)
    at 
org.apache.hive.storage.jdbc.JdbcInputFormat.getSplits(JdbcInputFormat.java:70)
    at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
    at 
org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
    at 
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
    at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at 
org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] 
|io.HiveInputFormat|: number of splits 49
2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] 
|tez.HiveSplitGenerator|: Number of input splits: 49. 29 available slots, 1.7 
waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: 
PLAN PATH = 
hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] 
|exec.SerializationUtilities|: Deserializing MapWork using kryo
2022-02-08 20:43:40,714 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: 
Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
2022-02-08 20:43:40,720 [INFO] [InputInitializer {Map 1} #0] 
|tez.SplitGrouper|: # Src groups for split generation: 2
2022-02-08 20:43:40,723 [INFO] [InputInitializer {Map 1} #0] 
|tez.SplitGrouper|: Estimated number of tasks: 49 for bucket 1
2022-02-08 20:43:40,725 [INFO] [InputInitializer {Map 1} #0] 
|grouper.TezSplitGrouper|: Grouping splits in Tez
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] 
|grouper.TezSplitGrouper|: Desired splits: 49 too small.  Desired splitLength: 
1972178859 Max splitLength: 1073741824 New desired splits: 90 Total length: 
96636764119 Original splits: 49
2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] 
|grouper.TezSplitGrouper|: Using original number of splits: 49 desired splits: 
90
2022-02-08 20:43:40,727 [INFO] [InputInitializer {Map 1} #0] 
|tez.SplitGrouper|: Original split count is 49 grouped split count is 49, for 
bucket: 1
2022-02-08 20:43:40,729 [INFO] [InputInitializer {Map 1} #0] 
|tez.HiveSplitGenerator|: Number of split groups: 49
{code}
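The SAXParseException ("Content is not allowed in prolog") followed by 
*ArrayIndexOutOfBoundsException: 0* from XMLDecoder.readObject is consistent with 
java.beans.XMLDecoder being handed input that is not XMLEncoder output, so no 
objects get decoded. A minimal standalone JDK sketch of that behavior (an 
illustration only, not the actual Hive/QueryConditionBuilder code path; the input 
string is hypothetical):
{code:java}
import java.beans.XMLDecoder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlDecoderBehavior {
  public static void main(String[] args) {
    // Hypothetical stand-in for whatever string reaches XMLDecoder in
    // QueryConditionBuilder; deliberately not output produced by XMLEncoder.
    String notXml = "col1 = 'specific-value'";
    try (XMLDecoder decoder = new XMLDecoder(
        new ByteArrayInputStream(notXml.getBytes(StandardCharsets.UTF_8)))) {
      // The default ExceptionListener prints the SAXParseException
      // ("Content is not allowed in prolog.") and "Continuing ..." to stderr;
      // readObject() then throws ArrayIndexOutOfBoundsException: 0 because
      // no objects were decoded from the stream.
      System.out.println(decoder.readObject());
    }
  }
}
{code}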
While testing, I also created another external table pointing to a different MySQL 
table with 470559 rows. That MySQL table uses partitions. All queries against it 
always use 1 mapper and I have not seen duplicated rows yet, but COMPUTE STATISTICS 
also works incorrectly: it sets +numRows+ to 11303242.


