[ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146914#comment-16146914 ]
Prasanth Jayachandran commented on HIVE-17280:
----------------------------------------------

That is certainly not the format that Hive expects. After concatenation, merged and unmerged (incompatible) files get moved to a staging directory. Then MoveTask moves the files from the staging directory to the final destination directory (which is also the source directory in the case of concatenation). MoveTask makes certain assumptions about filenames for bucketing, speculative execution, etc. In the example files you provided, part-00000_copy_1 and part-00001_copy_1 will be considered the same file written by different tasks (from speculative execution), and the largest file will be picked as the winner of the speculative execution. This is the same issue as HIVE-17403. Hive usually writes files in the format 000000_0, where 000000 is the task id/bucket id and the digit after the underscore is the task attempt. I am working on a patch that will restrict concatenation for external tables. And for Hive managed tables, the LOAD DATA command will make sure the filenames conform to Hive's expectations.

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were
> written by Spark.
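To see why part-00000_copy_1 and part-00001_copy_1 collide, here is a minimal sketch of the task-id extraction and MoveTask-style dedup described in the comment. The regex and class name here are illustrative approximations, not Hive's actual implementation (the real logic lives in Hive's Utilities class):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaskIdDemo {
    // Simplified approximation of Hive's task-id extraction. Hive expects
    // names like 000000_0: leading digits are the task/bucket id, the
    // digits after the underscore are the task attempt.
    private static final Pattern NAME = Pattern.compile("^.*?([0-9]+)(_[0-9]+)?$");

    static String taskId(String filename) {
        Matcher m = NAME.matcher(filename);
        return m.matches() ? m.group(1) : filename;
    }

    public static void main(String[] args) {
        // For Spark's names, the only trailing digit group is the "1" in
        // "_copy_1", so both files resolve to the same task id.
        String[] files = {"part-00000_copy_1", "part-00001_copy_1"};
        Map<String, String> byTask = new HashMap<>();
        for (String f : files) {
            String id = taskId(f);
            System.out.println(f + " -> task id " + id);
            // MoveTask-style dedup keeps one file per task id, treating the
            // rest as losing speculative attempts (largest file wins).
            byTask.put(id, f);
        }
        System.out.println("files kept after dedup: " + byTask.size());
    }
}
```

Both Spark-written files map to task id "1", so only one of them survives the move, which matches the observed data loss; a Hive-written pair like 000000_0 and 000001_0 would map to distinct task ids and both survive.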
> Here are the steps to reproduce the problem:
> - create a table:
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
> - insert 2 rows using Spark:
> {code:java}
> spark-shell
> scala> case class AA(a: String, b: Int)
> scala> val df = sc.parallelize(Array(AA("b", 2), AA("c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
> - change the table schema:
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
> - insert 2 more rows with Spark:
> {code:java}
> spark-shell
> scala> case class BB(a: String, b: Int, aa: String, bb: Int)
> scala> val df = sc.parallelize(Array(BB("b", 2, "b", 2), BB("c", 3, "c", 3))).toDF
> scala> df.write.insertInto("aa")
> {code}
> - at this point, running a select statement in Hive correctly returns *4 rows* in the table; then run the concatenation:
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)