AngersZhuuuu edited a comment on issue #25979: [SPARK-29295][SQL] Insert 
overwrite to Hive external table partition should delete old data
URL: https://github.com/apache/spark/pull/25979#issuecomment-552833762
 
 
   > As I quickly tried on Hive, Hive does not delete existing directory with 
"INSERT OVERWRITE", if the external partition was dropped by "DROP PARTITION" 
before. Hive just moves data from staging into the directory in this case.
   > 
   > Whether it produces duplicated data, depending on the filename generated. 
I tested two version of Hive locally. I do not know the logic Hive produces 
data filename.
   > 
   > On Hive 2.1.0, two "INSERT OVERWRITE" produces data file with same name 
like 000000_0. The second "INSERT OVERWRITE" moves the file into and overwrite 
old file.
   > 
   > On Hive 2.3.2, the second "INSERT OVERWRITE" causes following failure when 
moving file with same name:
   > 
   > ```
   > 0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE test(id int) 
PARTITIONED BY (name string) STORED AS TEXTFILE LOCATION '/tmp/test';
   > No rows affected (0.074 seconds)
   > 0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE test 
PARTITION(name='n1') SELECT 1;
   > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in 
the future versions. Consider using a different execution engine (i.e. spark, 
tez) or using Hive 1.X releases.
   > No rows affected (1.809 seconds)
   > 0: jdbc:hive2://localhost:10000> ALTER TABLE test DROP 
PARTITION(name='n1');
   > No rows affected (0.175 seconds)
   > 0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE test 
PARTITION(name='n1') SELECT 2;
   > WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in 
the future versions. Consider using a different execution engine (i.e. spark, 
tez) or using Hive 1.X releases.
   > Error: org.apache.hive.service.cli.HiveSQLException: Error while 
processing statement: FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: rename for src 
path: 
hdfs://namenode:8020/tmp/test/name=n1/.hive-staging_hive_2019-10-02_16-01-28_570_7296144361967433532-1/-ext-10000/000000_0
 to dest path:hdfs://namenode:8020/tmp/test/name=n1/000000_0 returned false
   >    at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
   >    at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
   >    at 
org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
   >    at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
   >    at java.security.AccessController.doPrivileged(Native Method)
   >    at javax.security.auth.Subject.doAs(Subject.java:422)
   >    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
   >    at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
   >    at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   >    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   >    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   >    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   >    at java.lang.Thread.run(Thread.java:748)
   > ```
   > 
   > So, I think it is sure that Hive does not delete existing directory when 
"INSERT OVERWRITE" into a non-existing partition. Whether duplicated data are 
seen, it depends on if filenames are unique or not between different "INSERT 
OVERWRITE".
   > 
   > If your first "INSERT OVERWRITE" produces filename "000000_0" and 
"000001_0", but the second "INSERT OVERWRITE" produces just "000000_0", you 
will read partial old data "000001_0" with new "000000_0" together.
   > 
   > Above tests are using MR as Hive execution engine. I am not sure if Hive 
has different output filename in different execution engine like Spark or Tez.
   
   Hive use `mv` here , so if name is same, it will be replaced, if your 
partition was dropped remain two files
   ```
   0000000_0
   0000001_0
   ```
   Then you insert overwrite into this partition with one file `0000000_0`, 
then the old data `0000001_0` still remains. 
   
   In spark, each result's file name is different....so, data dunplicted.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to