[
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kobefeng updated SPARK-20663:
-----------------------------
Labels: (was: easyfix)
> Data missing after insert overwrite table partition which is created on
> specific location
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-20663
> URL: https://issues.apache.org/jira/browse/SPARK-20663
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: kobefeng
>
> Use spark sql to create partition table first, and alter table by adding
> partition on specific location, then insert overwrite into this partition by
> selection, which will cause data missing compared with HIVE.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
> > id bigint,
> > name string,
> > dt string
> > ) using parquet options ('compression'='snappy',
> 'path'='/user/kofeng/partitioned_table')
> > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00
> /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition, and the specific location folder gone,
> data is missing, job is success by attaching _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table
> partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06
> /user/kofeng/partitioned_table/_SUCCESS
> ----
> ----
> -- Then drop this partition and use hive to add partition and insert
> overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists
> partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive insert overwrite that specific location successfully.
> hive> insert overwrite table kofeng.partitioned_table
> partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> hive> select * from kofeng.partitioned_table;
> OK
> 123 kofeng 20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/20170507
> -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26
> /user/kofeng/partitioned_table/20170507/000000_0
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]