Woong Seok Kang created SPARK-30769:
---------------------------------------
Summary: insertInto() with existing column as partition key causes weird partition result
Key: SPARK-30769
URL: https://issues.apache.org/jira/browse/SPARK-30769
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Environment: EMR 5.29.0 with Spark 2.4.4
Reporter: Woong Seok Kang
{code:java}
val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
val writer = TableWriter.getWriter(
  tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))

if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
  writer.insertInto(tableName)
else
  writer.partitionBy(config.dateColumn).saveAsTable(tableName)
{code}
This code checks whether the table exists at the desired path (somewhere in S3 in this
case). If the table already exists at that path, a new partition is inserted with the
insertInto() function.
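For reference, here is a standalone sketch of the same flow without our internal helpers (TableWriter, xsc, config). It assumes a Hive-enabled SparkSession, and the table name, column names, and data are just placeholders:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder()
  .appName("insertInto-existing-partition-column")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Placeholder source data; date_ymd already exists in the source schema.
val tableDF = Seq(("a", 174, "2020-01-01"), ("b", 62, "2020-01-01"))
  .toDF("id", "some_int", "date_ymd")

// Placeholder table name in the current (default) database.
val tableName = "my_table_partitioned"
val date = "2020-01-02"

// Overwrite the existing date column with this run's value, then either
// create the partitioned table or append a new partition to it.
val writer = tableDF.withColumn("date_ymd", typedLit[String](date)).write

if (spark.catalog.tableExists(tableName)) writer.insertInto(tableName)
else writer.partitionBy("date_ymd").saveAsTable(tableName)
{code}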
If config.dateColumn does not exist in the table schema, there is no problem (the new
column is simply added). But if it already exists in the schema, Spark does not use the
given column as the partition key; instead it creates hundreds of partitions. Below is a
part of the Spark logs:
(Note that the partition column is named date_ymd, which already exists in the source
table. The original value is a date string like '2020-01-01'.)
{noformat}
20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
{noformat}
When I use a different partition key that is not in the table schema, such as
'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I
just wrote the report. (I think it is related to Hive...)
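One possibly relevant detail from the DataFrameWriter docs: insertInto() resolves columns by position rather than by name, and a table created with partitionBy()/saveAsTable() keeps its partition columns at the end of the schema. So if the DataFrame's column order does not match the table's, values from some other column could end up as the partition values, which would be consistent with partitions like date_ymd=174. I have not confirmed that this is what happens here, but if it is, reordering the columns to match the target table before calling insertInto() might work around it. A sketch (untested, reusing the placeholder names from the snippet above):
{code:java}
import org.apache.spark.sql.functions.{col, typedLit}

// Untested workaround sketch: insertInto() matches columns by position, so
// select the columns in the exact order of the target table's schema
// (partition column last) before inserting.
val dfWithDate = tableDF.withColumn("date_ymd", typedLit[String](date))
val aligned = dfWithDate.select(spark.table(tableName).columns.map(col): _*)
aligned.write.insertInto(tableName)
{code}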
Thanks for reading!