Woong Seok Kang created SPARK-30769:
---------------------------------------
Summary: insertInto() with existing column as partition key causes weird partition result
Key: SPARK-30769
URL: https://issues.apache.org/jira/browse/SPARK-30769
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Environment: EMR 5.29.0 with Spark 2.4.4
Reporter: Woong Seok Kang
{code:java}
val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
val writer = TableWriter.getWriter(
  tableDF.withColumn(config.dateColumn, typedLit[String](date.toString)))

if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned"))
  writer.insertInto(tableName)
else
  writer.partitionBy(config.dateColumn).saveAsTable(tableName)
{code}
This code checks whether the table exists at the desired path (somewhere in S3 in this
case). If the table already exists at that path, a new partition is inserted with the
insertInto() function.
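For reference, here is a standalone sketch of the same flow without our internal helpers (TableWriter, xsc, config). It assumes a Hive-enabled SparkSession, and the table name, column names, and data are just placeholders:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

val spark = SparkSession.builder()
  .appName("insertInto-existing-partition-column")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Placeholder source data; date_ymd already exists in the source schema.
val tableDF = Seq(("a", 174, "2020-01-01"), ("b", 62, "2020-01-01"))
  .toDF("id", "some_int", "date_ymd")

// Placeholder table name in the current (default) database.
val tableName = "my_table_partitioned"
val date = "2020-01-02"

// Overwrite the existing date column with this run's value, then either
// create the partitioned table or append a new partition to it.
val writer = tableDF.withColumn("date_ymd", typedLit[String](date)).write

if (spark.catalog.tableExists(tableName)) writer.insertInto(tableName)
else writer.partitionBy("date_ymd").saveAsTable(tableName)
{code}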
If config.dateColumn does not exist in the table schema, there is no problem (the new
column is simply added). But if it already exists in the schema, Spark does not use the
given column as the partition key; instead it creates hundreds of partitions. Below is a
part of the Spark logs:
(Note that the partition column is named date_ymd, which already exists in the source
table. The original value is a date string like '2020-01-01'.)
{noformat}
20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://{my_path_at_s3}_partitioned_test/date_ymd=174
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://{my_path_at_s3}_partitioned_test/date_ymd=62
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://{my_path_at_s3}_partitioned_test/date_ymd=83
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://{my_path_at_s3}_partitioned_test/date_ymd=231
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://{my_path_at_s3}_partitioned_test/date_ymd=268
20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://{my_path_at_s3}_partitioned_test/date_ymd=33
20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://{my_path_at_s3}_partitioned_test/date_ymd=40
rename s3://{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__
{noformat}
When I use a different partition key that is not in the table schema, such as
'stamp_date', everything works fine. I'm not sure whether this is a Spark bug; I
just wrote the report. (I think it is related to Hive...)
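One possibly relevant detail from the DataFrameWriter docs: insertInto() resolves columns by position rather than by name, and a table created with partitionBy()/saveAsTable() keeps its partition columns at the end of the schema. So if the DataFrame's column order does not match the table's, values from some other column could end up as the partition values, which would be consistent with partitions like date_ymd=174. I have not confirmed that this is what happens here, but if it is, reordering the columns to match the target table before calling insertInto() might work around it. A sketch (untested, reusing the placeholder names from the snippet above):
{code:java}
import org.apache.spark.sql.functions.{col, typedLit}

// Untested workaround sketch: insertInto() matches columns by position, so
// select the columns in the exact order of the target table's schema
// (partition column last) before inserting.
val dfWithDate = tableDF.withColumn("date_ymd", typedLit[String](date))
val aligned = dfWithDate.select(spark.table(tableName).columns.map(col): _*)
aligned.write.insertInto(tableName)
{code}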
Thanks for reading!