[GitHub] [spark] khalidmammadov opened a new pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

GitBox Sun, 06 Feb 2022 15:49:10 -0800


khalidmammadov opened a new pull request #35409:
URL: https://github.com/apache/spark/pull/35409



   ### What changes were proposed in this pull request?
   
   
   The `HiveExternalCatalog.listPartitions` call fails when a partition 
column name contains upper-case characters and the partition value contains a 
dot. The failure is related to this change: 
https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23
   
   The test case in that change does not reproduce the issue because its 
partition column name is lower case.
   
   This change lowercases the partition column name during comparison to 
produce the expected result. This is in line with the actual spec 
transformation, which lowercases the names for Hive, and it uses the same 
function.
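   
   The mismatch can be illustrated with plain Scala maps. This is a minimal 
sketch, not the Spark code: `PartitionSpecSketch`, `matchesStrict` and 
`matchesIgnoringCase` are hypothetical names, and the assumption is that the 
requested spec arrives with lowercased keys while the stored spec keeps the 
original case.
   
   ```scala
   import java.util.Locale

   // Hypothetical stand-in for the comparison; not part of Spark.
   object PartitionSpecSketch {
     type Spec = Map[String, String]

     // Strict lookup: Map.apply throws NoSuchElementException when the
     // column name differs only by case.
     def matchesStrict(partial: Spec, full: Spec): Boolean =
       partial.forall { case (col, value) => full(col) == value }

     // Lowercase the column names on both sides before comparing values.
     def matchesIgnoringCase(partial: Spec, full: Spec): Boolean = {
       val lowered = full.map { case (col, value) =>
         col.toLowerCase(Locale.ROOT) -> value
       }
       partial.forall { case (col, value) =>
         lowered.get(col.toLowerCase(Locale.ROOT)).contains(value)
       }
     }

     def main(args: Array[String]): Unit = {
       val requested = Map("partcol2" -> "i.j")                  // assumed lowercased keys
       val stored = Map("partCol1" -> "CA", "partCol2" -> "i.j") // original case
       println(scala.util.Try(matchesStrict(requested, stored)).isFailure)
       println(matchesIgnoringCase(requested, stored))
     }
   }
   ```
   
   In this sketch the strict variant fails with `key not found: partcol2`, 
while the case-insensitive variant matches the partition.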
   
    
   
   Steps to reproduce the issue:
   ```
   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> import org.apache.spark.sql.catalyst.TableIdentifier
   import org.apache.spark.sql.catalyst.TableIdentifier
   
   scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
   22/02/06 21:10:45 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
   res1: org.apache.spark.sql.DataFrame = []
   
   
   scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
   java.util.NoSuchElementException: key not found: partcol2
     at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
     at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
     at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
     at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
     at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
     at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
     at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
     at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
     at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
     at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
     at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
     ... 47 elided
   
   
   *******AFTER FIX*********
   
   scala> import org.apache.spark.sql.catalyst.TableIdentifier
   import org.apache.spark.sql.catalyst.TableIdentifier
   
   scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
   22/02/06 22:08:11 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
   res1: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
   res2: org.apache.spark.sql.DataFrame = []
   
   
   scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
   CatalogPartition(
        Partition Values: [partCol1=CA, partCol2=i.j]
        Location: file:/home/khalid/dev/oss/test/spark-warehouse/customer/partcol1=CA/partcol2=i.j
        Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Storage Properties: [serialization.format=1]
        Partition Parameters: {rawDataSize=0, numFiles=1, transient_lastDdlTime=1644185314, totalSize=9, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=0}
        Created Time: Sun Feb 06 22:08:34 GMT 2022
        Last Access: UNKNOWN
        Partition Statistics: 9 bytes)
   
   ```
   
   
   ### Why are the changes needed?
   It fixes the `java.util.NoSuchElementException` thrown by `listPartitions` 
in the case described above.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   
   ### How was this patch tested?
   
   `build/sbt -v -d "test:testOnly *CatalogSuite"`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


