roczei commented on PR #39595:
URL: https://github.com/apache/spark/pull/39595#issuecomment-1406970882
Hi @czxm,

Thanks for this fix! I tested it with Cloudera Spark 3.3.0, where the Hive version is 3.1.3000 and includes this Hive improvement: HIVE-23556 (Support hive.metastore.limit.partition.request for get_partitions_ps).
Test case 1) default setting: `spark.sql.hasCustomPartitionLocations=true`

Created a hive-site.xml file with the following content:
```
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hive.metastore.limit.partition.request</name>
    <value>3</value>
  </property>
</configuration>
```
and started the spark-shell with:
```
$ ./spark-shell --conf spark.sql.catalogImplementation=hive --jars ./hive-site.xml
```
```
spark.sql("create table student53 (student_name string) using parquet partitioned by (father_name string, percentage float, section string)")
spark.sql("insert into student53 values('dikshant','ashok', 95, 'A')")
spark.sql("insert into student53 values('raj','ramesh', 90, 'B')")
spark.sql("insert into student53 values('alok','ajit', 85, 'C')")
spark.sql("insert into student53 values('rajendra','naveen', 70, 'D')")

spark.sql("create table student63 (student_name string) using parquet partitioned by (father_name string, percentage float, section string)")
spark.sql("insert into student63 values('dikshant','ashok', 95, 'A')")
spark.sql("insert into student63 values('raj','ramesh', 90, 'B')")
spark.sql("insert into student63 values('alok','ajit', 85, 'C')")
spark.sql("insert into student63 values('rajendra','naveen', 70, 'D')")
```
Query 1)
```
spark.sql("insert overwrite table student63 partition (section, father_name, percentage) select * from student53 where section='A'")
```
failed with this error (this is expected):
```
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Number of
partitions scanned (=4) on table 'student31' exceeds limit (=3). This is
controlled on the metastore server by metastore.limit.partition.request.
```
Query 2)
```
spark.sql("insert overwrite table student63 select * from student53 where section='C'")
```
failed with this error (this is expected):
```
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Number of
partitions scanned (=4) on table 'student31' exceeds limit (=3). This is
controlled on the metastore server by metastore.limit.partition.request.
```
The number of partitions is 4, and the statement triggers a full partition scan, so we reach the hive.metastore.limit.partition.request limit, which is currently 3.
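To double-check the partition count, the source table can be listed from the same spark-shell session (a quick sketch against the tables created above; note that listing partitions also goes through the metastore, so depending on the metastore version it may be subject to the same limit):
```
// Sketch: count the partitions of the repro table created above.
val parts = spark.sql("show partitions student53").collect()
println(s"student53 has ${parts.length} partitions")
// With the four inserts above this should report 4 partitions,
// one more than hive.metastore.limit.partition.request=3.
```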
Test case 2)

Added the extra parameter that you introduced in this pull request (`spark.sql.hasCustomPartitionLocations=false`) to the spark-shell command:
```
$ ./spark-shell --conf spark.sql.catalogImplementation=hive --jars ./hive-site.xml --conf spark.sql.hasCustomPartitionLocations=false
```
Results:
```
scala> spark.sql("insert overwrite table student63 partition (section, father_name, percentage) select * from student53 where section='A'")
res10: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert overwrite table student63 select * from student53 where section='C'")
res11: org.apache.spark.sql.DataFrame = []
```
As you can see, it works for me and resolves the above Hive limit problem.

Before we involve the Spark committers, please create a unit test that covers this new configuration parameter (`spark.sql.hasCustomPartitionLocations=false`). For example, I would add it to this test file:
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
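To illustrate, here is a rough sketch of what such a test could look like in that suite, following its existing `runSparkSubmit` pattern (the test name, driver object, and conf values are placeholders only, not a working test):
```
// Illustrative sketch only -- the driver object and conf values are placeholders.
test("insert overwrite with spark.sql.hasCustomPartitionLocations=false") {
  val unusedJar = TestUtils.createJarWithClasses(Seq.empty)
  val args = Seq(
    "--class", "org.apache.spark.sql.hive.PartitionLocationsDriver", // hypothetical driver object
    "--name", "HasCustomPartitionLocationsTest",
    "--master", "local-cluster[2,1,1024]",
    "--conf", "spark.ui.enabled=false",
    "--conf", "spark.sql.hasCustomPartitionLocations=false",
    unusedJar.toString)
  runSparkSubmit(args)
}
```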
Related document:
https://spark.apache.org/developer-tools.html#running-individual-tests
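For reference, the linked page describes running a single suite with sbt; for this file the invocation would look roughly like this (assuming the `hive` sbt project name):
```
./build/sbt "hive/testOnly *HiveSparkSubmitSuite"
```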
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]