garyli1019 commented on a change in pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#discussion_r578205722
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
##########
@@ -504,6 +506,42 @@ class TestMORDataSource extends HoodieClientTestBase {
hudiSnapshotDF2.show(1)
}
+ @Test
+ def testPrunePartitions() {
+ // First Operation:
+ // Producing parquet files into three hive-style partitions like /partition=20150316/.
+ // SNAPSHOT view on MOR table with parquet files only.
+ dataGen.setPartitionPaths(Array("20150316","20150317","20160315"));
+ val records1 = recordsToStrings(dataGen.generateInserts("001", 100)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
Review comment:
Thanks for the test, I see how it works now. We are actually talking
about two different partition pruning mechanisms.
In this PR, the input data frame already has a `partition` field before
the write, and Spark maps that `partition` field to the actual partition
folder in the file system.
The mechanism I was talking about is the one where the input data frame
does not have a `partition` field, and Spark appends a `partition` field
while reading. This is how partition discovery works for Parquet files:
https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery.
If the data frame already has a field with the same name, Spark will throw an error.
@yui2010 Did I understand correctly?
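To make the second mechanism concrete, here is a toy sketch in plain Python (not Spark; the helper names are hypothetical) of what hive-style partition discovery does: it recovers `key=value` segments from the file path, appends them as columns on read, and fails if the data already carries a column with the same name.

```python
def discover_partition(path):
    """Extract key=value partition segments from a hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

def read_with_discovery(path, rows):
    """Append discovered partition columns to each row, roughly as
    Spark does on read. Raises if a row already has a column with the
    same name, mirroring the conflict described above."""
    discovered = discover_partition(path)
    out = []
    for row in rows:
        clash = set(row) & set(discovered)
        if clash:
            raise ValueError(f"column(s) {clash} already exist in the data")
        out.append({**row, **discovered})
    return out

rows = read_with_discovery(
    "/base/partition=20150316/file.parquet",
    [{"id": 1}, {"id": 2}],
)
print(rows[0])  # {'id': 1, 'partition': '20150316'}
```

In the first mechanism (the one this PR tests), the `partition` column is already present in the input and is consumed by the writer to choose the folder, so no column is appended on read and no conflict arises.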
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]