garyli1019 commented on a change in pull request #2378:
URL: https://github.com/apache/hudi/pull/2378#discussion_r578205722
##########
File path:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
##########
@@ -504,6 +506,42 @@ class TestMORDataSource extends HoodieClientTestBase {
hudiSnapshotDF2.show(1)
}
+ @Test
+ def testPrunePartitions() {
+ // First Operation:
+ // Producing parquet files into three hive-style partitions like /partition=20150316/.
+ // SNAPSHOT view on MOR table with parquet files only.
+ dataGen.setPartitionPaths(Array("20150316","20150317","20160315"));
+ val records1 = recordsToStrings(dataGen.generateInserts("001", 100)).toList
+ val inputDF1 = spark.read.json(spark.sparkContext.parallelize(records1, 2))
+ inputDF1.write.format("org.apache.hudi")
Review comment:
Thanks for the test, I see how it works now. We are actually talking
about two different partition pruning mechanisms.
In this PR, the input data frame already has a `partition` field before
the write, and Spark maps that `partition` field to the actual partition
folder in the file system.
The mechanism I was talking about is the one where the input data frame
does not have a `partition` field, and Spark appends a `partition` field
while reading. This is how partition discovery works for Parquet files:
https://spark.apache.org/docs/2.4.0/sql-data-sources-parquet.html#partition-discovery.
If the data frame already has a field with the same name, Spark will throw an error.
@yui2010 Did I understand correctly?
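To make the second mechanism concrete, here is a toy sketch in plain Python (not Spark; the helper names are hypothetical) of what hive-style partition discovery does: it recovers `key=value` segments from the file path, appends them as columns on read, and fails if the data already carries a column with the same name.

```python
def discover_partition(path):
    """Extract key=value partition segments from a hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

def read_with_discovery(path, rows):
    """Append discovered partition columns to each row, roughly as
    Spark does on read. Raises if a row already has a column with the
    same name, mirroring the conflict described above."""
    discovered = discover_partition(path)
    out = []
    for row in rows:
        clash = set(row) & set(discovered)
        if clash:
            raise ValueError(f"column(s) {clash} already exist in the data")
        out.append({**row, **discovered})
    return out

rows = read_with_discovery(
    "/base/partition=20150316/file.parquet",
    [{"id": 1}, {"id": 2}],
)
print(rows[0])  # {'id': 1, 'partition': '20150316'}
```

In the first mechanism (the one this PR tests), the `partition` column is already present in the input and is consumed by the writer to choose the folder, so no column is appended on read and no conflict arises.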
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]