yihua commented on code in PR #11770:
URL: https://github.com/apache/hudi/pull/11770#discussion_r1720385904
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -271,7 +271,7 @@ class
HoodieMergeOnReadIncrementalHadoopFsRelationFactory(override val sqlContex
isBootstrap: Boolean)
extends HoodieMergeOnReadSnapshotHadoopFsRelationFactory(sqlContext,
metaClient, options, schemaSpec, isBootstrap) {
- override val mandatoryFields: Seq[String] =
Seq(HoodieRecord.COMMIT_TIME_METADATA_FIELD) ++ mandatoryFieldsForMerging
+ override val mandatoryFields: Seq[String] =
Seq(HoodieRecord.COMMIT_TIME_METADATA_FIELD) ++ (if (isBootstrap) Seq.empty
else partitionColumnsToRead.toSeq)
Review Comment:
You mentioned that "all this (mandatoryFieldsForMerging) is computed in the
fg reader so we don't need this". Do we need to move `mandatoryFields` logic
to the file group reader too (OK to keep it in this PR; sth to consider for
maintenability and cross-engine support, e.g., right now, reading partition
values from data files is only supported in Spark?).
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val
sqlContext: SQLContext,
protected lazy val basePath: StoragePath = metaClient.getBasePath
protected lazy val partitionColumns: Array[String] =
tableConfig.getPartitionFields.orElse(Array.empty)
+ //placeholder until pr to persist field types lands
+ protected lazy val partitionColumnsToRead: Array[String] = if
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else
tableConfig.getPartitionFields.orElse(Array.empty)
+
Review Comment:
Could we add a logic in this PR that only used
`tableConfig.getPartitionFields` if timestamp or custom key generator is used,
since other key generators do not change the partition value so it does not
matter whether the partition values come from the file path or the data file?
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val
sqlContext: SQLContext,
protected lazy val basePath: StoragePath = metaClient.getBasePath
protected lazy val partitionColumns: Array[String] =
tableConfig.getPartitionFields.orElse(Array.empty)
+ //placeholder until pr to persist field types lands
+ protected lazy val partitionColumnsToRead: Array[String] = if
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else
tableConfig.getPartitionFields.orElse(Array.empty)
Review Comment:
I think we should deprecate
`hoodie.datasource.read.extract.partition.values.from.path`. For
timestamp-typed partition column, the query should directly read values from
the data files.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val
sqlContext: SQLContext,
protected lazy val basePath: StoragePath = metaClient.getBasePath
protected lazy val partitionColumns: Array[String] =
tableConfig.getPartitionFields.orElse(Array.empty)
+ //placeholder until pr to persist field types lands
+ protected lazy val partitionColumnsToRead: Array[String] = if
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else
tableConfig.getPartitionFields.orElse(Array.empty)
+
Review Comment:
In @lokeshj1703 's PR, we can further restrain the condition if the
simple-typed partition is used in custom key generator.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]