Re: [PR] [HUDI-5807] read partition values from file and create infra to support reading only a subset of columns [hudi]

via GitHub Fri, 16 Aug 2024 14:50:46 -0700


yihua commented on code in PR #11770:
URL: https://github.com/apache/hudi/pull/11770#discussion_r1720385904



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -271,7 +271,7 @@ class 
HoodieMergeOnReadIncrementalHadoopFsRelationFactory(override val sqlContex
                                                           isBootstrap: Boolean)
   extends HoodieMergeOnReadSnapshotHadoopFsRelationFactory(sqlContext, 
metaClient, options, schemaSpec, isBootstrap) {
 
-  override val mandatoryFields: Seq[String] = 
Seq(HoodieRecord.COMMIT_TIME_METADATA_FIELD) ++ mandatoryFieldsForMerging
+  override val mandatoryFields: Seq[String] = 
Seq(HoodieRecord.COMMIT_TIME_METADATA_FIELD) ++ (if (isBootstrap) Seq.empty 
else partitionColumnsToRead.toSeq)

Review Comment:
   You mentioned that "all this (mandatoryFieldsForMerging) is computed in the 
fg reader so we don't need this".  Do we need to move `mandatoryFields` logic 
to the file group reader too (OK to keep it in this PR; sth to consider for 
maintenability and cross-engine support, e.g., right now, reading partition 
values from data files is only supported in Spark?).



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val 
sqlContext: SQLContext,
   protected lazy val basePath: StoragePath = metaClient.getBasePath
   protected lazy val partitionColumns: Array[String] = 
tableConfig.getPartitionFields.orElse(Array.empty)
 
+  //placeholder until pr to persist field types lands
+  protected lazy val partitionColumnsToRead: Array[String] = if 
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else 
tableConfig.getPartitionFields.orElse(Array.empty)
+

Review Comment:
   Could we add a logic in this PR that only used 
`tableConfig.getPartitionFields` if timestamp or custom key generator is used, 
since other key generators do not change the partition value so it does not 
matter whether the partition values come from the file path or the data file?



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val 
sqlContext: SQLContext,
   protected lazy val basePath: StoragePath = metaClient.getBasePath
   protected lazy val partitionColumns: Array[String] = 
tableConfig.getPartitionFields.orElse(Array.empty)
 
+  //placeholder until pr to persist field types lands
+  protected lazy val partitionColumnsToRead: Array[String] = if 
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else 
tableConfig.getPartitionFields.orElse(Array.empty)

Review Comment:
   I think we should deprecate 
`hoodie.datasource.read.extract.partition.values.from.path`.  For 
timestamp-typed partition column, the query should directly read values from 
the data files.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieHadoopFsRelationFactory.scala:
##########
@@ -75,6 +75,9 @@ abstract class HoodieBaseHadoopFsRelationFactory(val 
sqlContext: SQLContext,
   protected lazy val basePath: StoragePath = metaClient.getBasePath
   protected lazy val partitionColumns: Array[String] = 
tableConfig.getPartitionFields.orElse(Array.empty)
 
+  //placeholder until pr to persist field types lands
+  protected lazy val partitionColumnsToRead: Array[String] = if 
(shouldExtractPartitionValuesFromPartitionPath) Array.empty else 
tableConfig.getPartitionFields.orElse(Array.empty)
+

Review Comment:
   In @lokeshj1703 's PR, we can further restrain the condition if the 
simple-typed partition is used in custom key generator.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-5807] read partition values from file and create infra to support reading only a subset of columns [hudi]

Reply via email to