nsivabalan commented on code in PR #9273:
URL: https://github.com/apache/hudi/pull/9273#discussion_r1272554388
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -115,14 +112,16 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
// Note that key generator class name could be null
val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
if
(classOf[TimestampBasedKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
- ||
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
{
+ ||
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+ ||
classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+ ||
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
{
Review Comment:
we might need to consider one more thing here.
looks like for timestamp based partition field, we are harding the data type
to string.
so, within custom key gen, there could be more than 1 field. and so we need
to intercept the type of each field and only hard code the data type to string
if its of timestamp type.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -342,6 +342,20 @@ object HoodieFileIndex extends Logging {
if (listingModeOverride != null) {
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
listingModeOverride)
}
+ val tableConfig = metaClient.getTableConfig
+ val partitionColumns = tableConfig.getPartitionFields
+ if (partitionColumns.isPresent) {
+ val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
+ // NOTE: A custom key generator with multiple fields could have
non-encoded slashes in the partition columns'
+ // value. We might not be able to properly parse partition-values
from the listed partition-paths. Fallback
+ // to eager listing in this case.
+ val isCustomKeyGenerator =
(classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+ ||
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
+ val hasMultiplePartitionFields = partitionColumns.get().length > 1
+ if (hasMultiplePartitionFields && isCustomKeyGenerator) {
+
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
DataSourceReadOptions.FILE_INDEX_LISTING_MODE_EAGER)
+ }
+ }
Review Comment:
should we also consider complexKeyGen w/ multiple fields.
so all that matters here is, if there are more than 1 partitionColumns. we
don't really need to check for key gen class.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -115,14 +112,16 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
// Note that key generator class name could be null
val keyGeneratorClassName = tableConfig.getKeyGeneratorClassName
if
(classOf[TimestampBasedKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
- ||
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
{
+ ||
classOf[TimestampBasedAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+ ||
classOf[CustomKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName)
+ ||
classOf[CustomAvroKeyGenerator].getName.equalsIgnoreCase(keyGeneratorClassName))
{
Review Comment:
we can take it as a follow up as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]