hudi-bot opened a new issue, #14994:
URL: https://github.com/apache/hudi/issues/14994
Recently, if a partition value has a format like
"pt1=xxxx/pt2=yyyy/pt3=zzzz", i.e. segments separated by slashes, Hudi splits it
into partitions automatically, and the table's directory ends up with a
multi-level partition structure.
I think this is unpredictable, so I am creating this umbrella task to improve
auto-partitioning and make the behavior more reasonable.
Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, `pt3`, but in 0.9+ it does not.
There are a few sub-tasks:
* Add a flag to control whether auto-partitioning is enabled, so that the
default behavior is reasonable.
* Implement a new key generator designed specifically for this scenario.
* Fix the bug where the schema differs depending on whether
`hoodie.file.index.enable` is enabled in this case.
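
To make the reported behavior concrete, here is a minimal plain-Scala sketch of how a single slash-separated partition value ends up as several directory levels, and which column names a Hive-style partition discovery would then infer from them. `PartitionPathSketch` and its helpers are hypothetical names for illustration only; this is not Hudi's implementation.

```scala
// Illustration only: hypothetical helpers, not Hudi APIs.
object PartitionPathSketch {
  /** Directory levels produced when the value is used verbatim as a relative path. */
  def directoryLevels(partitionValue: String): Seq[String] =
    partitionValue.split("/").filter(_.nonEmpty).toSeq

  /** Column names that `key=value` style directory levels would suggest. */
  def inferredColumns(partitionValue: String): Seq[String] =
    directoryLevels(partitionValue).collect {
      case level if level.contains("=") => level.takeWhile(_ != '=')
    }

  def main(args: Array[String]): Unit = {
    val value = "pt1=xxxx/pt2=yyyy/pt3=zzzz"
    // One logical partition value, but three nested directories on disk.
    println(directoryLevels(value).mkString(" | "))
    // The names a reader could pick up as extra partition columns.
    println(inferredColumns(value).mkString(", "))
  }
}
```

The point of the sketch is that the writer sees one partition field while a reader walking the directory tree can see three, which is where the schema mismatch described in this issue can come from.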
Test code:
```scala
// Run in spark-shell with the Hudi bundle on the classpath.
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
// spark-shell imports this automatically; listed here for completeness.
import org.apache.spark.sql.functions.regexp_replace
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

// Generate 10 sample trip records as JSON strings and load them into a DataFrame.
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

// Rewrite "a/b/c" into "continent=a/country=b/city=c" inside a single column value.
val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath",
  "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))

// Write the table, partitioned by the single rewritten `partitionpath` column.
newDf.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
```
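
One possible way to observe the schema difference mentioned in the sub-tasks is to read the table back with `hoodie.file.index.enable` switched on and off and compare the columns. This is a sketch that assumes the same spark-shell session, with `spark` and `basePath` defined as in the test code above; whether a read without a glob path succeeds when the file index is disabled may depend on the Hudi version.

```scala
// Assumes the spark-shell session (`spark`) and `basePath` from the test code above.
val withIndex = spark.read.format("hudi").
  option("hoodie.file.index.enable", "true").
  load(basePath)

val withoutIndex = spark.read.format("hudi").
  option("hoodie.file.index.enable", "false").
  load(basePath)

// Print both schemas for a side-by-side look.
withIndex.printSchema()
withoutIndex.printSchema()

// Columns present in one read but not in the other.
println(withIndex.columns.toSet.diff(withoutIndex.columns.toSet))
println(withoutIndex.columns.toSet.diff(withIndex.columns.toSet))
```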
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-3214
- Type: Improvement
---
## Comments
28/Feb/22 14:16;xushiyan;[[email protected]] What is the plan for this
ticket? Is it still a valid improvement?;;;
---
01/Mar/22 02:41;[email protected];[~xushiyan] [~shivnarayan] I think no
new configs or key generator are needed here. I plan to enable
`hoodie.datasource.write.partitionpath.urlencode` and
`hoodie.datasource.write.hive_style_partitioning` by default. If users want
partitions to be auto-discovered from the partitionpath, they can disable
`hoodie.datasource.write.partitionpath.urlencode`.;;;
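
As a rough illustration of this proposal, the sketch below collects the two option keys named in the comment and shows why URL-encoding keeps a slash-separated value in a single directory level. `UrlEncodeSketch` is a hypothetical name, and `java.net.URLEncoder` is used only as a stand-in; the exact encoding Hudi applies may differ.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Illustration only: hypothetical object, not part of Hudi.
object UrlEncodeSketch {
  // Option keys quoted from the comment above; "true" reflects the proposed defaults.
  val proposedDefaults: Map[String, String] = Map(
    "hoodie.datasource.write.partitionpath.urlencode" -> "true",
    "hoodie.datasource.write.hive_style_partitioning" -> "true"
  )

  /** Stand-in for the encoding step: slashes are escaped, so no nested directories appear. */
  def encode(partitionValue: String): String =
    URLEncoder.encode(partitionValue, StandardCharsets.UTF_8.name())

  def main(args: Array[String]): Unit = {
    val value = "pt1=xxxx/pt2=yyyy/pt3=zzzz"
    println(encode(value))
    // In a Spark session the options could be passed along with the other write settings:
    //   newDf.write.format("hudi").options(proposedDefaults)...
  }
}
```

With encoding enabled the value stays one path segment; disabling it, as the comment suggests, would leave the slashes intact and let the nested layout be discovered.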
---
20/Aug/22 17:20;xushiyan;Need to triage whether this is already resolved.;;;