hudi-bot opened a new issue, #14994:
URL: https://github.com/apache/hudi/issues/14994

   Currently, if a partition value has a format like
"pt1=xxxx/pt2=yyyy/pt3=zzzz", with segments separated by slashes, Hudi
partitions on each segment automatically. The table directory then has a
multi-level partition structure.
   
   I think this behavior is unpredictable, so I am creating this umbrella task
to improve auto-partitioning and make the behavior more reasonable.
   
   Also, in Hudi 0.8 the schema holds `pt1`, `pt2`, and `pt3`, but in 0.9+ it does not.
   
   There are a few subtasks:
    * Add a flag to control whether auto-partitioning is enabled, so that the
default behavior is reasonable.
    * Implement a new key generator designed specifically for this scenario.
    * Fix the bug where the schema differs depending on whether
*hoodie.file.index.enable* is enabled in this case.
   
    
   
   Test code:
   {code:scala}
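   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   // spark-shell imports this automatically; it is listed for use outside the shell.
   import org.apache.spark.sql.functions.regexp_replace
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._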
   
   val tableName = "hudi_trips_cow"
   val basePath = "file:///tmp/hudi_trips_cow"
   val dataGen = new DataGenerator
   val inserts = convertToStringList(dataGen.generateInserts(10))
   
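   val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
   // Rewrite "a/b/c" into "continent=a/country=b/city=c" so that the
   // partition value itself contains slashes.
   val newDf = df.withColumn("partitionpath", regexp_replace($"partitionpath",
     "(.*)(\\/){1}(.*)(\\/){1}", "continent=$1$2country=$3$4city="))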
   
   newDf.write.format("hudi").
   options(getQuickstartWriteConfigs).
   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
   option(TABLE_NAME, tableName).
   mode(Overwrite).
   save(basePath)
   {code}
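
To make the effect of the `regexp_replace` call concrete, here is a minimal plain-Scala sketch that applies the same pattern and replacement with `String.replaceAll` and then splits the result on slashes. The object name, helper names, and the sample value `americas/brazil/sao_paulo` are assumptions for illustration. The sketch only mimics the string handling; it does not call Hudi or Spark.

```scala
object PartitionPathDemo {
  // Same pattern and replacement strings as the regexp_replace call in the test code.
  val pattern = "(.*)(\\/){1}(.*)(\\/){1}"
  val replacement = "continent=$1$2country=$3$4city="

  // Rewrites a three-segment path "a/b/c" into "continent=a/country=b/city=c".
  def toKeyValuePath(path: String): String = path.replaceAll(pattern, replacement)

  // Each slash-separated segment corresponds to one directory level under the base path.
  def levels(path: String): Seq[String] = path.split("/").toSeq

  def main(args: Array[String]): Unit = {
    val sample = "americas/brazil/sao_paulo" // assumed sample partition value
    val rewritten = toKeyValuePath(sample)
    println(rewritten)
    levels(rewritten).foreach(level => println(s"  level: $level"))
  }
}
```

Because the rewritten value still contains two slashes, writing it as a single partition path field yields three nested directory levels, which is the behavior this issue describes.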
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3214
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   28/Feb/22 14:16;xushiyan;[[email protected]] what is the plan for this 
ticket? is it still a valid improvement?;;;
   
   ---
   
   01/Mar/22 02:41;[email protected];[~xushiyan] [~shivnarayan] I think no
new configs or key generator are needed here. I plan to enable
`hoodie.datasource.write.partitionpath.urlencode` and
`hoodie.datasource.write.hive_style_partitioning` by default. If users want
to auto-discover partitions from the partitionpath, they can disable
`hoodie.datasource.write.partitionpath.urlencode`.;;;
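
The following sketch illustrates why URL-encoding the partition path would stop a value from being split into nested directories: once the slashes are escaped, the value is a single path segment. It uses `java.net.URLEncoder` as a stand-in, which is an assumption. Hudi's own escaping for `hoodie.datasource.write.partitionpath.urlencode` may treat some characters differently. The object and helper names are hypothetical.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object UrlEncodeDemo {
  // Stand-in for partition path encoding; not Hudi's actual implementation.
  def encode(partitionPath: String): String =
    URLEncoder.encode(partitionPath, StandardCharsets.UTF_8.name())

  // Number of directory levels the value would occupy if used as a relative path.
  def depth(path: String): Int = path.split("/").length

  def main(args: Array[String]): Unit = {
    val raw = "continent=americas/country=brazil/city=sao_paulo" // assumed sample value
    val encoded = encode(raw)
    println(s"raw:     $raw (depth ${depth(raw)})")
    println(s"encoded: $encoded (depth ${depth(encoded)})")
  }
}
```

With encoding on, the whole value maps to one directory. With encoding off, the slashes remain and the nested layout from the issue description is kept, which matches the opt-out described in the comment above.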
   
   ---
   
   20/Aug/22 17:20;xushiyan;need to triage if this is resolved already;;;

