[GitHub] [hudi] nsivabalan edited a comment on issue #4784: [SUPPORT] Partition column not appearing in spark dataframe

GitBox Fri, 11 Feb 2022 07:28:03 -0800


nsivabalan edited a comment on issue #4784:
URL: https://github.com/apache/hudi/issues/4784#issuecomment-1036330505



   I could not reproduce the partitioning issue you are facing. My partition 
path is well formed, and the two original columns I used to generate the 
partition column are present in the table as well.
   
   Local spark-shell session:
   ```
   
   import java.sql.Timestamp
   import spark.implicits._
   
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   
   val df1 = Seq(
           ("row1", 1, "part1" ,1578283932000L ),
           ("row2", 1, "part1", 1578283942000L)
         ).toDF("row", "ppath", "preComb","eventTime")
   
   
   df1.write.format("hudi").
           options(getQuickstartWriteConfigs).
           option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
           option(RECORDKEY_FIELD_OPT_KEY, "row").
           option(PARTITIONPATH_FIELD_OPT_KEY, "preComb:simple,ppath:timestamp").
           option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.CustomKeyGenerator").
           option("hoodie.deltastreamer.keygen.timebased.timestamp.type","EPOCHMILLISECONDS").
           option("hoodie.deltastreamer.keygen.timebased.output.dateformat","yyyy-MM-dd").
           option("hoodie.deltastreamer.keygen.timebased.timezone","GMT+8:00").
           option(TABLE_NAME, "timestamp_tbl4").
           mode(Overwrite).
           save("/tmp/hudi_timestamp_tbl4")
   
   
   val hudiDF4 = spark.read.format("hudi").load("/tmp/hudi_timestamp_tbl4")
   hudiDF4.createOrReplaceTempView("tbl4") // registerTempTable is deprecated since Spark 2.0
   spark.sql("describe tbl4").show()
   spark.sql("select * from tbl4 limit 3").show()
   
   ```
   
   Output
   ```
   spark.sql("select * from tbl4 limit 3").show()
   
   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| row|    eventTime|preComb|ppath|
   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   |  20220211102107283|20220211102107283...|              row1|      part1/1970-01-01|dfc23d4b-8177-4fa...|row1|1578283932000|  part1|    0|
   |  20220211102107283|20220211102107283...|              row2|      part1/1970-01-01|dfc23d4b-8177-4fa...|row2|1578283942000|  part1|    0|
   +-------------------+--------------------+------------------+----------------------+--------------------+----+-------------+-------+-----+
   ```
   
   Specifically, the value of _hoodie_partition_path for both rows is 
   part1/1970-01-01
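   
   The date segment comes out as 1970-01-01 presumably because the timestamp 
part of the partition path is built from `ppath`, which holds a small integer, 
rather than from `eventTime`. Below is a minimal sketch of that conversion 
using plain `java.time` with the same date format and timezone as the write 
above. It is not Hudi's key generator code, and the object and method names 
are made up for illustration.
   
   ```scala
   import java.time.{Instant, ZoneId}
   import java.time.format.DateTimeFormatter
   
   // Illustration only: plain java.time, not Hudi's key generator implementation.
   object EpochMillisToPartition {
     // Same settings as in the write above: yyyy-MM-dd rendered in GMT+8:00.
     private val formatter =
       DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneId.of("GMT+08:00"))
   
     // Interpret the value as epoch milliseconds and format it as a date string.
     def datePartition(epochMillis: Long): String =
       formatter.format(Instant.ofEpochMilli(epochMillis))
   
     def main(args: Array[String]): Unit = {
       println(datePartition(1L))             // a ppath-like value: 1970-01-01
       println(datePartition(1578283932000L)) // an eventTime value: 2020-01-06
     }
   }
   ```
   
   If a date derived from the event time is what you want in the path, the 
timestamp-typed partition field would have to be `eventTime` instead of 
`ppath`.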
   
   2: If you disable hive-style partitioning, you won't see the "fieldname=" 
prefix in the partition path. If you do enable it, I don't think Hudi allows 
changing the field name used in the partition path. 
   3: I am not sure how to leverage partition pruning for tables written with 
a custom key generator. @xushiyan @YannByron @bhasudha @codope : do you folks 
have any pointers here?
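   
   For point 2, hive-style partitioning is controlled by the write option 
`hoodie.datasource.write.hive_style_partitioning`. The difference between the 
two path layouts can be sketched as below. The helper is hypothetical, not a 
Hudi API, and it assumes that each path segment is prefixed with the name of 
the source field.
   
   ```scala
   // Illustration only: a hypothetical helper, not part of Hudi.
   object PartitionPathLayout {
     // fields: (fieldName, alreadyFormattedValue) pairs in partition order.
     def partitionPath(fields: Seq[(String, String)], hiveStyle: Boolean): String =
       fields.map { case (name, value) =>
         // Hive style prefixes every segment with "fieldname=".
         if (hiveStyle) s"$name=$value" else value
       }.mkString("/")
   
     def main(args: Array[String]): Unit = {
       val fields = Seq("preComb" -> "part1", "ppath" -> "1970-01-01")
       println(partitionPath(fields, hiveStyle = false)) // part1/1970-01-01
       println(partitionPath(fields, hiveStyle = true))  // preComb=part1/ppath=1970-01-01
     }
   }
   ```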
   
   
   
   
   

