sstimmel opened a new issue, #6798: URL: https://github.com/apache/hudi/issues/6798
**Describe the problem you faced**

I am testing out partitioning a dataset by `eventTime`, a timestamp column, but I only want day precision for the partitioning. Is there a way to read back the original value from Hudi instead of the truncated value?

**To Reproduce**

Configs:

```
hoodie.datasource.write.recordkey.field=companyId
hoodie.datasource.write.precombine.field=eventTime
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.enable=false
hoodie.datasource.write.drop.partition.columns=false
hoodie.deltastreamer.source.dfs.root=s3://blah/tenantconfig/raw
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.datasource.write.partitionpath.field=eventTime
hoodie.datasource.write.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.write.keygen.timebased.timezone=UTC
hoodie.datasource.write.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.datasource.write.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timezone=UTC
hoodie.deltastreamer.source.s3incr.fs.prefix=s3a
hoodie.index.type=GLOBAL_SIMPLE
hoodie.simple.index.update.partition.path=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=200
hoodie.keep.min.commits=250
hoodie.keep.max.commits=500
hoodie.allow.empty.commit=false
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
```

If I read a particular partition folder in parquet format, I can get the original `eventTime` values:

```scala
spark.read.format("parquet").load(testPath + "/2022-06-14").createOrReplaceTempView("t")
spark.sql("select eventId, eventTime, companyId from t").show(10, false)
```

```
+------------------------------------+-----------------------+---------+
|eventId                             |eventTime              |companyId|
+------------------------------------+-----------------------+---------+
|f08eae6d-6103-4b4d-8d3f-348477ab055c|2022-06-14 05:34:49.128|1285302  |
|49ecd5b2-c782-482f-b796-b008b9091d8b|2022-06-14 05:34:52.83 |1285306  |
|b6eab34e-9e7d-4365-87ef-36086e18a3a0|2022-06-14 11:00:30.96 |1285489  |
|1697c79d-0180-42bc-89f8-e29d3bb806c7|2022-06-14 08:27:49.169|1285375  |
|6ecf4ffe-a937-4d3e-928e-3edfb09becdd|2022-06-14 08:28:21.978|1285379  |
|cc774a92-ee81-4e41-9228-b636af58e48c|2022-06-14 05:34:07.788|1285261  |
|c26f12ba-8b6a-4eef-a9ff-3051d65f72d2|2022-06-14 11:02:37.454|1285492  |
|e70af180-fd97-48d8-9ab2-d386154f9aad|2022-06-14 08:28:24.475|1285380  |
|7b3be223-05a2-4a77-9136-899eb2fb05d7|2022-06-14 08:31:14.847|1285383  |
|29c9afa0-5aa9-4c4d-972b-542ea1762daa|2022-06-14 08:31:16.055|1285385  |
+------------------------------------+-----------------------+---------+
only showing top 10 rows
```

Reading in hudi format with the following option still returns the value as a string:

```scala
val df = spark.read.option("hoodie.datasource.read.extract.partition.values.from.path", "false").format("org.apache.hudi").load(testPath)
df.printSchema()
spark.read.format("org.apache.hudi").option("hoodie.datasource.read.extract.partition.values.from.path", "false").load(testPath).createOrReplaceTempView("temp2")
spark.sql("select _hoodie_partition_path, eventId, eventTime, companyId from temp2").show(10, false)
```

```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- eventId: string (nullable = true)
 |-- companyId: long (nullable = true)
 |-- configId: string (nullable = true)
 |-- tenantType: string (nullable = true)
 |-- label: string (nullable = true)
 |-- propertyId: long (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- deleted: timestamp (nullable = true)
 |-- eventTime: string (nullable = true)

+----------------------+------------------------------------+----------+---------+
|_hoodie_partition_path|eventId                             |eventTime |companyId|
+----------------------+------------------------------------+----------+---------+
|2022-08-27            |6ec9a519-40a3-4955-be63-df42627d9898|2022-08-27|600953   |
|2022-08-27            |859c5d1e-f458-44e1-a14f-de2a9db16bc2|2022-08-27|223727   |
|2022-08-27            |797f4c95-c5f3-4034-9c09-d55d0d24f0fc|2022-08-27|730148   |
|2022-08-27            |c02cb2be-1113-44a4-8d60-53ad77873da3|2022-08-27|413799   |
|2022-08-27            |a276685f-22a3-4a21-94fd-5f4242216abd|2022-08-27|824036   |
|2022-08-27            |d1043a46-f3bf-46c3-b3ff-d1f64f2f6829|2022-08-27|647835   |
|2022-08-27            |a8d1f925-f55e-4900-a266-491d496e5f0e|2022-08-27|187089   |
|2022-08-27            |2ee0f31c-5691-4a6b-af46-50652aa8617e|2022-08-27|683024   |
|2022-08-27            |cbb24162-b41d-4c98-baa6-6aae59123753|2022-08-27|780756   |
|2022-08-27            |82e22107-5393-4be0-8d92-6e0f71757f67|2022-08-27|203468   |
+----------------------+------------------------------------+----------+---------+
only showing top 10 rows
```

**Expected behavior**

With the option `hoodie.datasource.read.extract.partition.values.from.path=false`, shouldn't it read `eventTime` from the parquet file instead of from the path value?

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.2 (and 3.1)
* Hive version :
* Hadoop version : (3.3)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : no
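For context on why the partition value is day-truncated: the key generator configs above say to interpret the scalar `eventTime` as microseconds since the epoch (UTC) and format it with `yyyy-MM-dd` to build the partition path. Below is a minimal plain-JVM sketch of that conversion, not Hudi code; `toPartitionPath` is a hypothetical helper written for illustration, not a Hudi API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {

    // Mirrors what the SCALAR/microseconds/UTC/yyyy-MM-dd settings describe:
    // treat the value as epoch microseconds and keep only the calendar day.
    static String toPartitionPath(long epochMicros) {
        Instant instant = Instant.ofEpochSecond(
                epochMicros / 1_000_000L,            // whole seconds
                (epochMicros % 1_000_000L) * 1_000L); // remainder as nanos
        return DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(ZoneOffset.UTC)
                .format(instant);
    }

    public static void main(String[] args) {
        // 2022-06-14 05:34:49.128 UTC expressed as epoch microseconds
        long eventTimeMicros = 1655184889128000L;
        System.out.println(toPartitionPath(eventTimeMicros)); // prints 2022-06-14
    }
}
```

The sub-day part of the timestamp is discarded in the formatted output, which is why reading the column back from the path (or from a table whose schema carries the formatted string) can only ever yield the truncated value.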
