sstimmel opened a new issue, #6798: URL: https://github.com/apache/hudi/issues/6798
**Describe the problem you faced**

I am testing out partitioning a dataset by `eventTime`, a timestamp column, but I only want day precision for the partitioning. Is there a way to read back the original value from Hudi instead of the truncated value?

**To Reproduce**

Configs:

```
hoodie.datasource.write.recordkey.field=companyId
hoodie.datasource.write.precombine.field=eventTime
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.hive_style_partitioning=false
hoodie.datasource.hive_sync.enable=false
hoodie.datasource.write.drop.partition.columns=false
hoodie.deltastreamer.source.dfs.root=s3://blah/tenantconfig/raw
hoodie.deltastreamer.source.hoodieincr.read_latest_on_missing_ckpt=true
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.datasource.write.partitionpath.field=eventTime
hoodie.datasource.write.keygen.timebased.timestamp.type=SCALAR
hoodie.datasource.write.keygen.timebased.timezone=UTC
hoodie.datasource.write.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.datasource.write.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=microseconds
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.timezone=UTC
hoodie.deltastreamer.source.s3incr.fs.prefix=s3a
hoodie.index.type=GLOBAL_SIMPLE
hoodie.simple.index.update.partition.path=true
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=200
hoodie.keep.min.commits=250
hoodie.keep.max.commits=500
hoodie.allow.empty.commit=false
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=true
hoodie.parquet.outputtimestamptype=TIMESTAMP_MICROS
```

If I read a particular partition folder in parquet format, I can get the original `eventTime` values:

```scala
spark.read.format("parquet").load(testPath + "/2022-06-14").createOrReplaceTempView("t")
spark.sql("select eventId, eventTime, companyId from t").show(10, false)
```

```
+------------------------------------+-----------------------+---------+
|eventId                             |eventTime              |companyId|
+------------------------------------+-----------------------+---------+
|f08eae6d-6103-4b4d-8d3f-348477ab055c|2022-06-14 05:34:49.128|1285302  |
|49ecd5b2-c782-482f-b796-b008b9091d8b|2022-06-14 05:34:52.83 |1285306  |
|b6eab34e-9e7d-4365-87ef-36086e18a3a0|2022-06-14 11:00:30.96 |1285489  |
|1697c79d-0180-42bc-89f8-e29d3bb806c7|2022-06-14 08:27:49.169|1285375  |
|6ecf4ffe-a937-4d3e-928e-3edfb09becdd|2022-06-14 08:28:21.978|1285379  |
|cc774a92-ee81-4e41-9228-b636af58e48c|2022-06-14 05:34:07.788|1285261  |
|c26f12ba-8b6a-4eef-a9ff-3051d65f72d2|2022-06-14 11:02:37.454|1285492  |
|e70af180-fd97-48d8-9ab2-d386154f9aad|2022-06-14 08:28:24.475|1285380  |
|7b3be223-05a2-4a77-9136-899eb2fb05d7|2022-06-14 08:31:14.847|1285383  |
|29c9afa0-5aa9-4c4d-972b-542ea1762daa|2022-06-14 08:31:16.055|1285385  |
+------------------------------------+-----------------------+---------+
only showing top 10 rows
```

Reading in hudi format with the following option still returns the value as a string:

```scala
val df = spark.read.option("hoodie.datasource.read.extract.partition.values.from.path", "false").format("org.apache.hudi").load(testPath)
df.printSchema()
spark.read.format("org.apache.hudi").option("hoodie.datasource.read.extract.partition.values.from.path", "false").load(testPath).createOrReplaceTempView("temp2")
spark.sql("select _hoodie_partition_path, eventId, eventTime, companyId from temp2").show(10, false)
```

```
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- eventId: string (nullable = true)
 |-- companyId: long (nullable = true)
 |-- configId: string (nullable = true)
 |-- tenantType: string (nullable = true)
 |-- label: string (nullable = true)
 |-- propertyId: long (nullable = true)
 |-- created: timestamp (nullable = true)
 |-- deleted: timestamp (nullable = true)
 |-- eventTime: string (nullable = true)

+----------------------+------------------------------------+----------+---------+
|_hoodie_partition_path|eventId                             |eventTime |companyId|
+----------------------+------------------------------------+----------+---------+
|2022-08-27            |6ec9a519-40a3-4955-be63-df42627d9898|2022-08-27|600953   |
|2022-08-27            |859c5d1e-f458-44e1-a14f-de2a9db16bc2|2022-08-27|223727   |
|2022-08-27            |797f4c95-c5f3-4034-9c09-d55d0d24f0fc|2022-08-27|730148   |
|2022-08-27            |c02cb2be-1113-44a4-8d60-53ad77873da3|2022-08-27|413799   |
|2022-08-27            |a276685f-22a3-4a21-94fd-5f4242216abd|2022-08-27|824036   |
|2022-08-27            |d1043a46-f3bf-46c3-b3ff-d1f64f2f6829|2022-08-27|647835   |
|2022-08-27            |a8d1f925-f55e-4900-a266-491d496e5f0e|2022-08-27|187089   |
|2022-08-27            |2ee0f31c-5691-4a6b-af46-50652aa8617e|2022-08-27|683024   |
|2022-08-27            |cbb24162-b41d-4c98-baa6-6aae59123753|2022-08-27|780756   |
|2022-08-27            |82e22107-5393-4be0-8d92-6e0f71757f67|2022-08-27|203468   |
+----------------------+------------------------------------+----------+---------+
only showing top 10 rows
```

**Expected behavior**

With the option `hoodie.datasource.read.extract.partition.values.from.path=false`, shouldn't it read `eventTime` from the parquet file instead of from the path value?

**Environment Description**

* Hudi version : 0.11.1
* Spark version : 3.2 (and 3.1)
* Hive version :
* Hadoop version : (3.3)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : no
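For context on why the partition value is day-truncated: the key generator configs above say to interpret the scalar `eventTime` as microseconds since the epoch (UTC) and format it with `yyyy-MM-dd` to build the partition path. Below is a minimal plain-JVM sketch of that conversion, not Hudi code; `toPartitionPath` is a hypothetical helper written for illustration, not a Hudi API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionPathSketch {

    // Mirrors what the SCALAR/microseconds/UTC/yyyy-MM-dd settings describe:
    // treat the value as epoch microseconds and keep only the calendar day.
    static String toPartitionPath(long epochMicros) {
        Instant instant = Instant.ofEpochSecond(
                epochMicros / 1_000_000L,            // whole seconds
                (epochMicros % 1_000_000L) * 1_000L); // remainder as nanos
        return DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(ZoneOffset.UTC)
                .format(instant);
    }

    public static void main(String[] args) {
        // 2022-06-14 05:34:49.128 UTC expressed as epoch microseconds
        long eventTimeMicros = 1655184889128000L;
        System.out.println(toPartitionPath(eventTimeMicros)); // prints 2022-06-14
    }
}
```

The sub-day part of the timestamp is discarded in the formatted output, which is why reading the column back from the path (or from a table whose schema carries the formatted string) can only ever yield the truncated value.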
