cxzl25 opened a new pull request, #1902:
URL: https://github.com/apache/orc/pull/1902
### What changes were proposed in this pull request?
This PR aims to fix an `IllegalArgumentException` thrown when reading the JSON
timestamp type in the benchmark.
### Why are the changes needed?
ORC-1191 switched the taxi data set from CSV to Parquet, so the benchmark now
reads its timestamps from Parquet. Those timestamps are stored in microseconds,
which differs from the millisecond values that Java's `java.sql.Timestamp` uses.
parquet meta
```bash
optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
```
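For illustration only, a minimal sketch of converting a Parquet `TIMESTAMP(MICROS)` value into a `java.sql.Timestamp`. The class `MicrosToTimestamp` and the helper `fromMicros` are hypothetical names and are not taken from this patch; the sample value is made up.

```java
import java.sql.Timestamp;

public class MicrosToTimestamp {
    // Hypothetical helper (not from the patch): split epoch microseconds into
    // whole seconds and a nanosecond fraction so no precision is lost.
    public static Timestamp fromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        int nanos = (int) Math.floorMod(micros, 1_000_000L) * 1_000;
        // The constructor takes epoch milliseconds; pass whole seconds only.
        Timestamp ts = new Timestamp(seconds * 1_000L);
        // The sub-second part is set separately, in nanoseconds.
        ts.setNanos(nanos);
        return ts;
    }

    public static void main(String[] args) {
        // Made-up sample: 2016-01-01T00:00:00.123456Z in epoch microseconds.
        long micros = 1_451_606_400_123_456L;
        Timestamp ts = fromMicros(micros);
        System.out.println(ts.getTime() + " ms, nanos=" + ts.getNanos());
    }
}
```

Using `Math.floorDiv` and `Math.floorMod` keeps the fraction non-negative for values before the epoch, which `setNanos` requires.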
When we write the data to JSON and then run the scan command, we get the
following error.
```bash
java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
```
```
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
	at java.sql/java.sql.Timestamp.valueOf(Timestamp.java:224)
	at org.apache.orc.bench.core.convert.json.JsonReader$TimestampColumnConverter.convert(JsonReader.java:175)
	at org.apache.orc.bench.core.convert.json.JsonReader.nextBatch(JsonReader.java:86)
	at org.apache.orc.bench.core.convert.ScanVariants.run(ScanVariants.java:92)
	at org.apache.orc.bench.core.Driver.main(Driver.java:64)
```
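One plausible trigger for this exception, stated here as an assumption rather than something confirmed by the trace: `Timestamp.valueOf` accepts only four-digit years, so a timestamp string with a five-digit year would be rejected. The sketch below uses a hypothetical class `FiveDigitYearDemo` and made-up sample strings to check that behaviour.

```java
import java.sql.Timestamp;

public class FiveDigitYearDemo {
    // Hypothetical helper: reports whether Timestamp.valueOf accepts the text.
    public static boolean parses(String text) {
        try {
            Timestamp.valueOf(text);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when the text is not yyyy-mm-dd hh:mm:ss[.fffffffff].
            return false;
        }
    }

    public static void main(String[] args) {
        // A four-digit year in the documented format.
        System.out.println(parses("2015-11-01 00:00:00.0"));
        // A five-digit year, as a misread microsecond value could produce.
        System.out.println(parses("47802-07-26 08:00:00.0"));
    }
}
```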
If we use orc-tools to dump the generated ORC file metadata, the timestamp
data is also incorrect.
```
Column 2: count: 2053120 hasNull: false bytesOnDisk: 8113763 min: 47802-07-26 08:00:00.0 max: 47817-09-26 23:43:20.0
Column 3: count: 2053120 hasNull: false bytesOnDisk: 8461151 min: 47802-07-26 08:00:00.0 max: 48731-09-12 15:43:20.0
```
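Years this far in the future are what one would expect if a microsecond value is interpreted as milliseconds, since the result is 1000 times too large. A small sketch of that arithmetic, using a hypothetical class `UnitMixupDemo` and a made-up sample value (not taken from the taxi data):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class UnitMixupDemo {
    // Year (UTC) obtained when the raw value is wrongly treated as milliseconds.
    public static int yearIfReadAsMillis(long micros) {
        return Instant.ofEpochMilli(micros).atZone(ZoneOffset.UTC).getYear();
    }

    // Year (UTC) obtained after scaling microseconds down to milliseconds.
    public static int yearIfConverted(long micros) {
        long millis = Math.floorDiv(micros, 1_000L);
        return Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        // Made-up sample: 2016-01-01T00:00:00Z in epoch microseconds.
        long micros = 1_451_606_400_000_000L;
        System.out.println("read as millis: year " + yearIfReadAsMillis(micros));
        System.out.println("converted:      year " + yearIfConverted(micros));
    }
}
```

The first call lands tens of thousands of years in the future, in the same range as the `min` and `max` statistics above, while the second yields the intended year.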
### How was this patch tested?
local test
```bash
java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
```
output
```
data/generated/taxi/json.snappy rows: 22758236 batches: 22225
data/generated/taxi/json.gz rows: 22758236 batches: 22225
data/generated/sales/json.snappy rows: 25000000 batches: 24415
data/generated/sales/json.gz rows: 25000000 batches: 24415
data/generated/github/json.snappy rows: 10489642 batches: 10244
data/generated/github/json.gz rows: 10489642 batches: 10244
```
### Was this patch authored or co-authored using generative AI tooling?
No