[PR] ORC-1697: Fix IllegalArgumentException when reading json timestamp type in benchmark [orc]

via GitHub Mon, 22 Apr 2024 02:00:06 -0700


cxzl25 opened a new pull request, #1902:
URL: https://github.com/apache/orc/pull/1902


   ### What changes were proposed in this pull request?
   This PR aims to fix the `IllegalArgumentException` thrown when reading the JSON 
timestamp type in the benchmark.
   
   ### Why are the changes needed?
   ORC-1191 switched the taxi data set from CSV to Parquet, so the benchmark now 
reads timestamps from Parquet. Those timestamps are stored in microseconds, 
whereas Java's `java.sql.Timestamp` expects milliseconds.
   
   Parquet metadata:
   ```bash
     optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
     optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
   ```
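
The unit mismatch described above can be illustrated with a minimal sketch of converting epoch microseconds into a `java.sql.Timestamp`. This is not the code from this PR; the class name `MicrosTimestampSketch`, the method `fromEpochMicros`, and the sample value are hypothetical and exist only for illustration.

```java
import java.sql.Timestamp;

// Hypothetical helper, not part of the ORC benchmark code.
final class MicrosTimestampSketch {

  // Converts microseconds since the Unix epoch into a java.sql.Timestamp.
  static Timestamp fromEpochMicros(long micros) {
    // floorDiv/floorMod keep the sub-second part non-negative for pre-1970 values.
    long seconds = Math.floorDiv(micros, 1_000_000L);
    int microsOfSecond = (int) Math.floorMod(micros, 1_000_000L);

    // The Timestamp constructor takes milliseconds since the epoch.
    Timestamp ts = new Timestamp(seconds * 1000L);
    // setNanos stores the whole fractional second, in nanoseconds.
    ts.setNanos(microsOfSecond * 1000);
    return ts;
  }

  public static void main(String[] args) {
    // Assumed sample value: 2015-10-26T00:00:00.123456Z in microseconds.
    long micros = 1_445_817_600_123_456L;
    Timestamp ts = fromEpochMicros(micros);
    System.out.println(ts.getTime());   // milliseconds since the epoch
    System.out.println(ts.getNanos());  // fractional second in nanoseconds
  }
}
```

Splitting the value into whole seconds and a sub-second remainder keeps the microsecond precision, which a plain division by 1000 into milliseconds would drop.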
   
   
   When the data is written to JSON and then read back with the scan command, 
the following error occurs.
   ```bash
   java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
   ```
   
   ```
   Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
        at java.sql/java.sql.Timestamp.valueOf(Timestamp.java:224)
        at org.apache.orc.bench.core.convert.json.JsonReader$TimestampColumnConverter.convert(JsonReader.java:175)
        at org.apache.orc.bench.core.convert.json.JsonReader.nextBatch(JsonReader.java:86)
        at org.apache.orc.bench.core.convert.ScanVariants.run(ScanVariants.java:92)
        at org.apache.orc.bench.core.Driver.main(Driver.java:64)
   ```
   
   
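   If we dump the metadata of the generated ORC file with orc-tools, the 
timestamp data is also incorrect: the years are far in the future, which is 
consistent with microsecond values being interpreted as milliseconds.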
   ```
       Column 2: count: 2053120 hasNull: false bytesOnDisk: 8113763 min: 47802-07-26 08:00:00.0 max: 47817-09-26 23:43:20.0
       Column 3: count: 2053120 hasNull: false bytesOnDisk: 8461151 min: 47802-07-26 08:00:00.0 max: 48731-09-12 15:43:20.0
   ```
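
As a rough illustration of why the dumped years land tens of thousands of years in the future, the sketch below reads one microsecond value first as milliseconds and then as microseconds. The class `MisreadMicrosSketch`, its methods, and the sample value are hypothetical; the value is not taken from the taxi data.

```java
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical illustration, not part of the ORC benchmark code.
final class MisreadMicrosSketch {

  // Year obtained when a microsecond value is wrongly treated as milliseconds.
  static int yearIfReadAsMillis(long epochMicros) {
    return Instant.ofEpochMilli(epochMicros).atZone(ZoneOffset.UTC).getYear();
  }

  // Year obtained when the value is correctly treated as microseconds.
  static int yearIfReadAsMicros(long epochMicros) {
    long seconds = Math.floorDiv(epochMicros, 1_000_000L);
    long nanos = Math.floorMod(epochMicros, 1_000_000L) * 1_000L;
    return Instant.ofEpochSecond(seconds, nanos).atZone(ZoneOffset.UTC).getYear();
  }

  public static void main(String[] args) {
    // Assumed sample value: 2015-10-26T00:00:00Z expressed in microseconds.
    long epochMicros = 1_445_817_600_000_000L;
    System.out.println("read as millis: year " + yearIfReadAsMillis(epochMicros));
    System.out.println("read as micros: year " + yearIfReadAsMicros(epochMicros));
  }
}
```

Reading the value as milliseconds inflates the offset from 1970 by a factor of 1000, which is the same order of magnitude as the years shown in the dump above.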
   
   
   
   ### How was this patch tested?
   Tested locally by running:
   
   ```bash
   java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
   ```
   
   Output:
   
   ```
   data/generated/taxi/json.snappy rows: 22758236 batches: 22225
   data/generated/taxi/json.gz rows: 22758236 batches: 22225
   data/generated/sales/json.snappy rows: 25000000 batches: 24415
   data/generated/sales/json.gz rows: 25000000 batches: 24415
   data/generated/github/json.snappy rows: 10489642 batches: 10244
   data/generated/github/json.gz rows: 10489642 batches: 10244
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
