[PR] ORC-1700: Write parquet decimal type data in Benchmark using `FIXED_LEN_BYTE_ARRAY` type [orc]

via GitHub Wed, 24 Apr 2024 04:07:24 -0700


cxzl25 opened a new pull request, #1910:
URL: https://github.com/apache/orc/pull/1910


   ### What changes were proposed in this pull request?
   This PR changes the benchmark to write Parquet decimal data using the 
`FIXED_LEN_BYTE_ARRAY` physical type.
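
For readers unfamiliar with the layout: the Parquet format stores a `FIXED_LEN_BYTE_ARRAY` decimal as the unscaled value in big-endian two's complement, padded to the declared length. A minimal sketch of that encoding follows. `FixedDecimalEncoder` and `toFixedBytes` are hypothetical names used only for illustration and are not taken from this PR, which may well rely on a library converter instead.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical helper for illustration only; not the code in this PR.
final class FixedDecimalEncoder {
  // Encodes value as `length` bytes of big-endian two's complement.
  static byte[] toFixedBytes(BigDecimal value, int scale, int length) {
    // setScale without a rounding mode throws ArithmeticException
    // if the value cannot be represented exactly at this scale.
    BigInteger unscaled = value.setScale(scale).unscaledValue();
    // toByteArray returns the minimal big-endian two's-complement form.
    byte[] minimal = unscaled.toByteArray();
    if (minimal.length > length) {
      throw new IllegalArgumentException(
          "value needs " + minimal.length + " bytes, only " + length + " available");
    }
    byte[] out = new byte[length];
    // Sign-extend: pad the leading bytes with 0xFF for negatives, 0x00 otherwise.
    byte pad = (byte) (unscaled.signum() < 0 ? 0xFF : 0x00);
    Arrays.fill(out, 0, length - minimal.length, pad);
    System.arraycopy(minimal, 0, out, length - minimal.length, minimal.length);
    return out;
  }
}
```

For example, `toFixedBytes(new BigDecimal("12.50"), 2, 5)` has the unscaled value 1250 (0x04E2) and yields the bytes `00 00 00 04 E2`.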
   
   ### Why are the changes needed?
   The benchmark currently writes decimal columns with Parquet's `BINARY` 
physical type, which Spark 3.5.1 cannot read.
   Spark 3.5.1 can read decimals written with the `FIXED_LEN_BYTE_ARRAY` type.
   
   Schema on main:
   ```
     optional binary fare_amount (DECIMAL(8,2));
   ```
   
   Schema with this PR:
   ```
     optional fixed_len_byte_array(5) fare_amount (DECIMAL(10,2));
   ```
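
The length 5 in `fixed_len_byte_array(5)` matches the smallest number of bytes that can hold any signed unscaled value of precision 10. The sketch below shows that arithmetic. `DecimalWidth` and `minBytesForPrecision` are hypothetical names, and whether the writer used in this PR derives the length this way is an assumption.

```java
import java.math.BigInteger;

// Hypothetical helper for illustration only; not the code in this PR.
final class DecimalWidth {
  // Smallest byte count whose two's-complement range covers
  // every unscaled value with the given decimal precision.
  static int minBytesForPrecision(int precision) {
    // Largest unscaled magnitude is 10^precision - 1.
    BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
    // bitLength() excludes the sign bit, so add one, then round up to whole bytes.
    return (max.bitLength() + 1 + 7) / 8;
  }

  public static void main(String[] args) {
    System.out.println(minBytesForPrecision(10)); // prints 5
    System.out.println(minBytesForPrecision(8));  // prints 4
  }
}
```

Under this calculation, precision 8 would fit in 4 bytes and precision 10 needs 5.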
   
   ```bash
   java -jar spark/target/orc-benchmarks-spark-2.1.0-SNAPSHOT.jar spark data -format=parquet -compress zstd -data taxi
   ```
   
   ```java
   
   org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException: column: [fare_amount], physicalType: BINARY, logicalType: decimal(8,2)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.constructConvertNotSupportedException(ParquetVectorUpdaterFactory.java:1136)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.getUpdater(ParquetVectorUpdaterFactory.java:199)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:175)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:342)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:233)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.orc.bench.spark.SparkBenchmark.processReader(SparkBenchmark.java:170)
        at org.apache.orc.bench.spark.SparkBenchmark.fullRead(SparkBenchmark.java:216)
        at org.apache.orc.bench.spark.jmh_generated.SparkBenchmark_fullRead_jmhTest.fullRead_avgt_jmhStub(SparkBenchmark_fullRead_jmhTest.java:219)
   ```
   
   
   ### How was this patch tested?
   Tested locally.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


