Re: [PR] [SPARK-56892][SQL] Bulk read optimization for Parquet DELTA_BINARY_PACKED decoding [spark]

via GitHub Tue, 16 Jun 2026 00:22:41 -0700


LuciferYang commented on PR #55919:
URL: https://github.com/apache/spark/pull/55919#issuecomment-4715989406


   @iemejia The test failures seem to be related to the current PR. I will revert 
this change and reopen the PR first. We can fix the issue and merge it again 
afterward: 
   - https://github.com/apache/spark/actions/runs/27590688058/job/81572596324
   
   ```
    parquet widening conversion IntegerType -> LongType: org.apache.spark.sql.execution.datasources.parquet.ParquetTypeWideningSuite
   org.scalatest.exceptions.TestFailedException: with dictionary encoding 'false' with timestamp rebase mode 'CORRECTED'' Vectorized reader 
   Results do not match for query:
   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=311,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
   Timezone Env: 
   
   == Parsed Logical Plan ==
   UnresolvedDataSource format: parquet, isStreaming: false, paths: 1 provided
   
   == Analyzed Logical Plan ==
   a: bigint
   Relation [a#1708954L] parquet
   
   == Optimized Logical Plan ==
   Relation [a#1708954L] parquet
   
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet [a#1708954L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/runner/work/spark/spark/target/tmp/spark-1407ec22-ddd8-4a7c..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint>
   
   == Results ==
   
   == Results ==
   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
    struct<a:bigint>           struct<a:bigint>
   ![-2147483648]              [1]
   ![1]                        [2147483648]
    [2]                        [2]
       
          
   sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: with dictionary encoding 'false' with timestamp rebase mode 'CORRECTED'' Vectorized reader 
   Results do not match for query:
   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=311,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
   Timezone Env:
   
   
   == Parsed Logical Plan ==
   UnresolvedDataSource format: parquet, isStreaming: false, paths: 1 provided
   
   
   == Analyzed Logical Plan ==
   a: bigint
   Relation [a#1708954L] parquet
   
   
   == Optimized Logical Plan ==
   Relation [a#1708954L] parquet
   
   
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet [a#1708954L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/runner/work/spark/spark/target/tmp/spark-1407ec22-ddd8-4a7c..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint>
   
   
   == Results ==
   
   
   == Results ==
   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
   struct<a:bigint>           struct<a:bigint>
   ![-2147483648]              [1]
   ![1]                        [2147483648]
   [2]                        [2]
   ```


