bersprockets opened a new pull request, #42850:
URL: https://github.com/apache/spark/pull/42850
### What changes were proposed in this pull request?
Change `getBytes`/`getShorts`/`getInts`/`getLongs`/`getFloats`/`getDoubles` in
`OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if
present.
### Why are the changes needed?
The following query gets incorrect results:
```
drop table if exists t1;
create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);
select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;
{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}
```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`,
which calls `ColumnarArray#toIntArray`, which in turn calls
`ColumnVector#getInts` on the underlying column vector (either an
`OnHeapColumnVector` or an `OffHeapColumnVector`). In both concrete classes,
`getInts` assumes there is no dictionary and ignores it if one is present
(in fact, it asserts that there is none).
However, in the above example, the column vector associated with the second
array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar \
  org.apache.parquet.tools.Main meta \
  ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element: INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element: INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]
```
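To make the failure mode concrete, here is a minimal, self-contained sketch of a bulk read that ignores a dictionary versus one that decodes through it. `ToyIntVector` and its fields are hypothetical stand-ins, not Spark's actual classes. The sketch assumes that a dictionary-encoded vector keeps per-row dictionary ids in a separate array and leaves the plain value array zero-filled, which is consistent with the `[0,0,0]` output above.

```java
import java.util.Arrays;

public class DictionaryVectorSketch {
    // Hypothetical stand-in for a column vector; NOT Spark's OnHeapColumnVector.
    static final class ToyIntVector {
        final int[] data;          // plain values; assumed zero-filled when a dictionary is in use
        final int[] dictionary;    // dictionary values, or null when the column is plain-encoded
        final int[] dictionaryIds; // per-row index into the dictionary

        ToyIntVector(int[] data, int[] dictionary, int[] dictionaryIds) {
            this.data = data;
            this.dictionary = dictionary;
            this.dictionaryIds = dictionaryIds;
        }

        // Bulk read that assumes there is no dictionary (the behavior described above).
        int[] getIntsIgnoringDictionary(int rowId, int count) {
            int[] out = new int[count];
            System.arraycopy(data, rowId, out, 0, count);
            return out;
        }

        // Bulk read that decodes through the dictionary when one is present.
        int[] getInts(int rowId, int count) {
            int[] out = new int[count];
            if (dictionary == null) {
                System.arraycopy(data, rowId, out, 0, count);
            } else {
                for (int i = 0; i < count; i++) {
                    out[i] = dictionary[dictionaryIds[rowId + i]];
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // Models f2 = array(1, 1, 2): dictionary {1, 2} with row ids {0, 0, 1}.
        ToyIntVector f2 = new ToyIntVector(new int[3], new int[] {1, 2}, new int[] {0, 0, 1});
        System.out.println(Arrays.toString(f2.getIntsIgnoringDictionary(0, 3))); // [0, 0, 0]
        System.out.println(Arrays.toString(f2.getInts(0, 3)));                   // [1, 1, 2]
    }
}
```

The single-value accessors are not affected in this model because they would look up one row at a time; only the bulk copy bypasses the dictionary.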
The same bug also occurs when field f2 is a map. This PR fixes that case as
well.
### Does this PR introduce _any_ user-facing change?
No, except for fixing the correctness issue.
### How was this patch tested?
New tests.
### Was this patch authored or co-authored using generative AI tooling?
No.