kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716
Hi @nsivabalan, I see that Danny assigned you to this ticket.
I was able to replicate this exact case; the previous repro did not quite
expose the issue. I'll update the repo soon.
Here's what I found, plus some additional info.
First, I'm wondering why the stack trace shows:
```
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at
org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
```
I don't have any Long/BigInt in the table schema or in the incoming schema;
all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's
default).
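For context, here is a hedged sketch (not actual Hudi/Parquet code) of why the exception message names `PlainLongDictionary`: in Parquet, the abstract `Dictionary` base class fails for every `decodeTo*` call, and each concrete dictionary overrides only the method matching its physical type. So this exception means something asked a long-typed dictionary page for binary (string) values. A minimal Python mimic of that dispatch pattern:

```python
class Dictionary:
    """Mimics org.apache.parquet.column.Dictionary: every decode
    method fails unless a subclass overrides it for its type."""

    def decode_to_binary(self, dict_id):
        raise NotImplementedError(type(self).__name__)

    def decode_to_long(self, dict_id):
        raise NotImplementedError(type(self).__name__)


class PlainLongDictionary(Dictionary):
    """Stores longs, so only decode_to_long is overridden."""

    def __init__(self, values):
        self._values = values

    def decode_to_long(self, dict_id):
        return self._values[dict_id]


d = PlainLongDictionary([10, 20, 30])
print(d.decode_to_long(1))  # works: 20
try:
    # A reader that expects a string column at this position
    # would call decode_to_binary and hit the base-class failure.
    d.decode_to_binary(1)
except NotImplementedError as e:
    print("unsupported:", e)
```

So the stack trace points at a type-level mismatch between what the reader expects at some column position and what the file actually stores there, regardless of what the logical table schema says.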
The schema of the incoming batch should look like this:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
partitionCol1 int
partitionCol2 int
col6 String
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
```
Schema of clustered parquet files:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
col6 String // earlier this position held partitionCol1 int; it tries to read an Int but instead needs to read a String? idk
partitionCol2 int
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
partitionCol1 int
```
The schema in the replacecommit conforms to the incoming batch schema /
table schema (it is correct).
I don't know whether Hudi resolves columns by position or by name, and
whether that matters when reading a parquet file for merging.
If it is by position, then at the position of col6 (String), which
previously held partitionCol1 (int), Hudi would try to read the column as
Int when it actually needs to read a String?
That would explain the failure, since PlainLongDictionary has no
decodeToBinary implementation.
Idk if this makes any sense at all, but that's my intuition.
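The by-position mismatch can be checked mechanically. A small Python sketch, using abbreviated prefixes of the two schemas above (an assumption: only the leading columns matter, since that is where the layouts diverge), zips the incoming schema against the clustered file schema and reports the first position where name or type disagrees:

```python
# Abbreviated leading columns of the incoming batch schema and the
# clustered parquet file schema, as listed above.
incoming = [("col1", "string"), ("col2", "string"), ("col3", "int"),
            ("col4", "int"), ("col5", "timestamp"),
            ("partitionCol1", "int"), ("partitionCol2", "int"),
            ("col6", "string")]
clustered = [("col1", "string"), ("col2", "string"), ("col3", "int"),
             ("col4", "int"), ("col5", "timestamp"),
             ("col6", "string"), ("partitionCol2", "int"),
             ("col7", "timestamp")]


def first_positional_mismatch(a, b):
    """Return (index, expected, found) for the first position where
    the (name, type) pairs differ, or None if the prefixes agree."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


print(first_positional_mismatch(incoming, clustered))
# -> (5, ('partitionCol1', 'int'), ('col6', 'string'))
```

At index 5 a positional reader expecting `partitionCol1 int` would land on `col6 string` in the clustered file, which matches the int-vs-string confusion in the stack trace.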
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]