kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716
Hi @nsivabalan, I see that Danny assigned you to this ticket.
I was able to replicate this exact case; the previous repro did not quite
expose the issue. I'll update the repo soon.
Here's what I found, plus some additional info.
First, I'm wondering why the stack trace shows:
```
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at
org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
```
I don't have any Long/BigInt in the table schema or in the incoming schema;
all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's
default).
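For context, here is a hedged sketch (not actual Hudi/Parquet code) of why the exception message names `PlainLongDictionary`: in Parquet, the abstract `Dictionary` base class fails for every `decodeTo*` call, and each concrete dictionary overrides only the method matching its physical type. So this exception means something asked a long-typed dictionary page for binary (string) values. A minimal Python mimic of that dispatch pattern:

```python
class Dictionary:
    """Mimics org.apache.parquet.column.Dictionary: every decode
    method fails unless a subclass overrides it for its type."""

    def decode_to_binary(self, dict_id):
        raise NotImplementedError(type(self).__name__)

    def decode_to_long(self, dict_id):
        raise NotImplementedError(type(self).__name__)


class PlainLongDictionary(Dictionary):
    """Stores longs, so only decode_to_long is overridden."""

    def __init__(self, values):
        self._values = values

    def decode_to_long(self, dict_id):
        return self._values[dict_id]


d = PlainLongDictionary([10, 20, 30])
print(d.decode_to_long(1))  # works: 20
try:
    # A reader that expects a string column at this position
    # would call decode_to_binary and hit the base-class failure.
    d.decode_to_binary(1)
except NotImplementedError as e:
    print("unsupported:", e)
```

So the stack trace points at a type-level mismatch between what the reader expects at some column position and what the file actually stores there, regardless of what the logical table schema says.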
The schema of the incoming batch should look like this:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
partitionCol1 int
partitionCol2 int
col6 String
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
```
Schema of clustered parquet files:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
col6 String // earlier this position held partitionCol1 int; it tries to read an Int but instead needs to read a String? idk
partitionCol2 int
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
partitionCol1 int
```
The schema in the replacecommit conforms to the incoming batch schema /
table schema (it is correct).
I don't know whether Hudi resolves columns by position or by name, and
whether that matters when reading a parquet file for merging.
If it is by position, then at the position of col6 (String), which
previously held partitionCol1 (int), Hudi would try to read the column as
Int when it actually needs to read a String?
That would explain the failure, since PlainLongDictionary has no
decodeToBinary implementation.
Idk if this makes any sense at all, but that's my intuition.
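The by-position mismatch can be checked mechanically. A small Python sketch, using abbreviated prefixes of the two schemas above (an assumption: only the leading columns matter, since that is where the layouts diverge), zips the incoming schema against the clustered file schema and reports the first position where name or type disagrees:

```python
# Abbreviated leading columns of the incoming batch schema and the
# clustered parquet file schema, as listed above.
incoming = [("col1", "string"), ("col2", "string"), ("col3", "int"),
            ("col4", "int"), ("col5", "timestamp"),
            ("partitionCol1", "int"), ("partitionCol2", "int"),
            ("col6", "string")]
clustered = [("col1", "string"), ("col2", "string"), ("col3", "int"),
             ("col4", "int"), ("col5", "timestamp"),
             ("col6", "string"), ("partitionCol2", "int"),
             ("col7", "timestamp")]


def first_positional_mismatch(a, b):
    """Return (index, expected, found) for the first position where
    the (name, type) pairs differ, or None if the prefixes agree."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    return None


print(first_positional_mismatch(incoming, clustered))
# -> (5, ('partitionCol1', 'int'), ('col6', 'string'))
```

At index 5 a positional reader expecting `partitionCol1 int` would land on `col6 string` in the clustered file, which matches the int-vs-string confusion in the stack trace.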
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]