sunchao commented on pull request #34199: URL: https://github.com/apache/spark/pull/34199#issuecomment-956456431
Thanks @sadikovi. The changes to the schema converter actually don't modify the existing behavior at all. I also added extensive tests to check the behavior of the newly introduced API, which will be used in the complex type read path. I chose to piggy-back on the existing logic in order to remove code duplication and lower the maintenance cost (as you know, there are many different cases to handle for the legacy list & map formats in Parquet).

> push this logic of repetition and definition levels to column readers

Do you mean calculating repetition & definition levels in column readers? There is no way to do that, since column readers (i.e., `VectorizedColumnReader`) only handle primitive columns, while we need this info for both primitive & complex types. Therefore, I think the best place is `SpecificParquetRecordReaderBase`, where we have both the Spark read schema and the Parquet message type. It is also the place where the assembly of the complex columns happens.

cc @viirya and @dongjoon-hyun, could you take another look too? Thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
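For readers following along: the point about needing the full Parquet message type (rather than a single primitive column) can be illustrated with a small sketch. This is *not* Spark's actual implementation; the class `LevelSketch`, its `Repetition` enum, and `maxLevels` are hypothetical names. It shows how the max repetition and definition levels of a column are derived by walking the schema path from the root down to the leaf, which is why the computation fits naturally where the whole schema is in scope (e.g., `SpecificParquetRecordReaderBase`) and not inside a per-primitive reader like `VectorizedColumnReader`.

```java
// Hypothetical sketch (not Spark's actual code): deriving max repetition and
// definition levels for a Parquet column by walking its schema path from the
// root. Each ancestor's repetition kind contributes to the levels.
public class LevelSketch {
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Returns {maxRepetitionLevel, maxDefinitionLevel} for the given path
  // of repetition kinds, ordered from the outermost group to the leaf.
  static int[] maxLevels(Repetition[] path) {
    int rep = 0, def = 0;
    for (Repetition r : path) {
      switch (r) {
        case REQUIRED:                 break; // always present: adds nothing
        case OPTIONAL: def++;          break; // nullable: +1 definition level
        case REPEATED: rep++; def++;   break; // list element: +1 rep, +1 def
      }
    }
    return new int[] { rep, def };
  }

  public static void main(String[] args) {
    // Standard 3-level list encoding of an optional array<int> column:
    //   optional group a -> repeated group list -> optional int element
    int[] levels = maxLevels(new Repetition[] {
        Repetition.OPTIONAL, Repetition.REPEATED, Repetition.OPTIONAL });
    System.out.println(levels[0] + " " + levels[1]); // prints "1 3"
  }
}
```

Note the walk depends only on the schema path, not on column data, so it can be done once up front for every leaf, primitive or nested, as long as the full message type is available.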
