sunchao commented on pull request #34199: URL: https://github.com/apache/spark/pull/34199#issuecomment-956456431
Thanks @sadikovi. The changes to the schema converter actually don't modify the existing behavior at all. I also added extensive tests to check the behavior of the newly introduced API, which will be used in the complex type read path. I chose to piggy-back on the existing logic in order to remove code duplication and lower the maintenance cost (as you know, there are many different cases to handle for the legacy list & map formats in Parquet).

> push this logic of repetition and definition levels to column readers

Do you mean calculating repetition & definition levels in column readers? There is no way to do that, since column readers (i.e., `VectorizedColumnReader`) only handle primitive columns, while we need this info for both primitive & complex types. Therefore, I think the best place is `SpecificParquetRecordReaderBase`, where we have both the Spark read schema and the Parquet message type. It is also the place where the assembly of the complex columns happens.

cc @viirya and @dongjoon-hyun, could you take another look too? Thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
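For readers following along: the point about needing the full Parquet message type (rather than a single primitive column) can be illustrated with a small sketch. This is *not* Spark's actual implementation; the class `LevelSketch`, its `Repetition` enum, and `maxLevels` are hypothetical names. It shows how the max repetition and definition levels of a column are derived by walking the schema path from the root down to the leaf, which is why the computation fits naturally where the whole schema is in scope (e.g., `SpecificParquetRecordReaderBase`) and not inside a per-primitive reader like `VectorizedColumnReader`.

```java
// Hypothetical sketch (not Spark's actual code): deriving max repetition and
// definition levels for a Parquet column by walking its schema path from the
// root. Each ancestor's repetition kind contributes to the levels.
public class LevelSketch {
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Returns {maxRepetitionLevel, maxDefinitionLevel} for the given path
  // of repetition kinds, ordered from the outermost group to the leaf.
  static int[] maxLevels(Repetition[] path) {
    int rep = 0, def = 0;
    for (Repetition r : path) {
      switch (r) {
        case REQUIRED:                 break; // always present: adds nothing
        case OPTIONAL: def++;          break; // nullable: +1 definition level
        case REPEATED: rep++; def++;   break; // list element: +1 rep, +1 def
      }
    }
    return new int[] { rep, def };
  }

  public static void main(String[] args) {
    // Standard 3-level list encoding of an optional array<int> column:
    //   optional group a -> repeated group list -> optional int element
    int[] levels = maxLevels(new Repetition[] {
        Repetition.OPTIONAL, Repetition.REPEATED, Repetition.OPTIONAL });
    System.out.println(levels[0] + " " + levels[1]); // prints "1 3"
  }
}
```

Note the walk depends only on the schema path, not on column data, so it can be done once up front for every leaf, primitive or nested, as long as the full message type is available.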
