emkornfield commented on issue #39676: URL: https://github.com/apache/arrow/issues/39676#issuecomment-1897645893
Thousands of row groups is an anti-pattern for laying out data (I understand some customers do it), and it creates exactly this type of performance bottleneck. Sometimes this is out of our control, but we should audit the write config parameters to make sure nothing there is causing this type of spilling. And yes, in general, Parquet is not well suited to very large column widths.

I think there is a better solution here, but given that this touches metadata serialization, I'm not sure how much appetite there will be for trying to incorporate metadata that parses faster. In any case, format changes need to be discussed on the Parquet mailing list [email protected]
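To make the row-group concern concrete, here is a rough back-of-the-envelope sketch. It relies on the fact that a Parquet footer stores one column-chunk metadata entry per column in every row group, and it reads "large column widths" as a large number of columns, which is an assumption. The helper names and the file dimensions are hypothetical and are not taken from the issue.

```python
def row_groups_for(total_rows: int, rows_per_group: int) -> int:
    """Row groups needed to hold total_rows at a fixed rows_per_group."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    # Ceiling division: a partially filled last row group still counts.
    return -(-total_rows // rows_per_group)


def footer_column_chunks(num_row_groups: int, num_columns: int) -> int:
    """Column-chunk metadata entries the footer has to describe."""
    # One entry per column in every row group.
    return num_row_groups * num_columns


# Hypothetical file: 100 million rows, 5,000 columns.
total_rows, num_columns = 100_000_000, 5_000
for rows_per_group in (10_000, 1_000_000):
    groups = row_groups_for(total_rows, rows_per_group)
    entries = footer_column_chunks(groups, num_columns)
    print(f"{rows_per_group=} {groups=} {entries=}")
```

The entry count grows with the product of row groups and columns, so a writer setting that produces small row groups multiplies the footer that every reader has to parse. That is the kind of thing an audit of the write config would look for.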
