mapleFU commented on issue #38552: URL: https://github.com/apache/arrow/issues/38552#issuecomment-1817785040
Sorry for the late reply.

> Increasing the thrift_..._size_limit options to the maximum value solves the problem but makes the memory usage blow up and the garbage collector doesn't collect that memory after reading.

I think this is because your metadata is too large, and it's separate from https://github.com/apache/arrow/issues/38245. Parquet metadata is a Thrift binary. Would you mind printing the fileMetadata and its size? On the data side, I suspect you have too many columns even though the data itself isn't very large, so deserializing the Parquet file ends up dominated by deserializing the metadata. CSV has no such metadata, so it's much easier to parse. Disabling statistics might help. @Hugo-loio
