zeroshade commented on issue #622: URL: https://github.com/apache/arrow-go/issues/622#issuecomment-3822818048
I dug into this a bit more @qiushido. Can you confirm that when this issue occurs, it corresponds with the uncompressed data for a single string column exceeding 2GB in a batch that you write? If so, it matches what I've found and I might have a way forward for you.

It looks like the primary issue is that the Parquet spec doesn't allow a DataPage's uncompressed size to be larger than what fits in an int32 (i.e. ~2GB). You wouldn't want a single page to be that large anyway, but if you're writing, say, 50 rows that each contain a 50MB string, you'd easily exceed that.

In the current implementation of arrow-go's Parquet writer, we only check the size of the page *after* a batch is written. That means if you write multiple rows at once as a batch and the total uncompressed size of that single batch is larger than 2GB, we fail outright rather than properly splitting the values into separate pages, which is the bug.

While this is something I'll need to fix regardless, I want to confirm it's the source of the issue you're running into by checking the size of the data you're writing. Can you confirm this for me?
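For concreteness, here's a rough, untested sketch of the kind of write I think triggers this, plus a possible interim workaround of slicing the record before writing. It assumes a `LargeString` column and the `v18` module path; the file name, chunk size, and builder setup are just for illustration, not taken from your code.

```go
package main

import (
	"log"
	"os"
	"strings"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

func main() {
	mem := memory.DefaultAllocator
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "payload", Type: arrow.BinaryTypes.LargeString},
	}, nil)

	// Build a single record batch with 50 rows of ~50MB strings:
	// roughly 2.5GB of uncompressed data in one string column.
	bldr := array.NewRecordBuilder(mem, schema)
	defer bldr.Release()
	big := strings.Repeat("x", 50*1024*1024)
	for i := 0; i < 50; i++ {
		bldr.Field(0).(*array.LargeStringBuilder).Append(big)
	}
	rec := bldr.NewRecord()
	defer rec.Release()

	f, err := os.Create("big.parquet") // illustrative output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	props := parquet.NewWriterProperties(parquet.WithAllocator(mem))
	arrProps := pqarrow.NewArrowWriterProperties(pqarrow.WithAllocator(mem))
	w, err := pqarrow.NewFileWriter(schema, f, props, arrProps)
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Failing case: the page size check only runs after the batch has been
	// written, so the accumulated page exceeds the int32 limit.
	//
	//   err = w.Write(rec) // fails when the column's uncompressed data > ~2GB
	//
	// Possible interim workaround: slice the record so that no single Write
	// call pushes a column's uncompressed page data past ~2GB.
	const rowsPerChunk = 10 // arbitrary; keeps each chunk around 500MB here
	for start := int64(0); start < rec.NumRows(); start += rowsPerChunk {
		end := start + rowsPerChunk
		if end > rec.NumRows() {
			end = rec.NumRows()
		}
		chunk := rec.NewSlice(start, end)
		if err := w.Write(chunk); err != nil {
			log.Fatal(err)
		}
		chunk.Release()
	}
}
```

The idea behind the workaround is just that the post-batch page size check gets a chance to flush pages while they're still well under the ~2GB limit, since each `Write` call only adds a few hundred MB. I haven't verified this end to end, so treat it as a sketch rather than a guaranteed fix.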
