zeroshade commented on issue #622:
URL: https://github.com/apache/arrow-go/issues/622#issuecomment-3822818048

   I dug into this a bit more, @qiushido. Can you confirm that when this 
issue occurs, the uncompressed data for a single string column in a batch you 
write exceeds 2GB? If so, that matches what I've found, and I may have a way 
forward for you. 
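
   One quick way to check is to sum the uncompressed byte length of the string column in each record batch before writing it. The snippet below is only a minimal sketch: it assumes the `github.com/apache/arrow-go/v18` module path, a plain `string` (not `large_string`) column, and the helper names (`stringColumnBytes`, `exceedsPageLimit`) are hypothetical.

```go
package sizecheck

import (
	"math"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
)

// stringColumnBytes sums the uncompressed byte length of every value in a
// string column -- roughly what the writer needs to place into data pages
// for that column when the batch is written.
func stringColumnBytes(rec arrow.Record, col int) int64 {
	arr := rec.Column(col).(*array.String)
	var total int64
	for i := 0; i < arr.Len(); i++ {
		if arr.IsValid(i) {
			total += int64(len(arr.Value(i)))
		}
	}
	return total
}

// exceedsPageLimit reports whether the column's uncompressed size is larger
// than what an int32 can hold (the ~2GB limit from the Parquet format).
func exceedsPageLimit(rec arrow.Record, col int) bool {
	return stringColumnBytes(rec, col) > math.MaxInt32
}
```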
   
   It looks like the primary issue is that the Parquet spec doesn't allow a 
DataPage's uncompressed size to be larger than what fits in an int32 (i.e. 
~2GB). You wouldn't want a single page to be that large anyway, but if you 
write, say, 50 rows that each contain a 50MB string, you'd easily exceed it. 
In the current implementation of arrow-go's Parquet writer, we only check the 
size of the page *after* a batch is written. That means if you write multiple 
rows at once as a single batch whose total uncompressed size exceeds 2GB, we 
fail outright rather than properly splitting the values into separate pages. 
That's the bug.
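
   Until that's fixed, one possible interim workaround is to slice each record into smaller row ranges before handing it to `pqarrow.FileWriter.Write`, so no single call carries more than the int32 limit's worth of uncompressed string data. Again, this is only a sketch under the same assumptions (v18 module path; `writeInChunks` and `maxChunkBytes` are hypothetical names), not a definitive fix.

```go
package chunkwrite

import (
	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/parquet/pqarrow"
)

// maxChunkBytes caps the uncompressed string bytes per Write call; 1GB is an
// arbitrary conservative choice, not a value from the spec.
const maxChunkBytes = int64(1) << 30

// writeInChunks slices rec into row ranges whose string data stays under
// maxChunkBytes and writes each slice separately, so no single call hands the
// writer more uncompressed data than fits in an int32-sized page.
// Note: each Write call may end up in its own row group; WriteBuffered could
// be used instead if you want the chunks to share a row group.
func writeInChunks(fw *pqarrow.FileWriter, rec arrow.Record, stringCol int) error {
	strs := rec.Column(stringCol).(*array.String)

	start := int64(0)
	var acc int64
	for i := 0; i < strs.Len(); i++ {
		if strs.IsValid(i) {
			acc += int64(len(strs.Value(i)))
		}
		if acc < maxChunkBytes {
			continue
		}
		// Flush rows [start, i] as their own slice.
		chunk := rec.NewSlice(start, int64(i)+1)
		err := fw.Write(chunk)
		chunk.Release()
		if err != nil {
			return err
		}
		start = int64(i) + 1
		acc = 0
	}
	if start < rec.NumRows() {
		chunk := rec.NewSlice(start, rec.NumRows())
		defer chunk.Release()
		return fw.Write(chunk)
	}
	return nil
}
```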
   
   While this is something I'll need to fix regardless, I'd like to verify 
that it's the source of the issue you're running into by checking the size of 
the data you're writing. Can you confirm this for me?

