eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844147358


   > @eadwright, what do you mean by "necessary for the rows to spill over into 
a second row group."? It shall not be possible. Even the pages keep row 
boundaries but for row groups it is required by the specification.
   
   Sorry, I probably didn't phrase that well. I mean that, for this bug to occur, 
you need i) a row group that takes more than 2GB of space, so that its size 
overflows a signed 32-bit int (maximum 2^31 - 1), and ii) a subsequent row group 
(of any size), so that the buggy code adds the corrupted value to a file offset.
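
   To make the arithmetic concrete, here is a minimal sketch of the failure 
mode described above. It is not the parquet-mr code: the helper names 
(`to_int32`, `next_offset`), the starting offset of 4 and the 3.3GB row group 
size are assumptions chosen for illustration. It only shows what happens when 
a size above 2^31 - 1 is narrowed to a signed 32-bit int before being added to 
a file offset.

```python
def to_int32(value: int) -> int:
    """Wrap an integer to signed 32 bits, like a narrowing cast to a Java int."""
    return (value + 2**31) % 2**32 - 2**31


def next_offset(start: int, row_group_bytes: int, narrow: bool) -> int:
    """Offset of the following row group (hypothetical helper).

    With narrow=True the size goes through a 32-bit int first, which is the
    kind of corruption described in this comment.
    """
    size = to_int32(row_group_bytes) if narrow else row_group_bytes
    return start + size


# Assumed values for illustration only.
start = 4                        # hypothetical offset of the first row group
row_group_bytes = 3_300_000_000  # a row group above the 2GB limit

correct = next_offset(start, row_group_bytes, narrow=False)
corrupted = next_offset(start, row_group_bytes, narrow=True)

print("correct offset:  ", correct)
print("corrupted offset:", corrupted)
```

   With a single row group there is no following offset to compute, so the 
wrapped value is never used; it only matters once a second row group has to be 
located.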
   
   In the example Python code, if I ask it to produce 50M rows instead of 
75M, you get a ~3.3GB row group but no second row group. The file offset 
addition code path is not executed, so the file is read correctly and the bug 
is not triggered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

