[PR] GH-3249: Fix incorrect Bloom filter data when reading from ByteArrayInputStream by using readFully() [parquet-java]

via GitHub Thu, 10 Jul 2025 19:57:37 -0700


wangyum opened a new pull request, #3250:
URL: https://github.com/apache/parquet-java/pull/3250


   ### Rationale for this change
   
   When reading Bloom filter data from files written by older versions 
(< Parquet 1.13), the code uses `in.read(bitset)` to read the bitset data. 
However, `InputStream.read(byte[])` does not guarantee that all requested 
bytes are read in a single call: it may read fewer bytes than the buffer 
size, leaving the rest of the buffer at its default zero values.
   
   This can lead to incorrect Bloom filter behavior as parts of the bitset 
might be missing or contain zeros instead of the actual data.
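
   The short-read behavior described above can be reproduced in isolation. The following is a minimal sketch, not code from the PR: `ShortReadDemo` and `ChunkedInputStream` are hypothetical names, and `DataInputStream.readFully` stands in for the `readFully` method of the Parquet file stream.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ShortReadDemo {

  /** Hypothetical stream that hands out at most maxChunk bytes per read call. */
  static final class ChunkedInputStream extends InputStream {
    private final InputStream delegate;
    private final int maxChunk;

    ChunkedInputStream(byte[] data, int maxChunk) {
      this.delegate = new ByteArrayInputStream(data);
      this.maxChunk = maxChunk;
    }

    @Override
    public int read() throws IOException {
      return delegate.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      // Cap every bulk read to simulate a stream that returns short reads.
      return delegate.read(b, off, Math.min(len, maxChunk));
    }
  }

  /** Single read call with the return value ignored (the problematic pattern). */
  static byte[] naiveRead(InputStream in, int numBytes) {
    byte[] bitset = new byte[numBytes];
    try {
      in.read(bitset); // may fill only a prefix; the rest stays zero
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bitset;
  }

  /** readFully keeps reading until the array is full, or throws EOFException. */
  static byte[] fullRead(InputStream in, int numBytes) {
    byte[] bitset = new byte[numBytes];
    try {
      new DataInputStream(in).readFully(bitset);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bitset;
  }

  public static void main(String[] args) {
    byte[] data = new byte[16];
    Arrays.fill(data, (byte) 0xFF);
    byte[] naive = naiveRead(new ChunkedInputStream(data, 4), 16);
    byte[] full = fullRead(new ChunkedInputStream(data, 4), 16);
    System.out.println("naive matches source: " + Arrays.equals(naive, data));
    System.out.println("full matches source:  " + Arrays.equals(full, data));
  }
}
```

   With a 4-byte cap per call, the naive read fills only the first 4 of 16 bytes, so the two arrays differ, while the `readFully` variant reproduces the source exactly.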
   
   ### What changes are included in this PR?
   
   This PR modifies the logic to properly ensure all bytes are read from the 
input stream:
   
   - For older file versions (negative `bloomFilterLength`), we now use 
`f.readFully(bitset)`.
   - For newer file versions (positive `bloomFilterLength`), we still use 
`in.read(bitset)`.
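
   The version-dependent read described above might look roughly like the sketch below. It is illustrative only and is not the PR's diff: `BloomBitsetReadSketch` and `readBitset` are hypothetical names, `DataInputStream` stands in for the file stream `f`, and `in` is assumed to be an in-memory `ByteArrayInputStream` that already holds the whole filter when `bloomFilterLength` is positive.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BloomBitsetReadSketch {

  /**
   * Hypothetical helper mirroring the two branches in the PR description.
   *
   * @param f stand-in for the file stream, which offers readFully
   * @param in stream used for newer files (assumed to be in-memory)
   * @param bloomFilterLength positive for newer files, negative for older ones
   * @param numBytes size of the bitset announced by the Bloom filter header
   */
  static byte[] readBitset(DataInputStream f, InputStream in, int bloomFilterLength, int numBytes) {
    byte[] bitset = new byte[numBytes];
    try {
      if (bloomFilterLength > 0) {
        // Newer files: the data is assumed to be buffered in memory already,
        // so a single read call can return every remaining byte.
        in.read(bitset);
      } else {
        // Older files: read straight from the file stream, looping until
        // the array is full instead of trusting a single read call.
        f.readFully(bitset);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bitset;
  }

  public static void main(String[] args) {
    byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};
    DataInputStream f = new DataInputStream(new ByteArrayInputStream(data));
    InputStream in = new ByteArrayInputStream(data);
    System.out.println(readBitset(f, in, -1, data.length).length);
    System.out.println(readBitset(f, in, data.length, data.length).length);
  }
}
```

   A single `read` is enough in the in-memory branch because `ByteArrayInputStream.read(byte[])` copies as many bytes as remain, up to the array length, in one call. That guarantee does not extend to arbitrary `InputStream` implementations, which is why the file-stream branch uses `readFully`.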
   
   ### Are these changes tested?
   
   Manual testing.
   
   ### Are there any user-facing changes?
   
   No.
   


