annimesh2809 commented on code in PR #3357:
URL: https://github.com/apache/parquet-java/pull/3357#discussion_r2533562864
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1354,7 +1382,7 @@ private void readVectored(List<ConsecutivePartList> allParts, ChunkListBuilder b
       }
       LOG.debug("Reading {} bytes of data with vectored IO in {} ranges", totalSize, ranges.size());
       // Request a vectored read;
-      f.readVectored(ranges, options.getAllocator());
+      f.readVectored(ranges, new ReleasingAllocator(options.getAllocator(), builder.releaser));
Review Comment:
When I build parquet-mr with Hadoop 3.4.2 without any additional changes, the `testRangeFiltering` test case (and some others) in the `TestParquetReader` suite fails. The `TrackingByteBufferAllocator` reveals that the **unreleased allocation** happens in:
```
Cause:
org.apache.parquet.bytes.TrackingByteBufferAllocator$ByteBufferAllocationStacktraceException: Allocation stacktrace of the first ByteBuffer:
  at org.apache.parquet.bytes.TrackingByteBufferAllocator$ByteBufferAllocationStacktraceException.create(TrackingByteBufferAllocator.java:96)
  at org.apache.parquet.bytes.TrackingByteBufferAllocator.allocate(TrackingByteBufferAllocator.java:136)
  at org.apache.hadoop.fs.impl.VectorIOBufferPool.getBuffer(VectorIOBufferPool.java:65)
  at org.apache.hadoop.fs.RawLocalFileSystem$AsyncHandler.initiateRead(RawLocalFileSystem.java:400)
  at org.apache.hadoop.fs.RawLocalFileSystem$AsyncHandler.access$000(RawLocalFileSystem.java:360)
  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.readVectored(RawLocalFileSystem.java:345)
  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.readVectored(RawLocalFileSystem.java:324)
  at org.apache.hadoop.fs.BufferedFSInputStream.readVectored(BufferedFSInputStream.java:183)
  at org.apache.hadoop.fs.FSDataInputStream.readVectored(FSDataInputStream.java:308)
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readVectored(ChecksumFileSystem.java:474)
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readVectored(ChecksumFileSystem.java:463)
```
The root cause seems to be that `ChecksumFileSystem` (from Hadoop) now supports `readVectored`: https://github.com/apache/hadoop/blob/branch-3.4.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L460-L513.
`ChecksumFileSystem.readVectored` internally does more allocations like:
```
sums.readVectored(checksumRanges, allocate, release);
datas.readVectored(dataRanges, allocate, release);
```
which are not marked for release by `ByteBufferReleaser`.
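For illustration, here is a minimal, hypothetical sketch of how a wrapper like the `ReleasingAllocator` in the diff could register every allocation, including ones Hadoop makes internally, with the releaser. The `Allocator` and `Releaser` types below are simplified stand-ins I made up for this sketch, not the real `org.apache.parquet.bytes.ByteBufferAllocator`/`ByteBufferReleaser` interfaces:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for parquet's ByteBufferAllocator (illustrative only).
interface Allocator {
    ByteBuffer allocate(int size);
    void release(ByteBuffer buffer);
}

// Simplified stand-in for parquet's ByteBufferReleaser (illustrative only).
class Releaser {
    final List<ByteBuffer> toRelease = new ArrayList<>();
    void releaseLater(ByteBuffer buffer) { toRelease.add(buffer); }
}

// Hypothetical sketch: delegate allocation, but record every buffer handed
// out (including ones ChecksumFileSystem.readVectored requests internally)
// so they can all be freed when the read completes.
class ReleasingAllocator implements Allocator {
    private final Allocator delegate;
    private final Releaser releaser;

    ReleasingAllocator(Allocator delegate, Releaser releaser) {
        this.delegate = delegate;
        this.releaser = releaser;
    }

    @Override
    public ByteBuffer allocate(int size) {
        ByteBuffer buffer = delegate.allocate(size);
        releaser.releaseLater(buffer); // internal allocations are tracked too
        return buffer;
    }

    @Override
    public void release(ByteBuffer buffer) {
        delegate.release(buffer);
    }
}

public class ReleasingAllocatorSketch {
    public static void main(String[] args) {
        Releaser releaser = new Releaser();
        Allocator base = new Allocator() {
            public ByteBuffer allocate(int size) { return ByteBuffer.allocate(size); }
            public void release(ByteBuffer buffer) { /* heap buffers: no-op */ }
        };
        Allocator wrapped = new ReleasingAllocator(base, releaser);
        wrapped.allocate(16); // e.g. a checksum-range buffer allocated internally
        wrapped.allocate(64); // e.g. a data-range buffer
        System.out.println(releaser.toRelease.size()); // 2
    }
}
```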
Also, with vectored reads it is not sufficient to mark the buffers returned by the allocator for release: they are sliced internally, so the buffer object handed back to the caller is a different object even though it shares the same underlying memory.
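To illustrate that point with plain JDK code: `ByteBuffer.slice()` returns a new buffer object backed by the same memory, so any bookkeeping keyed on buffer identity will not recognize the slice. The identity set below stands in for whatever identity-based tracking a releaser might use; it is illustrative, not parquet code:

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class SliceIdentityDemo {
    public static void main(String[] args) {
        // Identity-based bookkeeping, as a tracking releaser might do.
        Set<ByteBuffer> tracked = Collections.newSetFromMap(new IdentityHashMap<>());

        ByteBuffer allocated = ByteBuffer.allocate(1024);
        tracked.add(allocated); // mark the allocator's buffer for release

        // Vectored-read code may hand back a slice of the allocated buffer.
        allocated.position(100).limit(200);
        ByteBuffer returned = allocated.slice();

        // Same underlying memory: a write through the slice is visible
        // through the original buffer...
        returned.put(0, (byte) 42);
        System.out.println(allocated.get(100)); // 42

        // ...but a different object, so identity-based tracking misses it.
        System.out.println(tracked.contains(returned)); // false
    }
}
```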
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]