annimesh2809 commented on code in PR #3357:
URL: https://github.com/apache/parquet-java/pull/3357#discussion_r2533562864
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1354,7 +1382,7 @@ private void readVectored(List<ConsecutivePartList> allParts, ChunkListBuilder b
       }
       LOG.debug("Reading {} bytes of data with vectored IO in {} ranges", totalSize, ranges.size());
       // Request a vectored read;
-      f.readVectored(ranges, options.getAllocator());
+      f.readVectored(ranges, new ReleasingAllocator(options.getAllocator(), builder.releaser));
Review Comment:
When I build parquet-mr with Hadoop 3.4.2 without any additional changes, the `testRangeFiltering` test case (and some others) in the `TestParquetReader` suite fails. The `TrackingByteBufferAllocator` reveals that the **unreleased allocation** happens in:
```
Cause:
org.apache.parquet.bytes.TrackingByteBufferAllocator$ByteBufferAllocationStacktraceException: Allocation stacktrace of the first ByteBuffer:
  at org.apache.parquet.bytes.TrackingByteBufferAllocator$ByteBufferAllocationStacktraceException.create(TrackingByteBufferAllocator.java:96)
  at org.apache.parquet.bytes.TrackingByteBufferAllocator.allocate(TrackingByteBufferAllocator.java:136)
  at org.apache.hadoop.fs.impl.VectorIOBufferPool.getBuffer(VectorIOBufferPool.java:65)
  at org.apache.hadoop.fs.RawLocalFileSystem$AsyncHandler.initiateRead(RawLocalFileSystem.java:400)
  at org.apache.hadoop.fs.RawLocalFileSystem$AsyncHandler.access$000(RawLocalFileSystem.java:360)
  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.readVectored(RawLocalFileSystem.java:345)
  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.readVectored(RawLocalFileSystem.java:324)
  at org.apache.hadoop.fs.BufferedFSInputStream.readVectored(BufferedFSInputStream.java:183)
  at org.apache.hadoop.fs.FSDataInputStream.readVectored(FSDataInputStream.java:308)
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readVectored(ChecksumFileSystem.java:474)
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readVectored(ChecksumFileSystem.java:463)
```
The root cause seems to be that `ChecksumFileSystem` (from Hadoop) now supports `readVectored`: https://github.com/apache/hadoop/blob/branch-3.4.2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L460-L513.
`ChecksumFileSystem.readVectored` internally does more allocations like:
```
sums.readVectored(checksumRanges, allocate, release);
datas.readVectored(dataRanges, allocate, release);
```
which are not marked for release by `ByteBufferReleaser`.
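For illustration, here is a minimal, hypothetical sketch of how a wrapper like the `ReleasingAllocator` in the diff could register every allocation, including ones Hadoop makes internally, with the releaser. The `Allocator` and `Releaser` types below are simplified stand-ins I made up for this sketch, not the real `org.apache.parquet.bytes.ByteBufferAllocator`/`ByteBufferReleaser` interfaces:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for parquet's ByteBufferAllocator (illustrative only).
interface Allocator {
    ByteBuffer allocate(int size);
    void release(ByteBuffer buffer);
}

// Simplified stand-in for parquet's ByteBufferReleaser (illustrative only).
class Releaser {
    final List<ByteBuffer> toRelease = new ArrayList<>();
    void releaseLater(ByteBuffer buffer) { toRelease.add(buffer); }
}

// Hypothetical sketch: delegate allocation, but record every buffer handed
// out (including ones ChecksumFileSystem.readVectored requests internally)
// so they can all be freed when the read completes.
class ReleasingAllocator implements Allocator {
    private final Allocator delegate;
    private final Releaser releaser;

    ReleasingAllocator(Allocator delegate, Releaser releaser) {
        this.delegate = delegate;
        this.releaser = releaser;
    }

    @Override
    public ByteBuffer allocate(int size) {
        ByteBuffer buffer = delegate.allocate(size);
        releaser.releaseLater(buffer); // internal allocations are tracked too
        return buffer;
    }

    @Override
    public void release(ByteBuffer buffer) {
        delegate.release(buffer);
    }
}

public class ReleasingAllocatorSketch {
    public static void main(String[] args) {
        Releaser releaser = new Releaser();
        Allocator base = new Allocator() {
            public ByteBuffer allocate(int size) { return ByteBuffer.allocate(size); }
            public void release(ByteBuffer buffer) { /* heap buffers: no-op */ }
        };
        Allocator wrapped = new ReleasingAllocator(base, releaser);
        wrapped.allocate(16); // e.g. a checksum-range buffer allocated internally
        wrapped.allocate(64); // e.g. a data-range buffer
        System.out.println(releaser.toRelease.size()); // 2
    }
}
```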
Also, with vectored reads it is not sufficient to mark the buffers returned by the allocator for release: they are sliced internally, so the buffer object handed back to the caller is a different object even though it shares the same underlying memory.
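To illustrate that point with plain JDK code: `ByteBuffer.slice()` returns a new buffer object backed by the same memory, so any bookkeeping keyed on buffer identity will not recognize the slice. The identity set below stands in for whatever identity-based tracking a releaser might use; it is illustrative, not parquet code:

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class SliceIdentityDemo {
    public static void main(String[] args) {
        // Identity-based bookkeeping, as a tracking releaser might do.
        Set<ByteBuffer> tracked = Collections.newSetFromMap(new IdentityHashMap<>());

        ByteBuffer allocated = ByteBuffer.allocate(1024);
        tracked.add(allocated); // mark the allocator's buffer for release

        // Vectored-read code may hand back a slice of the allocated buffer.
        allocated.position(100).limit(200);
        ByteBuffer returned = allocated.slice();

        // Same underlying memory: a write through the slice is visible
        // through the original buffer...
        returned.put(0, (byte) 42);
        System.out.println(allocated.get(100)); // 42

        // ...but a different object, so identity-based tracking misses it.
        System.out.println(tracked.contains(returned)); // false
    }
}
```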
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]