vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r617883888
##########
File path:
exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
}
/**
- * Writes a number of pages within corresponding column chunk
+ * Writes a number of pages within corresponding column chunk <br>
+ * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+ * see <a
href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>
Review comment:
1. I have checked heap memory with `VisualVM` after creating
`ColumnChunkPageWriteStore`; the size is the same:
https://ibb.co/xFYqC0m
https://ibb.co/fNB7MBq
2. The allocator is passed to `ColumnChunkPageWriteStore` and
`ColumnChunkPageWriter` as well, and `DrillBuf` really is used in the process
of writing the parquet file.
3. But we convert that `buf` to bytes via `BytesInput.from(buf)` and
`compressedBytes.writeAllTo(buf)`, so all the data is still placed on the heap.
4. We already have several other places where `ColumnChunkPageWriteStore`
is used indirectly.
So it looks like the updated `ColumnChunkPageWriteStore` will manage heap
memory even better when creating parquet files via Drill, and we are safe to
go with the current change.
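To illustrate point 3 above, here is a minimal JDK-only sketch (no Drill or Parquet classes; the class name and buffer sizes are my own for illustration): even when the source buffer is direct (off-heap), draining it into a heap-backed stream copies the payload onto the heap, analogous to what `BytesInput.from(buf)` followed by `writeAllTo(...)` ends up doing in the writer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class DirectToHeapSketch {
  public static void main(String[] args) {
    // Off-heap buffer standing in for a DrillBuf-backed page.
    ByteBuffer direct = ByteBuffer.allocateDirect(16);
    for (int i = 0; i < 16; i++) {
      direct.put((byte) i);
    }
    direct.flip();

    // Draining it through a byte[] into a heap-backed stream copies
    // every byte from direct memory onto the heap.
    ByteArrayOutputStream heapSink = new ByteArrayOutputStream();
    byte[] chunk = new byte[direct.remaining()];
    direct.get(chunk);                      // copy: direct memory -> heap array
    heapSink.write(chunk, 0, chunk.length);

    byte[] onHeap = heapSink.toByteArray();
    System.out.println(onHeap.length);      // 16
    System.out.println(direct.isDirect());  // true
  }
}
```

So the direct buffer only holds the data transiently; the bytes that the writer actually accumulates live on the heap.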
And the proper way to use direct memory more than we do now is to make
improvements in Parquet itself. One of them is
[PARQUET-1771](https://github.com/apache/parquet-mr/pull/749), but that one
will not help here. So I want to proceed with PARQUET-1006. It looks like we
can use a direct-memory `buf` for `ColumnChunkPageWriteStore`,
`ParquetFileWriter`, and `ByteArrayOutputStream`. I am planning to ask the
community about it.
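As a rough sketch of that direction (purely hypothetical; this class and its API are my own illustration, not the actual Parquet or PARQUET-1006 code): an `OutputStream` backed by a direct `ByteBuffer` could let page bytes accumulate off-heap instead of in a `ByteArrayOutputStream`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only: a fixed-capacity OutputStream that
// writes into direct memory rather than a heap byte[].
public class DirectBufferOutputStream extends OutputStream {
  private final ByteBuffer buf;

  public DirectBufferOutputStream(int capacity) {
    this.buf = ByteBuffer.allocateDirect(capacity);
  }

  @Override
  public void write(int b) {
    buf.put((byte) b);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buf.put(b, off, len);
  }

  // Read-only view of the bytes written so far, still off-heap.
  public ByteBuffer toBuffer() {
    ByteBuffer view = buf.duplicate();
    view.flip();
    return view;
  }

  public static void main(String[] args) throws IOException {
    DirectBufferOutputStream out = new DirectBufferOutputStream(64);
    out.write(new byte[] {1, 2, 3});
    ByteBuffer view = out.toBuffer();
    System.out.println(view.remaining());  // 3
    System.out.println(view.isDirect());   // true
  }
}
```

A real implementation would of course need growth/spill handling and allocator integration, which is exactly the kind of change that belongs in Parquet rather than in Drill.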
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]