vdiravka commented on a change in pull request #2143:
URL: https://github.com/apache/drill/pull/2143#discussion_r617883888
##########
File path:
exec/java-exec/src/main/java/org/apache/parquet/hadoop/ParquetColumnChunkPageWriteStore.java
##########
@@ -260,14 +260,16 @@ public long getMemSize() {
}
/**
- * Writes a number of pages within corresponding column chunk
+ * Writes a number of pages within corresponding column chunk <br>
+ * // TODO: the Bloom Filter can be useful in filtering entire row groups,
+ * see <a
href="https://issues.apache.org/jira/browse/DRILL-7895">DRILL-7895</a>
Review comment:
1. I have checked heap memory with `VisualVM` after creating
`ColumnChunkPageWriteStore`; the size is the same:
https://ibb.co/xFYqC0m
https://ibb.co/fNB7MBq
2. The allocator is passed to `ColumnChunkPageWriteStore` and
`ColumnChunkPageWriter` as well, and `DrillBuf` really is used in the process
of writing the parquet file.
3. But we convert that `buf` to bytes via `BytesInput.from(buf)` and
`compressedBytes.writeAllTo(buf)`, so all the data is still placed on the heap.
4. We already have several other places where `ColumnChunkPageWriteStore`
is used indirectly.
So it looks like the updated `ColumnChunkPageWriteStore` will manage heap
memory even better when creating parquet files via Drill, and we are safe to
go with the current change.
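To illustrate point 3 above, here is a minimal JDK-only sketch (no Drill or Parquet classes; the class name and buffer sizes are my own for illustration): even when the source buffer is direct (off-heap), draining it into a heap-backed stream copies the payload onto the heap, analogous to what `BytesInput.from(buf)` followed by `writeAllTo(...)` ends up doing in the writer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

public class DirectToHeapSketch {
  public static void main(String[] args) {
    // Off-heap buffer standing in for a DrillBuf-backed page.
    ByteBuffer direct = ByteBuffer.allocateDirect(16);
    for (int i = 0; i < 16; i++) {
      direct.put((byte) i);
    }
    direct.flip();

    // Draining it through a byte[] into a heap-backed stream copies
    // every byte from direct memory onto the heap.
    ByteArrayOutputStream heapSink = new ByteArrayOutputStream();
    byte[] chunk = new byte[direct.remaining()];
    direct.get(chunk);                      // copy: direct memory -> heap array
    heapSink.write(chunk, 0, chunk.length);

    byte[] onHeap = heapSink.toByteArray();
    System.out.println(onHeap.length);      // 16
    System.out.println(direct.isDirect());  // true
  }
}
```

So the direct buffer only holds the data transiently; the bytes that the writer actually accumulates live on the heap.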
And the proper way to use direct memory more than we do now is to make
improvements in Parquet itself. One of them is
[PARQUET-1771](https://github.com/apache/parquet-mr/pull/749), but that one
will not help here. So I want to proceed with PARQUET-1006. It looks like we
can use a direct-memory `buf` for `ColumnChunkPageWriteStore`,
`ParquetFileWriter`, and `ByteArrayOutputStream`. I am planning to ask the
community about it.
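As a rough sketch of that direction (purely hypothetical; this class and its API are my own illustration, not the actual Parquet or PARQUET-1006 code): an `OutputStream` backed by a direct `ByteBuffer` could let page bytes accumulate off-heap instead of in a `ByteArrayOutputStream`.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration only: a fixed-capacity OutputStream that
// writes into direct memory rather than a heap byte[].
public class DirectBufferOutputStream extends OutputStream {
  private final ByteBuffer buf;

  public DirectBufferOutputStream(int capacity) {
    this.buf = ByteBuffer.allocateDirect(capacity);
  }

  @Override
  public void write(int b) {
    buf.put((byte) b);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    buf.put(b, off, len);
  }

  // Read-only view of the bytes written so far, still off-heap.
  public ByteBuffer toBuffer() {
    ByteBuffer view = buf.duplicate();
    view.flip();
    return view;
  }

  public static void main(String[] args) throws IOException {
    DirectBufferOutputStream out = new DirectBufferOutputStream(64);
    out.write(new byte[] {1, 2, 3});
    ByteBuffer view = out.toBuffer();
    System.out.println(view.remaining());  // 3
    System.out.println(view.isDirect());   // true
  }
}
```

A real implementation would of course need growth/spill handling and allocator integration, which is exactly the kind of change that belongs in Parquet rather than in Drill.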
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]