[
https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881389#comment-15881389
]
ASF GitHub Bot commented on DRILL-5266:
---------------------------------------
Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/749#discussion_r102830312
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java ---
@@ -33,35 +33,52 @@
   ParquetRecordReader parentReader;
   final List<VarLengthColumn<? extends ValueVector>> columns;
   final boolean useAsyncTasks;
+  private final long targetRecordCount;

   public VarLenBinaryReader(ParquetRecordReader parentReader,
                             List<VarLengthColumn<? extends ValueVector>> columns) {
     this.parentReader = parentReader;
     this.columns = columns;
     useAsyncTasks = parentReader.useAsyncColReader;
+
+    // Can't read any more records than fixed width fields will fit.
+    // Note: this calculation is very likely wrong; it is a simplified
+    // version of earlier code, but probably needs even more attention.
+
+    int totalFixedFieldWidth = parentReader.getBitWidthAllFixedFields() / 8;
+    if (totalFixedFieldWidth == 0) {
+      targetRecordCount = 0;
+    } else {
+      targetRecordCount = parentReader.getBatchSize() / totalFixedFieldWidth;
+    }
   }

   /**
    * Reads as many variable length values as possible.
    *
    * @param recordsToReadInThisPass - the number of records recommended for reading from the reader
-   * @param firstColumnStatus - a reference to the first column status in the parquet file to grab metatdata from
+   * @param firstColumnStatus - a reference to the first column status in the Parquet file to grab metadata from
    * @return - the number of fixed length fields that will fit in the batch
    * @throws IOException
    */
   public long readFields(long recordsToReadInThisPass, ColumnReader<?> firstColumnStatus) throws IOException {
-    long recordsReadInCurrentPass = 0;
-
     // write the first 0 offset
     for (VarLengthColumn<?> columnReader : columns) {
       columnReader.reset();
     }
     Stopwatch timer = Stopwatch.createStarted();
-    recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
-    if(useAsyncTasks){
+    long recordsReadInCurrentPass = determineSizesSerial(recordsToReadInThisPass);
+
+    // Can't read any more records than fixed width fields will fit.
+
+    if (targetRecordCount > 0) {
+      recordsToReadInThisPass = Math.min(recordsToReadInThisPass, targetRecordCount);
--- End diff --
I think you mean to update recordsReadInCurrentPass.
recordsToReadInThisPass is not used after this point, so what is the point of
updating it?
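The capping logic under discussion can be sketched in isolation as follows. This is a hypothetical stand-alone version, not the Drill code itself: `batchSizeBytes` and `bitWidthAllFixedFields` are stand-in parameters mirroring `parentReader.getBatchSize()` and `parentReader.getBitWidthAllFixedFields()` from the diff above.

```java
public class TargetRecordCount {
    // Hypothetical sketch of the constructor logic in the diff: cap the
    // number of records per batch so that the fixed-width columns alone
    // do not exceed the batch's byte budget.
    static long targetRecordCount(long batchSizeBytes, int bitWidthAllFixedFields) {
        // Convert the summed per-row bit width of all fixed fields to bytes.
        int totalFixedFieldWidth = bitWidthAllFixedFields / 8;
        if (totalFixedFieldWidth == 0) {
            // No fixed-width columns: no cap can be derived this way.
            return 0;
        }
        return batchSizeBytes / totalFixedFieldWidth;
    }

    public static void main(String[] args) {
        // e.g. a 256 KiB batch with 64 bits (8 bytes) of fixed data per row
        System.out.println(targetRecordCount(262144, 64)); // 32768
    }
}
```

As the review comment notes, the cap only has an effect if it is applied to the value that the caller actually consumes (recordsReadInCurrentPass), not to a local parameter that is no longer read.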
> Parquet Reader produces "low density" record batches - bits vs. bytes
> ---------------------------------------------------------------------
>
> Key: DRILL-5266
> URL: https://issues.apache.org/jira/browse/DRILL-5266
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.10
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Labels: ready-to-commit
>
> Testing with the managed sort revealed that, for at least one file, Parquet
> produces "low-density" batches: batches in which only 5% of each value vector
> contains actual data, with the rest being unused space. When fed into the
> sort, we end up buffering 95% of wasted space, using only 5% of available
> memory to hold actual query data. The result is poor performance of the sort
> as it must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use
> estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
> T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
> c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
> Records: 1129, Total size: 32006144, Row width: 28350, Density: 5}
> {code}
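For reference, the "density" figure in the log above appears to be data size expressed as a percentage of the allocated vector size. The sketch below is an assumption, not Drill's actual code; in particular the ceiling rounding is a guess, chosen because it reproduces the logged figures (4516/131072 yields 4, 30327/49152 yields 62).

```java
public class BatchDensity {
    // Hypothetical helper: percentage of a vector's allocated bytes that
    // hold real data. Ceiling rounding is assumed to match the log above.
    static int densityPercent(long dataSize, long vectorSize) {
        return (int) Math.ceil(100.0 * dataSize / vectorSize);
    }

    public static void main(String[] args) {
        // cs_sold_date_sk: data size 4516, vector size 131072 -> density 4
        System.out.println(densityPercent(4516, 131072));
        // c_email_address: data size 30327, vector size 49152 -> density 62
        System.out.println(densityPercent(30327, 49152));
    }
}
```

On this reading, a density of 5 for the whole batch means 95% of the buffered memory is unused space, which is exactly the waste the issue describes feeding into the sort.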
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)