[
https://issues.apache.org/jira/browse/DRILL-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322882#comment-16322882
]
ASF GitHub Bot commented on DRILL-5846:
---------------------------------------
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1060#discussion_r161032779
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableColumnReader.java
---
@@ -165,17 +181,133 @@
           + "Run Length: {} \t Null Run Length: {} \t readCount: {} \t writeCount: {} \t "
           + "recordsReadInThisIteration: {} \t valuesReadInCurrentPass: {} \t "
           + "totalValuesRead: {} \t readStartInBytes: {} \t readLength: {} \t pageReader.byteLength: {} \t "
-          + "definitionLevelsRead: {} \t pageReader.currentPageCount: {}",
+          + "currPageValuesProcessed: {} \t pageReader.currentPageCount: {}",
         recordsToReadInThisPass, runLength, nullRunLength, readCount,
         writeCount, recordsReadInThisIteration, valuesReadInCurrentPass,
         totalValuesRead, readStartInBytes, readLength, pageReader.byteLength,
-        definitionLevelsRead, pageReader.currentPageCount);
+        currPageValuesProcessed, pageReader.currentPageCount);
     }
     valueVec.getMutator().setValueCount(valuesReadInCurrentPass);
   }
+  private final void processPagesBulk(long recordsToReadInThisPass) throws IOException {
+    readStartInBytes = 0;
+    readLength = 0;
+    readLengthInBits = 0;
+    recordsReadInThisIteration = 0;
+    vectorData = castedBaseVector.getBuffer();
+
+    // values need to be spaced out where nulls appear in the column;
+    // leaving blank space for nulls allows for random access to values.
+    // To optimize copying data out of the buffered disk stream, runs of defined
+    // values are located and copied together, rather than copying individual values.
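+    // e.g., definition levels [1, 0, 0, 1] place values in vector slots 0 and 3,
+    // leaving slots 1 and 2 blank for the nulls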
+
+    int valueCount = 0;
+    final int maxValuesToProcess = Math.min((int) recordsToReadInThisPass, valueVec.getValueCapacity());
+
+    // To handle the case where the page has already been loaded
+    if (pageReader.definitionLevels != null && currPageValuesProcessed == 0) {
+      definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount);
+    }
+
+    while (valueCount < maxValuesToProcess) {
+
+      // read a page if needed
+      if (!pageReader.hasPage() || (currPageValuesProcessed == pageReader.currentPageCount)) {
+        if (!pageReader.next()) {
+          break;
+        }
+
+        // New page. Reset the definition level.
+        currPageValuesProcessed = 0;
+        recordsReadInThisIteration = 0;
+        readStartInBytes = 0;
+
+        // Update the Definition Level reader
+        definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount);
+      }
+
+      definitionLevelWrapper.readFirstIntegerIfNeeded();
+
+      int numNullValues = 0;
+      int numNonNullValues = 0;
+      final int remaining = maxValuesToProcess - valueCount;
+      int currBatchSz = Math.min(remaining, (pageReader.currentPageCount - currPageValuesProcessed));
+      assert currBatchSz > 0;
+
+      // Let's skip the next run of nulls, if any ...
+      int idx;
+      for (idx = 0; idx < currBatchSz; ++idx) {
+        if (definitionLevelWrapper.readCurrInteger() == 1) {
+          break; // non-null value encountered
+        }
+        definitionLevelWrapper.nextIntegerIfNotEOF();
+      }
+      numNullValues += idx;
+
+      // Write the nulls, if any
--- End diff ---
This is the original logic for processing fixed-precision columns, just made
more efficient; the intent is to determine how many null values there are
before processing the non-null values. We could have avoided this for-loop if
we controlled the Parquet ValuesReader API; I was tempted to do that but
decided to prioritize my tasks. The good news is that I have recorded all such
improvement opportunities and will try to tackle them at some point.
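As a concrete illustration of the run scan discussed above, here is a minimal,
self-contained sketch; the class and method names are hypothetical, and a plain
int[] stands in for Drill's definitionLevelWrapper abstraction:

// Standalone sketch of the null-run scan (hypothetical helper, not part of the
// patch). For a flat optional column, each definition level is 0 (null) or
// 1 (defined), so a run of nulls is simply a run of zeros.
final class DefinitionLevelRuns {

  // Counts consecutive nulls starting at 'start', scanning at most up to
  // 'bound' (exclusive); this is the same work the for-loop above performs.
  static int countNullRun(final int[] defLevels, final int start, final int bound) {
    int idx = start;
    while (idx < bound && defLevels[idx] == 0) {
      ++idx;
    }
    return idx - start;
  }

  public static void main(String[] args) {
    final int[] defLevels = {0, 0, 1, 1, 0, 1};
    // Prints 2: two nulls precede the first non-null value.
    System.out.println(countNullRun(defLevels, 0, defLevels.length));
  }
}

If the ValuesReader API exposed such a run-length query directly (say, a
hypothetical readNullRunLength() method), the per-value loop could collapse
into a single call, which is the improvement opportunity noted above.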
> Improve Parquet Reader Performance for Flat Data types
> -------------------------------------------------------
>
> Key: DRILL-5846
> URL: https://issues.apache.org/jira/browse/DRILL-5846
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.11.0
> Reporter: salim achouche
> Assignee: salim achouche
> Labels: performance
> Fix For: 1.13.0
>
>
> The Parquet Reader is a key use-case for Drill. This JIRA is an attempt to
> further improve the Parquet Reader performance, as several users reported that
> Parquet parsing represents the lion's share of the overall query execution. It
> tracks Flat Data types only, as Nested DTs might involve functional and
> processing enhancements (e.g., a nested column can be seen as a Document; a
> user might want to perform operations scoped at the document level, with no
> need to span all rows). Another JIRA will be created to handle the nested
> columns use-case.