[ https://issues.apache.org/jira/browse/DRILL-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368734 ]
Paul Rogers commented on DRILL-6166:
------------------------------------
The behavior described is by design: the `RecordBatchSizer` was intended to be
used only with single batches (and, of those, only batches without an SV2).
The no-SV2 assumption arises because the only operator that accepts an SV2 is
the selection-vector remover (SVR). Since the SVR emits one batch out for every
batch in, if the input batch is sized properly, the output batch will be no
larger than the (properly sized) input batch.
(At present, the SVR does not attempt to coalesce mostly-empty batches; it will
happily produce non-performant batches with just a few records. But that is a
separate issue.)
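For context, here is a minimal sketch of how an operator reads rows through an
SV2 (the `SelectionVector2` accessors match Drill's class; the surrounding
method is hypothetical):

```java
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.record.selection.SelectionVector2;

public class Sv2Example {
  // The SV2 holds 2-byte indices into the one underlying batch, so the
  // live row count (sv2.getCount()) can never exceed the physical row
  // count. This is why a properly sized input implies a bounded output.
  void consumeThroughSv2(RecordBatch incoming) {
    SelectionVector2 sv2 = incoming.getSelectionVector2();
    for (int i = 0; i < sv2.getCount(); i++) {
      int physicalRow = sv2.getIndex(i); // logical row i -> physical row
      // ... read column values at physicalRow ...
    }
  }
}
```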
Extend this discussion to the SV4. Few operators produce an SV4. (I believe
only sort does, but perhaps there are others.) The planner should insert an SVR
after any operator that produces an SV4. Since, as explained above, the SVR
does not need to use the "sizer", the same reasoning explains why the "sizer"
need not handle an SV4.
In what situation are you trying to use the "sizer" such that you receive an
SV4? The stack trace shows the operators above (downstream of) the point where
the sizer is used, but not those below (upstream) that supplied the batches.
The simplest reading is that the planner somehow omitted an SVR. If, however,
the streaming aggregate created its own SV4, then the sizer is not needed:
batch sizing should have been done on the (SV-less) input batches.
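If the immediate goal is simply to avoid the crash, one option is to skip
sizing when the incoming batch carries an SV4. A minimal sketch, assuming the
memory manager can fall back to estimates made on the input side (the method
below is hypothetical; `SelectionVectorMode` is Drill's existing enum):

```java
import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
import org.apache.drill.exec.record.RecordBatch;
import org.apache.drill.exec.record.RecordBatchSizer;

public class SizerGuardExample {
  // Hypothetical guard: size only SV-less (or SV2) batches and skip
  // hyper-batches, for which per-view sizing is not well defined.
  void updateMemoryEstimate(RecordBatch incoming) {
    if (incoming.getSchema().getSelectionVectorMode() == SelectionVectorMode.FOUR_BYTE) {
      return; // hyper-batch: rely on sizing done upstream of the sort
    }
    RecordBatchSizer sizer = new RecordBatchSizer(incoming);
    // ... use the sizer's row-count and row-width estimates ...
  }
}
```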
Now, let's set aside all the above and assume that we really do want to apply
the sizer to an SV4 (hyper-batch). Doing so is really quite difficult. The SV4
provides a "view" into a much larger collection of single batches. Each view
sees the same memory, but selects a different subset of rows. Since the
"sizer" works by summing memory sizes, it will be "fooled" by a hyper-batch: it
will think the entire hyper-batch memory belongs to the current view (assuming
the sizer were modified to get the ledgers for all the vectors that make up the
hyper-batch). But there is no way to decide which memory goes with view 1 (the
first batch) and which goes with view 2 (the second batch): they all share the
same memory pool.
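To see why, consider how an SV4 addresses rows. Each entry packs a batch index
and a row index into one int (the packing matches Drill's `SelectionVector4`;
the method itself is illustrative):

```java
import org.apache.drill.exec.record.selection.SelectionVector4;

public class Sv4Example {
  // Each SV4 entry packs a batch index (high 16 bits) and a row index
  // (low 16 bits). Different views pick different (batch, row) pairs out
  // of the same shared pool of batches, so a batch's memory cannot be
  // attributed to any single view.
  void showAddressing(SelectionVector4 sv4) {
    for (int i = 0; i < sv4.getCount(); i++) {
      int packed = sv4.get(i);
      int batchIndex = packed >>> 16; // which underlying batch
      int rowIndex = packed & 0xFFFF; // which row within that batch
      // vectors[batchIndex] holds the memory shared by all views
    }
  }
}
```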
It might be possible to create a new sizer that (sketched below):
* Sums all ledgers (by iterating over the vectors within each `VectorWrapper`).
* Sums all rows in the hyper-vector (not just those in the current view).
* Uses the existing calculations to work out average sizes.
* Applies the resulting numbers to all views of the same hyper-batch.
* Repeats the above if/when we get a view on a new hyper-batch.
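A rough sketch of that approach, assuming `getBufferSize()` is an acceptable
stand-in for full ledger accounting (the method name is mine):

```java
import org.apache.drill.exec.record.VectorAccessible;
import org.apache.drill.exec.record.VectorWrapper;
import org.apache.drill.exec.vector.ValueVector;

public class HyperSizerSketch {
  // Sums bytes across every member vector of the hyper-batch and rows
  // across every member batch, then derives an average row width that
  // would apply to all views of this hyper-batch.
  static int estimateAvgRowWidth(VectorAccessible hyperBatch) {
    long totalBytes = 0;
    long totalRows = 0;
    boolean firstColumn = true;
    for (VectorWrapper<?> wrapper : hyperBatch) {
      // getValueVectors() returns one vector per member batch of the
      // hyper-batch for this column (valid only when wrapper.isHyper()).
      for (ValueVector vector : wrapper.getValueVectors()) {
        totalBytes += vector.getBufferSize();
        if (firstColumn) {
          totalRows += vector.getAccessor().getValueCount();
        }
      }
      firstColumn = false;
    }
    return totalRows == 0 ? 0 : (int) (totalBytes / totalRows);
  }
}
```

The result would then be cached and reused for every view until a view on a new
hyper-batch arrives.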
The above is possible, but complex. So, better to figure out if it is really
needed before we go down this path. (As noted above, such a solution probably
is *not* needed.)
> RecordBatchSizer does not handle hyper vectors
> ----------------------------------------------
>
> Key: DRILL-6166
> URL: https://issues.apache.org/jira/browse/DRILL-6166
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.12.0
> Reporter: Padma Penumarthy
> Assignee: Padma Penumarthy
> Priority: Critical
> Fix For: 1.13.0
>
>
> RecordBatchSizer throws an exception when incoming batch has hyper vector.
> (java.lang.UnsupportedOperationException) null
> org.apache.drill.exec.record.HyperVectorWrapper.getValueVector():61
> org.apache.drill.exec.record.RecordBatchSizer.<init>():346
> org.apache.drill.exec.record.RecordBatchSizer.<init>():311
> org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch$StreamingAggregateMemoryManager.update():198
> org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext():328
> org.apache.drill.exec.record.AbstractRecordBatch.next():164
> org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():228
> org.apache.drill.exec.physical.impl.BaseRootExec.next():105
> org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext():155
> org.apache.drill.exec.physical.impl.BaseRootExec.next():95
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():233
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> java.lang.Thread.run():745