[
https://issues.apache.org/jira/browse/DRILL-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424555#comment-16424555
]
salim achouche commented on DRILL-6202:
---------------------------------------
This is my take on the Drill boundary checks:
_*Short Term -*_
* Ideally, the Drill boundary checks should always be on as long as
** The impact of a Drillbit process crash (or data corruption) is severe, since
there is no built-in fault tolerance
** The code is extensible and extensions are allowed to access direct memory
* Having said that, my priority would have been to minimize the cost
associated with these checks instead of completely turning them off
** This is no different from Java's behavior with regard to array boundary
checks
* How do we do that? Actually there are multiple strategies (which could be
combined)
** Fine-grained checks
*** Add boundary checks within +all+ DrillBuf data accessors
*** Invoke the accessor API within a loop and ensure the JVM is able to
optimize the checks
**** This will help you answer the question(s) around whether we access DM
directly or through Netty
** Caller override
*** Allow the caller to disable checks that are deemed too expensive or not
easily optimizable by HotSpot (e.g., Reference Checks)
*** This pattern works well for a centralized layer (e.g., Paul's accessor
framework) but not for extensions, as they cannot always be trusted to do the
right thing
*** To mitigate this, we could always have an auxiliary flag that forces
execution of such checks if set; that is, it overrides untrusted callers
**** This should be done if a crash or corruption is observed
** Bulk Processing
*** Bulk accessor APIs will allow +all the checks+ to be performed, but at a
minimal (amortized) cost
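A minimal Java sketch of how the three strategies above could fit together. Everything here is hypothetical and only for illustration: CheckedBuffer, forceChecks, and getInts are made-up names, not Drill's DrillBuf API, and java.nio.ByteBuffer stands in for direct memory (it still validates indexes itself, which keeps the sketch memory-safe; a real direct-memory wrapper would not have that safety net).

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a DrillBuf-like wrapper; not Drill's actual API.
final class CheckedBuffer {
    private final ByteBuffer buf;
    // Auxiliary flag: when true, checks run even if a caller asked to skip them.
    private final boolean forceChecks;

    CheckedBuffer(int capacity, boolean forceChecks) {
        this.buf = ByteBuffer.allocateDirect(capacity);
        this.forceChecks = forceChecks;
    }

    // Single range check shared by all accessors.
    private void check(int index, int length) {
        if (index < 0 || length < 0 || index > buf.capacity() - length) {
            throw new IndexOutOfBoundsException(
                "index=" + index + ", length=" + length + ", capacity=" + buf.capacity());
        }
    }

    // Fine-grained: every access is checked; a simple check like this is the
    // kind the JIT may be able to hoist or eliminate when called in a loop.
    int getInt(int index) {
        check(index, Integer.BYTES);
        return buf.getInt(index);
    }

    void setInt(int index, int value) {
        check(index, Integer.BYTES);
        buf.putInt(index, value);
    }

    // Caller override: a trusted caller may skip the check,
    // unless the auxiliary flag forces it back on.
    int getInt(int index, boolean callerWantsCheck) {
        if (callerWantsCheck || forceChecks) {
            check(index, Integer.BYTES);
        }
        return buf.getInt(index);
    }

    // Bulk: one check covers the whole range, so its cost is amortized
    // over 'count' values.
    void getInts(int index, int[] dst, int count) {
        if (count < 0 || count > dst.length) {
            throw new IndexOutOfBoundsException("count=" + count);
        }
        check(index, Math.multiplyExact(count, Integer.BYTES));
        for (int i = 0; i < count; i++) {
            dst[i] = buf.getInt(index + i * Integer.BYTES);
        }
    }
}
```

With forceChecks set, a call such as getInt(100, false) on a 16-byte buffer still fails the wrapper's own check, which is the "override untrusted callers" behavior described above.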
_*Long Term -*_
* With the new Accessor Framework in place, all DM checks should reside
primarily within this layer
** The promise of this layer is that other memory formats can be transparently
substituted (e.g., Apache Arrow)
* The question on whether the runtime checks are enabled by default becomes
less important
** The chance of crash / corruption is highly minimized
** It should be rather easy for this layer to optimize the runtime checks;
then the question becomes "why not?"
_*Question -*_
* Your Jira doesn't quite explain
** Why you intend to deprecate the use of IndexOutOfBoundsException (since it
is an unchecked exception)
** What other mechanism would replace it
*NOTE -*
* To minimize bookkeeping complexity, Drill operators allocate memory up front
for the variable-length value vectors to minimize the cost of re-allocs
* The setSafe() APIs are called (at least for Parquet) when the associated
column
** Has enough VV space to insert the new value(s)
** Can extend the current VV to the next power of two; the setSafe() API is
responsible for extending the vector(s)
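To make the setSafe() behavior concrete, here is a minimal Java sketch that grows a vector to the next power of two using an explicit capacity check instead of catching IndexOutOfBoundsException. GrowableIntVector and nextPowerOfTwo are hypothetical names chosen for illustration, and a heap int[] stands in for Drill's value vector memory; this is not Drill's implementation.

```java
import java.util.Arrays;

// Hypothetical fixed-width vector; illustrates the growth policy only.
final class GrowableIntVector {
    private int[] data;

    GrowableIntVector(int initialCapacity) {
        data = new int[Math.max(1, initialCapacity)];
    }

    int capacity() {
        return data.length;
    }

    // Smallest power of two >= n, valid for 1 <= n <= 2^30.
    static int nextPowerOfTwo(int n) {
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    // Explicit capacity check: works the same whether or not
    // runtime bounds checking is enabled elsewhere.
    void setSafe(int index, int value) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("negative index: " + index);
        }
        if (index >= (1 << 30)) {
            throw new IllegalArgumentException("index too large for this sketch: " + index);
        }
        if (index >= data.length) {
            data = Arrays.copyOf(data, nextPowerOfTwo(index + 1));
        }
        data[index] = value;
    }

    int get(int index) {
        return data[index];
    }
}
```

Because the resize decision comes from comparing the index against the capacity, it does not depend on an exception being thrown, which is the reliability concern raised in the issue description.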
> Deprecate usage of IndexOutOfBoundsException to re-alloc vectors
> ----------------------------------------------------------------
>
> Key: DRILL-6202
> URL: https://issues.apache.org/jira/browse/DRILL-6202
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Vlad Rozov
> Assignee: Vlad Rozov
> Priority: Major
> Fix For: 1.14.0
>
>
> As bounds checking may be enabled or disabled, using
> IndexOutOfBoundsException to resize vectors is unreliable. It works only when
> bounds checking is enabled.