[ 
https://issues.apache.org/jira/browse/DRILL-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424555#comment-16424555
 ] 

salim achouche commented on DRILL-6202:
---------------------------------------

This is my take on the Drill boundary checks:

_*Short Term -*_
 * Ideally, the Drill boundary checks should always be on, as long as
 ** The impact of a Drillbit process crash (or data corruption) is big since 
there is no built-in fault-tolerance
 ** The code is extensible and extensions are allowed to access direct memory
 * Having said that, my priority would have been to minimize the cost 
associated with these checks instead of completely turning them off
 ** This is no different from Java's behavior with regard to array boundary 
checks
 * How do we do that? There are multiple strategies, which can be combined (a 
sketch illustrating them follows this list)
 ** Fine-grained checks
 *** Add boundary checks within +all+ DrillBuf data accessors
 *** Invoke the accessor API within a loop and ensure the JVM is able to 
optimize away the checks
 **** This will help answer the question(s) around whether we access DM 
directly or go through Netty
 ** Caller override
 *** Allow the caller to disable checks that are deemed too expensive or not 
easily optimizable by HotSpot (e.g., reference checks)
 *** This pattern works well for a centralized layer (e.g., Paul's accessor 
framework) but not for extensions, as they cannot always be trusted to do the 
right thing
 *** To mitigate this, we could have an auxiliary flag that, when set, forces 
execution of such checks; that is, it overrides untrusted callers
 **** This should be done if a crash or corruption is observed
 ** Bulk Processing
 *** Bulk accessor APIs will allow +all the checks+ to be performed but with a 
minimal (amortized) cost

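To make the three strategies above concrete, here is a minimal, hypothetical Java 
sketch; the names (CheckedBuf, setIntUnchecked, the drill.force.bounds.checks 
property) are illustrative only and are not Drill's actual DrillBuf API or 
configuration:

{code:java}
import java.nio.ByteBuffer;

public class CheckedBuf {
    // Auxiliary flag: when set, it forces the checks back on even for callers
    // that asked for the unchecked path (i.e., it overrides untrusted callers).
    private static final boolean FORCE_CHECKS =
        Boolean.parseBoolean(System.getProperty("drill.force.bounds.checks", "false"));

    private final ByteBuffer buf;   // stand-in for a slice of direct memory

    public CheckedBuf(int capacity) {
        this.buf = ByteBuffer.allocateDirect(capacity);
    }

    // Fine-grained accessor: a bounds check on every call.  Inside a tight loop
    // the JIT can often hoist or eliminate the check, much like Java's own array
    // bounds checks.
    public void setInt(int index, int value) {
        if (index < 0 || index + Integer.BYTES > buf.capacity()) {
            throw new IndexOutOfBoundsException("index " + index + ", capacity " + buf.capacity());
        }
        buf.putInt(index, value);
    }

    // Caller override: a trusted, centralized layer may skip the per-call check,
    // but the auxiliary flag can force it back on if a crash or corruption is observed.
    public void setIntUnchecked(int index, int value) {
        if (FORCE_CHECKS) {
            setInt(index, value);
        } else {
            buf.putInt(index, value);
        }
    }

    // Bulk processing: all the checks are still performed, but once per batch,
    // so their cost is amortized over many values.
    public void setInts(int startIndex, int[] values) {
        long end = (long) startIndex + (long) values.length * Integer.BYTES;
        if (startIndex < 0 || end > buf.capacity()) {
            throw new IndexOutOfBoundsException("range [" + startIndex + ", " + end + ")");
        }
        int offset = startIndex;
        for (int v : values) {
            buf.putInt(offset, v);   // no per-element check needed inside the loop
            offset += Integer.BYTES;
        }
    }
}
{code}
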
_*Long Term -*_
 * With the new Accessor Framework in place, all direct memory checks should 
live primarily within this layer
 ** The promise of this layer is that other memory formats can be transparently 
substituted (e.g., Apache Arrow)
 * The question of whether the runtime checks are enabled by default becomes 
less important
 ** The chance of a crash / corruption is greatly reduced
 ** It should be rather easy for this layer to optimize the runtime checks; 
then the question becomes "why not?"

_*Question -*_
 * Your Jira doesn't quite explain
 ** "why" you intend to deprecate the IndexOutOfBoundsException (since it is an 
unchecked exception)
 ** and what other mechanism you intend to replace it with
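
For reference, the pattern I understand this Jira wants to deprecate looks roughly 
like the following (illustrative only, not actual Drill code): rely on the unchecked 
IndexOutOfBoundsException thrown by a bounds-checked write to trigger a re-alloc and 
a retry; with bounds checking disabled, the exception never fires and the write can 
silently corrupt memory instead:

{code:java}
public final class SetSafeViaException {
    private SetSafeViaException() {}

    // Minimal vector abstraction for the sketch.
    interface IntVector {
        void set(int index, int value);   // may throw IndexOutOfBoundsException
        void reAlloc();                   // grow the underlying buffer
    }

    // "setSafe" implemented by catching the exception and retrying after a re-alloc.
    // The catch block is only ever reached when bounds checking is enabled.
    static void setSafe(IntVector vector, int index, int value) {
        while (true) {
            try {
                vector.set(index, value);
                return;
            } catch (IndexOutOfBoundsException e) {
                vector.reAlloc();
            }
        }
    }
}
{code}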


*NOTE -* 
 * To minimize bookkeeping complexity, Drill operators allocate memory for the 
variable-length value vectors upfront, which reduces the cost of re-allocs
 * The setSafe() APIs are called (at least for Parquet) when the associated 
column either
 ** Has enough value vector (VV) space to insert the new value(s), or
 ** Can extend the current VV to the next power of two; the setSafe() API is 
responsible for extending the vector(s) (a sketch follows this list)
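
A minimal sketch of that setSafe() behavior, under the assumption that the vector 
grows to the next power of two when the index does not fit (hypothetical names; 
SimpleIntVector is not Drill's actual value vector implementation):

{code:java}
import java.util.Arrays;

public class SimpleIntVector {
    private int[] data;

    public SimpleIntVector(int initialCapacity) {
        // Operators pre-allocate generously upfront to keep re-allocs rare.
        this.data = new int[Math.max(initialCapacity, 1)];
    }

    // setSafe(): write the value, growing the backing storage to the next power
    // of two above the index when it does not fit, instead of relying on an
    // IndexOutOfBoundsException to signal that a re-alloc is needed.
    public void setSafe(int index, int value) {
        if (index >= data.length) {
            int newCapacity = Integer.highestOneBit(index) << 1;  // smallest power of two > index
            data = Arrays.copyOf(data, newCapacity);
        }
        data[index] = value;
    }
}
{code}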

> Deprecate usage of IndexOutOfBoundsException to re-alloc vectors
> ----------------------------------------------------------------
>
>                 Key: DRILL-6202
>                 URL: https://issues.apache.org/jira/browse/DRILL-6202
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Vlad Rozov
>            Assignee: Vlad Rozov
>            Priority: Major
>             Fix For: 1.14.0
>
>
> As bounds checking may be enabled or disabled, using 
> IndexOutOfBoundsException to resize vectors is unreliable. It works only when 
> bounds checking is enabled.


