I've started implementing a state machine to track this and detect the
cases that you are mentioning.
We can use it as a tool to track this better.
It makes sense to me to be able to provide some kind of configuration to
the vector when creating it.

On Fri, Aug 28, 2015 at 11:05 AM, Jason Altekruse <[email protected]>
wrote:

> Steven makes a good point, this is not something that can currently be more
> strictly enforced, because of the way we use the primitive vectors inside
> the composite ones like the variable length vectors.
>
> We should try to come up with all of the "constraints" that are violated
> (or not explicitly constrained) due to current usages of the vectors. One
> more I can think of is the maximum size constraint. This only applies to
> batches, not the vectors themselves. I know we have gotten caught a few
> times with operators that were not constraining batch size, and
> unfortunately with the way we use the vectors to represent lists and nested
> structures, we can have valid vectors with many more than the max batch
> size (65K) number of elements. There should be a consistent way that we
> constrain the batch size, as the vectors actually accept individual writes
> and have their counts explicitly set, they would seem to be good candidates
> for enforcing such a constraint, but the way they are re-purposed makes
> this currently impossible without breaking other functionality.
>
> Would it make any sense to pass an optional max size into the creation of a
> vector? We would not want to check it with each write, but we could enforce
> it only in the reallocation method. This would allow for the vectors to
> sometimes have a constrained size that is enforced and other times they
> could be allowed to have an unlimited size. We would likely want to make it
> very clear when the value would be set, and updating all current vector
> creations appropriately could be a rather large change.
>
> Similarly, we may be able to create a distinction between an interior
> vector and a "public" vector to enforce this "allocate > mutate >
> setvaluecount > access > clear" pattern for vectors that represent an
> actual column of data in the SQL sense. This seems a little trickier
> because it involves not only changing more of the vectors to enforce this
> access pattern, but further complicating it to make this optional.
>
> On Thu, Aug 27, 2015 at 9:03 PM, Aman Sinha <[email protected]> wrote:
>
> > Based on the discussion it looks like doing what the
> first_value/last_value
> > window function needs from the value vectors without violating the
> > recommended state transition requires a bit more thought such that we
> don't
> > introduce regression. Since testing is blocked on these for 1.2, can
> Hakim
> > proceed with his current fix ?  We could create a JIRA to revisit it post
> > 1.2...
> >
> > Aman
> >
> > On Wed, Aug 26, 2015 at 3:07 PM, Julien Le Dem <[email protected]>
> wrote:
> >
> > > I can take a look at the Vectors and add asserts to enforce the
> contract
> > is
> > > respected.
> > >
> > > On Wed, Aug 26, 2015 at 2:52 PM, Steven Phillips <[email protected]>
> > > wrote:
> > >
> > > > One possible exception to the access pattern occurs when vectors wrap
> > > other
> > > > vectors. Specifically, the offset vectors in Variable Length and
> > Repeated
> > > > vectors. These vectors are accessed and mutated multiple times. If we
> > are
> > > > going to implement strict enforcement, we need to consider that case.
> > > >
> > > > On Tue, Aug 25, 2015 at 7:15 PM, Jacques Nadeau <[email protected]>
> > > > wrote:
> > > >
> > > > > Yes, by recommendation is to correct the usage in StreamingAggBatch
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Tue, Aug 25, 2015 at 4:52 PM, Abdel Hakim Deneche <
> > > > > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > I think zeroVector() is mainly used to fill the vector with
> zeros,
> > > > which
> > > > > is
> > > > > > fine if you call it while the vector is in "mutate" state, but
> > > > > > StreamingAggBatch does actually call it after setting the value
> > count
> > > > of
> > > > > > the value vector which is against the paradigm.
> > > > > >
> > > > > >
> > > > > > On Tue, Aug 25, 2015 at 3:51 PM, Jacques Nadeau <
> > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > In all but one situations, this is an internal concern (making
> > sure
> > > > to
> > > > > > zero
> > > > > > > out the memory).  For fixed width vectors, there is an
> assumption
> > > > that
> > > > > an
> > > > > > > initial allocation is clean memory (e.g. all zeros in the faces
> > of
> > > an
> > > > > int
> > > > > > > vector).  So this should be pulled off a public vector
> interface.
> > > > The
> > > > > > one
> > > > > > > place where it is being used today is StreamingAggBatch and I
> > think
> > > > we
> > > > > > > should fix that to follow the state paradigm described above.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jacques Nadeau
> > > > > > > CTO and Co-Founder, Dremio
> > > > > > >
> > > > > > > On Tue, Aug 25, 2015 at 3:41 PM, Abdel Hakim Deneche <
> > > > > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Another question: FixedWidthVector interface defines a
> > > zeroVector()
> > > > > > > method
> > > > > > > > that
> > > > > > > > "Zero out the underlying buffer backing this vector"
> according
> > to
> > > > > it's
> > > > > > > > javadoc.
> > > > > > > >
> > > > > > > > Where does this method fit in the value vector states
> described
> > > > > > earlier ?
> > > > > > > > it doesn't clear the vector yet it doesn't reset everything
> to
> > > the
> > > > > > after
> > > > > > > > allocate state.
> > > > > > > >
> > > > > > > > On Tue, Aug 25, 2015 at 10:46 AM, Abdel Hakim Deneche <
> > > > > > > > [email protected]
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > One more question about the transition from allocate ->
> > mutate.
> > > > For
> > > > > > > Fixed
> > > > > > > > > width vectors and BitVector you can actually call setSafe()
> > > > without
> > > > > > > > calling
> > > > > > > > > allocateNew() first and it will work. Should it throw an
> > > > exception
> > > > > > > > instead
> > > > > > > > > ?
> > > > > > > > > not calling allocateNew() has side effects that could cause
> > > > > setSafe()
> > > > > > > to
> > > > > > > > > throw an OversizedAllocationException if you call setSafe()
> > > then
> > > > > > > clear()
> > > > > > > > > multiple times.
> > > > > > > > >
> > > > > > > > > On Tue, Aug 25, 2015 at 10:01 AM, Chris Westin <
> > > > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Maybe we should start by putting these rules in a comment
> in
> > > the
> > > > > > value
> > > > > > > > >> vector base interfaces? The lack of such information is
> why
> > > > there
> > > > > > are
> > > > > > > > >> deviations and other expectations.
> > > > > > > > >>
> > > > > > > > >> On Tue, Aug 25, 2015 at 8:22 AM, Jacques Nadeau <
> > > > > [email protected]
> > > > > > >
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > There are a few unspoken "rules" around vectors:
> > > > > > > > >> >
> > > > > > > > >> > - values need to be written in order (e.g. index 0, 1,
> 2,
> > 5)
> > > > > > > > >> > - null vectors start with all values as null before
> > writing
> > > > > > anything
> > > > > > > > >> > - for variable width types, the offset vector should be
> > all
> > > > > zeros
> > > > > > > > before
> > > > > > > > >> > writing
> > > > > > > > >> > - you must call setValueCount before a vector can be
> read
> > > > > > > > >> > - you should never write to a vector once it has been
> > read.
> > > > > > > > >> >
> > > > > > > > >> > The ultimate goal we should get to the point where you
> the
> > > > > > > interfaces
> > > > > > > > >> > guarantee this order of operation:
> > > > > > > > >> >
> > > > > > > > >> > allocate > mutate > setvaluecount > access > clear (or
> > > > allocate
> > > > > to
> > > > > > > > start
> > > > > > > > >> > the process over, xxx).  Any deviation from this pattern
> > > > should
> > > > > > > result
> > > > > > > > >> in
> > > > > > > > >> > exception.  We should do this only in debug mode as this
> > > code
> > > > is
> > > > > > > > >> extremely
> > > > > > > > >> > performance sensitive.  Operations like transfer should
> be
> > > > built
> > > > > > on
> > > > > > > > top
> > > > > > > > >> of
> > > > > > > > >> > this state model.  (In that case, it would mean src
> moves
> > to
> > > > > clear
> > > > > > > > state
> > > > > > > > >> > and target moves to access state.  It also means that
> > > transfer
> > > > > > > should
> > > > > > > > >> only
> > > > > > > > >> > work in access state.)
> > > > > > > > >> >
> > > > > > > > >> > If we need special purpose data structures that don't
> > > operate
> > > > in
> > > > > > > these
> > > > > > > > >> > ways, we should make sure to keep them separate rather
> > than
> > > > > trying
> > > > > > > to
> > > > > > > > >> > accommodate a deviation from this pattern in the core
> > vector
> > > > > code.
> > > > > > > > >> >
> > > > > > > > >> > I wrote xxx above because I see the purpose of
> zeroVectors
> > > as
> > > > > > being
> > > > > > > a
> > > > > > > > >> reset
> > > > > > > > >> > on the vector state back to the original state.  Maybe
> we
> > > > should
> > > > > > > > >> actually
> > > > > > > > >> > call it 'reset' rather than 'zeroVectors'.  This would
> > > > basically
> > > > > > > pick
> > > > > > > > >> up at
> > > > > > > > >> > mutate mode again.
> > > > > > > > >> >
> > > > > > > > >> > Since these rules were never formalized, I'm sure there
> > are
> > > a
> > > > > few
> > > > > > > > places
> > > > > > > > >> > where we currently deviate.  We should enforce these
> rules
> > > and
> > > > > > then
> > > > > > > > get
> > > > > > > > >> > those issues fixed.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > --
> > > > > > > > >> > Jacques Nadeau
> > > > > > > > >> > CTO and Co-Founder, Dremio
> > > > > > > > >> >
> > > > > > > > >> > On Tue, Aug 25, 2015 at 8:02 AM, Abdel Hakim Deneche <
> > > > > > > > >> > [email protected]>
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Another important point to keep in mind here:
> > > > > > > > >> ValueVectorWriteExpression
> > > > > > > > >> > > operates under the "undocumented" assumption that the
> > > > > > destination
> > > > > > > > >> vector
> > > > > > > > >> > is
> > > > > > > > >> > > empty, this way it can safely skip writing null
> values.
> > In
> > > > the
> > > > > > > case
> > > > > > > > of
> > > > > > > > >> > > window functions I am using a value vector as an
> > internal
> > > > > buffer
> > > > > > > to
> > > > > > > > >> hold
> > > > > > > > >> > > values between batches which voids the assumption.
> > > > > > > > >> > > If this assumption is indeed correct then adding
> > > zeroVector
> > > > to
> > > > > > > value
> > > > > > > > >> > > vectors is indeed the way to go.
> > > > > > > > >> > >
> > > > > > > > >> > > On Mon, Aug 24, 2015 at 8:30 PM, Jacques Nadeau <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > In general, let's try to avoid extending the core
> > > > structures
> > > > > > > like
> > > > > > > > >> value
> > > > > > > > >> > > > vector read and write expressions for a single
> > operator.
> > > > > > > > Zerovector
> > > > > > > > >> is
> > > > > > > > >> > > > trivial to implement so let's resolve that way
> > (trivial
> > > > > since
> > > > > > > the
> > > > > > > > >> > > > underlying vector already has it and we just need to
> > > > > delegate
> > > > > > > > down).
> > > > > > > > >> > > > On Aug 24, 2015 3:36 PM, "Aman Sinha" <
> > > > [email protected]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > I am reviewing Hakim's patch for DRILL-3668
> > > (first_value
> > > > > > > window
> > > > > > > > >> > > function
> > > > > > > > >> > > > > incorrect result).  His code uses
> > > > > ValueVectorWriteExpression
> > > > > > > to
> > > > > > > > >> set
> > > > > > > > >> > > > values
> > > > > > > > >> > > > > in an internal batch which get re-used across
> > > different
> > > > > > > > >> partitions of
> > > > > > > > >> > > the
> > > > > > > > >> > > > > window function.  Ideally, we just want to zero
> out
> > > the
> > > > > > vector
> > > > > > > > >> rather
> > > > > > > > >> > > > than
> > > > > > > > >> > > > > calling clear() since clear() will release the
> > buffer.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > However, currently zeroVectors() is only supported
> > by
> > > > > > > > >> > FixedWidthVector,
> > > > > > > > >> > > > not
> > > > > > > > >> > > > > VariableWidthVector.  * Should there be such an
> > > > interface
> > > > > > for
> > > > > > > > >> > variable
> > > > > > > > >> > > > > width ? * The implementation could zero out just
> the
> > > > > offset
> > > > > > > > >> vector.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > In the absence of such an interface, Hakim has
> > added a
> > > > > > boolean
> > > > > > > > >> flag
> > > > > > > > >> > > > > witeNulls to ValueVectorWriteExpression (see
> > > > > > > > >> > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/adeneche/incubator-drill/commit/cab73cd1a50163dd25fe0f9c55c264780ea3616d
> > > > > > > > >> > > > > )
> > > > > > > > >> > > > >  and is conditionally doing the null-ing out in
> the
> > > > > > generated
> > > > > > > > >> code.
> > > > > > > > >> > It
> > > > > > > > >> > > > > won't affect the normal code path, it would get
> used
> > > for
> > > > > > > > specific
> > > > > > > > >> > > window
> > > > > > > > >> > > > > functions.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I am thinking of committing his patch and tracking
> > the
> > > > > > > > >> zeroVectors()
> > > > > > > > >> > > > > enhancement separately (if people agree that it
> > would
> > > be
> > > > > > > > useful).
> > > > > > > > >> > Let
> > > > > > > > >> > > me
> > > > > > > > >> > > > > know...
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Aman
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > --
> > > > > > > > >> > >
> > > > > > > > >> > > Abdelhakim Deneche
> > > > > > > > >> > >
> > > > > > > > >> > > Software Engineer
> > > > > > > > >> > >
> > > > > > > > >> > >   <http://www.mapr.com/>
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > Now Available - Free Hadoop On-Demand Training
> > > > > > > > >> > > <
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Abdelhakim Deneche
> > > > > > > > >
> > > > > > > > > Software Engineer
> > > > > > > > >
> > > > > > > > >   <http://www.mapr.com/>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Abdelhakim Deneche
> > > > > > > >
> > > > > > > > Software Engineer
> > > > > > > >
> > > > > > > >   <http://www.mapr.com/>
> > > > > > > >
> > > > > > > >
> > > > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > > > <
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Abdelhakim Deneche
> > > > > >
> > > > > > Software Engineer
> > > > > >
> > > > > >   <http://www.mapr.com/>
> > > > > >
> > > > > >
> > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > <
> > > > > >
> > > > >
> > > >
> > >
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>



-- 
Julien

Reply via email to