There seems to be a lot of confusion on this thread.

We have a large amount of code that separates physical representations of
data that can possibly be null from data that can't be null. We have a
rigid concept in MajorType of whether data is nullable or required. If we
change from one to the other, that is a schema change inside of Drill (and
is treated much the same as changing from Integer to Map). As we compile
expression trees, we constantly have to track whether or not each item is
nullable. We also don't cast between the two. So UDFs, vector classes, code
generation, schema management, and schema change handling are all much
more complicated because of this. I proposed this complexity initially,
but looking at the continued cost and nominal benefit, I think it was a
mistake.

The only time we use the "required" path is when the underlying data
guarantees that all of the data will be non-null. I believe that path is
rarely used, poorly tested, and provides only a small gain in performance
when it is used. In essence, it creates a permutation nightmare (just like
having too many minor types) with marginal benefit.
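
To make the permutation problem concrete, here is a minimal sketch of what
a single two-argument add function turns into when required and nullable
are distinct physical types. The holder classes below are hypothetical,
loosely modeled on Drill-style value holders rather than the actual
classes:

```java
// Hypothetical holders, loosely modeled on Drill-style value holders.
class IntHolder { int value; }
class NullableIntHolder { int isSet; int value; } // isSet == 0 means null

class AddFunctions {
  // With required and nullable as distinct physical types, a two-argument
  // function with custom null handling needs 2^2 = 4 bodies (a
  // three-argument function needs 8) to cover every combination.
  static void add(IntHolder a, IntHolder b, IntHolder out) {
    out.value = a.value + b.value;
  }

  static void add(NullableIntHolder a, IntHolder b, NullableIntHolder out) {
    out.isSet = a.isSet;
    out.value = a.value + b.value;
  }

  static void add(IntHolder a, NullableIntHolder b, NullableIntHolder out) {
    out.isSet = b.isSet;
    out.value = a.value + b.value;
  }

  // If everything were nullable, this single body would be the only one needed.
  static void add(NullableIntHolder a, NullableIntHolder b, NullableIntHolder out) {
    out.isSet = a.isSet & b.isSet;
    out.value = a.value + b.value;
  }
}
```

The templating, code generation and testing burden all scale with that 2^n
factor, which is the complexity this proposal removes.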

The proposal here is to correct that mistake.

**Separately**, Drill should take better advantage of observed not-null
data.

>> You may not generate not-null data, but a lot of data is not-null.

Yes! You are 100% correct. Drill often chews through large amounts of data
that is annotated as nullable but has no nulls. For example, we run
benchmarks on TPCH data. The TPCH dataset doesn't have nulls. However, we
store the data as nullable (to be consistent with how virtually all systems
generate the data). As such, *Drill uses the nullable path* for the
entirety of execution. This is a great opportunity for performance
improvements. However, it is orthogonal to whether we remove the code path
above, **since it doesn't use it**. Ultimately we should allow the
execution engine to decide the operation path **rather than having a
schema-level concept** that creates more code combinations and schema
changes.
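
As a rough illustration of what letting the execution engine decide could
look like, here is a minimal sketch that picks a per-batch path from an
observed null count. The vector layout (a values array plus a validity
bitmap) and every name here are assumptions for illustration, not Drill's
actual APIs:

```java
// Minimal sketch: decide the path per batch from observed nulls rather
// than from a schema-level required type. Layout and names are
// illustrative assumptions, not Drill's actual APIs.
final class BatchNullObservation {

  static int nullCount(long[] validity, int recordCount) {
    int nulls = 0;
    for (int i = 0; i < recordCount; i++) {
      if (((validity[i >> 6] >>> (i & 63)) & 1L) == 0L) {
        nulls++;
      }
    }
    return nulls;
  }

  static void addBatch(long[] aValid, int[] a, long[] bValid, int[] b,
                       long[] outValid, int[] out, int recordCount) {
    boolean noNulls = nullCount(aValid, recordCount) == 0
        && nullCount(bValid, recordCount) == 0;

    // Integer addition is safe to run unconditionally for every slot.
    for (int i = 0; i < recordCount; i++) {
      out[i] = a[i] + b[i];
    }

    if (noNulls) {
      // Observed no-null batch: skip validity reads and mark everything valid.
      java.util.Arrays.fill(outValid, -1L);
    } else {
      // Batch contains nulls: propagate validity 64 records at a time.
      for (int w = 0; w < outValid.length; w++) {
        outValid[w] = aValid[w] & bValid[w];
      }
    }
  }
}
```

The point is that the no-null path is chosen from what the batch actually
contains, so TPCH-style "nullable but never null" data still gets the fast
path even though its schema says nullable.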

My additional perspective is that keeping the cruft from the mistake above
makes the right thing (using observed nulls instead of annotated nulls)
substantially harder to implement, which reduces the likelihood that it
will be implemented.

With regard to columnar benefits for calculations (which, I again argue,
are actually orthogonal to the initial proposal), I put together an
ideal-condition test. In reality we have more indirection, and I'd
actually expect a larger benefit from moving to columnar null evaluation
than this test shows. (For example: (1) everybody still runs with bounds
checking, which introduces an additional check for each null bit, and (2)
we always read memory values in addition to null bits before inspecting
the null bits.) As you can see below, a columnar approach means that
performance varies little with nullability. Optimizing for the columnar
no-nulls case provides 5-6% additional performance, which seems like a
late optimization compared to where we should be focused: moving to
columnar execution.

Benchmark                                          Mode  Cnt     Score     Error  Units
ColumnarComparisons.a_plus_b_columnar              avgt  200  2059.743 ±   9.625  ns/op
ColumnarComparisons.a_plus_b_non_null              avgt  200  1934.380 ±  10.279  ns/op
ColumnarComparisons.a_plus_b_current_drill         avgt  200  6737.569 ± 396.452  ns/op
ColumnarComparisons.a_plus_b_plus_c_columnar       avgt  200  2565.702 ±  12.139  ns/op
ColumnarComparisons.a_plus_b_plus_c_non_null       avgt  200  2437.322 ±  12.875  ns/op
ColumnarComparisons.a_plus_b_plus_c_current_drill  avgt  200  9010.913 ± 475.392  ns/op

This comes out as:

columnar a+b 0.5ns/record
current a+b 1.7ns/record
no-null a+b 0.5ns/record
columnar a+b+c 0.6ns/record
current a+b+c 2.25ns/record
no-null a+b+c 0.6ns/record

relative differences:
columnar versus current (a+b) : 3.2x
columnar versus current (a+b+c) : 3.5x
columnar no-null versus columnar null evaluation (a+b) : 1.06x
columnar no-null versus columnar null evaluation (a+b+c) : 1.05x

Code here: https://gist.github.com/jacques-n/70fa5afdeadba28ea398
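
For readers who don't want to open the gist, here is a simplified sketch,
in the same spirit but not the gist code (the values-plus-validity-bitmap
layout is an assumption), of the three evaluation shapes being measured:

```java
// Simplified sketch of the three evaluation shapes measured above. Not the
// gist code; the layout (values array plus validity bitmap) is assumed.
final class NullEvaluationStyles {

  // "current_drill"-style: branch on each record's null bits before computing.
  static void addRowWise(long[] aValid, int[] a, long[] bValid, int[] b,
                         long[] outValid, int[] out, int n) {
    for (int i = 0; i < n; i++) {
      long mask = 1L << (i & 63);
      boolean bothSet = (aValid[i >> 6] & mask) != 0 && (bValid[i >> 6] & mask) != 0;
      if (bothSet) {
        out[i] = a[i] + b[i];
        outValid[i >> 6] |= mask;
      } else {
        outValid[i >> 6] &= ~mask;
      }
    }
  }

  // "columnar"-style: compute values unconditionally and AND the validity
  // bitmaps 64 records at a time; no per-record branch on nullability.
  static void addColumnar(long[] aValid, int[] a, long[] bValid, int[] b,
                          long[] outValid, int[] out, int n) {
    for (int i = 0; i < n; i++) {
      out[i] = a[i] + b[i];
    }
    for (int w = 0; w < outValid.length; w++) {
      outValid[w] = aValid[w] & bValid[w];
    }
  }

  // "non_null"-style: no validity bitmap at all.
  static void addNoNulls(int[] a, int[] b, int[] out, int n) {
    for (int i = 0; i < n; i++) {
      out[i] = a[i] + b[i];
    }
  }
}
```

The row-wise version pays a branch and two bitmap reads per record; the
columnar version touches the bitmaps once per 64 records, which is why its
cost barely depends on nullability.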





--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Mar 23, 2016 at 11:58 AM, Parth Chandra <par...@apache.org> wrote:

> Hmm. I may not have expressed my thoughts clearly.
> What I was suggesting was that 'non-null' data exists in all data sets. (I
> have at least two data sets from users with Drill in production (sorry,
> cannot share the data), that have required fields in parquet files). The
> fields may not be marked as such in the metadata, or the data source may
> not have any such metadata, but if we can identify the type as non-null,
> then we can (and should) take advantage of it.
> If we are already taking advantage of it, then we should not make any
> changes without understanding the tradeoffs.
> So in the spirit of understanding that, I'd like to ask two questions -
> 1) Where specifically are you suggesting code complexity will decrease? You
> mentioned UDFs. Where else do you see that the code is more complex?
> 2) Do we have any experiments to show whether columnar processing benefits
> from eliminating required fields?
>
>
>
> On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
>
> > I agree that we should focus on real benefits versus theories.
> >
> > Reduction in code complexity is a real benefit. Performance benefits from
> > having required types is theoretical. Dot drill files don't exist so they
> > should have little bearing on this conversation.
> >
> > We rarely generate required data. Most tools never generate it. The
> reason
> > the question is about actual deployments is that would be a real factor
> to
> > counterbalance the drive for code simplification rather than something
> > theoretical. A theoretical future performance regression shouldn't stop
> > code improvement. If it did, we wouldn't make any progress.
> >
> > What about your own internal benchmark tests? If removing required types
> > doesn't impact them, doesn't that mean this hasn't been a point of focus?
> > On Mar 22, 2016 8:36 PM, "Parth Chandra" <par...@apache.org> wrote:
> >
> > > I don't know if the main question is whether people have parquet (or
> > other
> > > ) files which have required fields or not. With something like a dot
> > drill
> > > file, a user can supply schema or format for data that does not carry
> > > schema, and we can certainly use the same to indicate knowledge of
> > > nullability. The question is whether we can take advantage of knowing
> > > whether data is null or not to get better performance.
> > >
> > > Any argument that applies to taking advantage of non-nullability at the
> > > batch level applies to taking advantage of non-nullability at the
> schema
> > > level.
> > >
> > > I'm not entirely convinced that the reduction of code complexity is
> > > ultimately leading to performance gain. Sure, it improves
> > maintainability,
> > > but what specific improvements are you thinking of that will increase
> > > performance?
> > >
> > > If you recommend some areas of improvement that become possible as a
> > result
> > > of this change, then I would suggest we run some experiments before we
> > make
> > > any change.
> > >
> > > It is a capital mistake to theorize before one has data, etc...
> > >
> > > A 15% performance drop is not something to be ignored, I would think.
> > >
> > > Parth
> > >
> > > On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <jacq...@dremio.com>
> > > wrote:
> > > >
> > > > Re Performance:
> > > >
> > > > I think the main question is what portion of people's data is
> actually
> > > > marked as non-nullable in Parquet files? (We already treat json,
> avro,
> > > > kudu, and hbase (except row key) as nullable. We do treat csv as
> > > > non-nullable (array) but I think these workloads start with
> conversion
> > to
> > > > Parquet.)  Early on, we typically benchmarked Drill using required
> > fields
> > > > in Parquet. At the time, we actually hacked the Pig code to get
> > something
> > > > to even generate this format. (I believe, to this day, Pig only
> > generates
> > > > nullable fields in Parquet.) After some time, we recognized that
> > > basically
> > > > every tool was producing Parquet files that were nullable and
> > ultimately
> > > > moved the benchmark infrastructure to using nullable types to do a
> > better
> > > > job of representing real-world workloads.
> > > >
> > > > Based on my (fuzzy) recollection, working with nullable types had a
> > > 10-15%
> > > > performance impact versus working on required types so I think there
> > is a
> > > > performance impact but I think the population of users who have
> > > > non-nullable Parquet files are small. If I recall, I believe Impala
> > also
> > > > creates nullable Parquet files. Not sure what Spark does. I believe
> > Hive
> > > > has also made this change recently or is doing it (deprecating
> > non-nulls
> > > in
> > > > their internals).
> > > >
> > > > If we move forward with this, I would expect there initially would
> be a
> > > > decrease in performance when data is held as non-nullable given we
> > > > previously observed this. However, I believe the reduction in code
> > > > complexity would lead us to improve other things more quickly. Which
> > > leads
> > > > me to...
> > > >
> > > > Re: Why
> > > >
> > > > Drill suffers from code complexity. This hurts forward progress. One
> > > > example is the fact that we have to generate all nullable
> permutations
> > of
> > > > functions. (For example, if we have three arguments, we have to
> > generate
> > > 8
> > > > separate functions to work with the combination of argument
> > > nullabilities).
> > > > This leads to vastly more reliance on compile-time templating which
> is
> > a
> > > > maintenance headache. Additionally, it makes the runtime code
> > generation
> > > > more complicated and error prone. Testing is also more expensive
> > because
> > > we
> > > > now have twice as many paths to both validate and maintain.
> > > Realistically,
> > > > we should try to move to more columnar algorithms, which would
> provide
> > a
> > > > bigger lift than this declared schema nullability optimization. This
> is
> > > > because many workloads have rare nulls so we can actually optimize
> > better
> > > > at the batch level. Creating three code paths (nullable observed
> > > non-null,
> > > nullable observed null and non-null) makes this substantially more
> > > > complicated. We want to invest in this area but the code complexity
> of
> > > > nullable versus required makes this task less likely to
> happen/harder.
> > > So,
> > > > in essence, I'm arguing that it is a small short-term loss that leads
> > to
> > > > better code quality and ultimately faster performance.
> > > >
> > > > Do others have real-world observations on the frequency of required
> > > fields
> > > > in Parquet files?
> > > >
> > > > thanks,
> > > > Jacques
> > > >
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <par...@apache.org>
> > > wrote:
> > > >
> > > > > I'm not entirely convinced that this would have no performance
> > impact.
> > > Do
> > > > > we have any experiments?
> > > > >
> > > > >
> > > > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <
> jacq...@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > My suggestion is we use explicit observation at the batch level.
> If
> > > there
> > > > > > are no nulls we can optimize this batch. This would ultimately
> > > improve
> > > > > over
> > > > > > our current situation where most parquet and all json data is
> > > nullable so
> > > > > > we don't optimize. I'd estimate that the vast majority of Drills
> > > > > workloads
> > > > > > are marked nullable whether they are or not. So what we're really
> > > > > > suggesting is deleting a bunch of code which is rarely in the
> > > execution
> > > > > > path.
> > > > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansi...@apache.org>
> > wrote:
> > > > > >
> > > > > > > I was thinking about it more after sending the previous
> concerns.
> > > > > Agree,
> > > > > > > this is an execution side change...but some details need to be
> > > worked
> > > > > > out.
> > > > > > > If the planner indicates to the executor that a column is
> > > non-nullable
> > > > > > (e.g
> > > > > > > a primary key),  the run-time generated code is more efficient
> > > since it
> > > > > > > does not have to check the null bit.  Are you thinking we would
> > use
> > > the
> > > > > > > existing nullable vector and add some additional metadata (at a
> > > record
> > > > > > > batch level rather than record level) to indicate
> > non-nullability ?
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <
> > > jacq...@dremio.com
> > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Aman, I believe both Steven and I were only suggesting
> > > removal
> > > > > > > > from execution, not planning. It seems like your concerns are
> > all
> > > > > > related
> > > > to planning. It seems like the real tradeoffs in execution
> are
> > > > > > nominal.
> > > > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansi...@apache.org>
> > > wrote:
> > > > > > > >
> > > > > > > > > While it is true that there is code complexity due to the
> > > required
> > > > > > > type,
> > > > > what would we be trading off? Some important
> > considerations:
> > > > > > > > >   - We don't currently have null count statistics which
> would
> > > need
> > > > > to
> > > > > > > be
> > > > > > > > > implemented for various data sources
> > > > > > > > >   - Primary keys in the RDBMS sources (or rowkeys in hbase)
> > are
> > > > > > always
> > > > > > > > > non-null, and although today we may not be doing
> > optimizations
> > > to
> > > > > > > > leverage
> > > > > > > > > that,  one could easily add a rule that converts  WHERE
> > > primary_key
> > > > > > IS
> > > > > > > > NULL
> > > > > > > > > to a FALSE filter.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <
> > > > > > > doshin...@commvault.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jacques,
> > > > > > > > > > Marginally related to this, I made a small change in
> PR-372
> > > > > > > > (DRILL-4184)
> > > > > > > > > > to support variable widths for decimal quantities in
> > Parquet.
> > >  I
> > > > > > > found
> > > > > > > > > the
> > > > > > > > > > (decimal) vectoring code to be very difficult to
> understand
> > > > > > (probably
> > > > > > > > > > because it's overly complex, but also because I'm new to
> > > Drill
> > > > > code
> > > > > > > in
> > > > > > > > > > general), so I made a small, surgical change in my pull
> > > request
> > > > > to
> > > > > > > > > support
> > > > > > > > > > keeping track of variable widths (lengths) and null
> > booleans
> > > > > within
> > > > > > > the
> > > > > > > > > > existing fixed width decimal vectoring scheme.  Can my
> > > changes be
> > > > > > > > > > reviewed/accepted, and then we discuss how to fix
> properly
> > > > > > long-term?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Dave Oshinsky
> > > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jacques Nadeau [mailto:jacq...@dremio.com]
> > > > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > > > > > > To: dev
> > > > > > > > > > Subject: Re: [DISCUSS] Remove required type
> > > > > > > > > >
> > > > > > > > > > Definitely in support of this. The required type is a
> huge
> > > > > > > maintenance
> > > > > > > > > and
> > > > > > > > > > code complexity nightmare that provides little to no
> > benefit.
> > > As
> > > > > > you
> > > > > > > > > point
> > > > > > > > > > out, we can do better performance optimizations though
> null
> > > count
> > > > > > > > > > observation since most sources are nullable anyway.
> > > > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <
> > > ste...@dremio.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I have been thinking about this for a while now, and I
> > feel
> > > it
> > > > > > > would
> > > > > > > > > > > be a good idea to remove the Required vector types from
> > > Drill,
> > > > > > and
> > > > > > > > > > > only use the Nullable version of vectors. I think this
> > will
> > > > > > greatly
> > > > > > > > > > simplify the code.
> > > > > > > > > > > It will also simplify the creation of UDFs. As is, if a
> > > > > function
> > > > > > > has
> > > > > > > > > > > custom null handling (i.e. INTERNAL), the function has
> to
> > > be
> > > > > > > > > > > separately implemented for each permutation of
> > nullability
> > > of
> > > > > the
> > > > > > > > > > > inputs. But if drill data types are always nullable,
> this
> > > > > > wouldn't
> > > > > > > > be a
> > > > > > > > > > problem.
> > > > > > > > > > >
> > > > > > > > > > > I don't think there would be much impact on
> performance.
> > In
> > > > > > > practice,
> > > > > > > > > > > I think the required type is used very rarely. And
> there
> > > are
> > > > > > other
> > > > > > > > > > > ways we can optimize for when a column is known to have
> > no
> > > > > nulls.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>
