Jacques,

Doesn't Drill detect the type of each column within each batch? If so, does it (or could it) also detect that a particular column is not null (again, within the batch)?

You may not generate not-null data, but a lot of data is not-null. Let's not be too hasty to dismiss this as a theoretical argument. If there are performance benefits to treating columns as not-null, it might be worthwhile deducing that they are not-null.

Julian

On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau <[email protected]> wrote:

> I agree that we should focus on real benefits versus theories.
>
> Reduction in code complexity is a real benefit. Performance benefits from having required types are theoretical. Dot drill files don't exist, so they should have little bearing on this conversation.
>
> We rarely generate required data. Most tools never generate it. The reason the question is about actual deployments is that it would be a real factor to counterbalance the drive for code simplification, rather than something theoretical. A theoretical future performance regression shouldn't stop code improvement. If it did, we wouldn't make any progress.
>
> What about your own internal benchmark tests? If removing required types doesn't impact them, doesn't that mean this hasn't been a point of focus?
>
> On Mar 22, 2016 8:36 PM, "Parth Chandra" <[email protected]> wrote:
>
>> I don't know if the main question is whether people have Parquet (or other) files which have required fields or not. With something like a dot drill file, a user can supply a schema or format for data that does not carry a schema, and we can certainly use the same to indicate knowledge of nullability. The question is whether we can take advantage of knowing whether data is null or not to get better performance.
>>
>> Any argument that applies to taking advantage of non-nullability at the batch level applies to taking advantage of non-nullability at the schema level.
>>
>> I'm not entirely convinced that the reduction of code complexity will ultimately lead to a performance gain. Sure, it improves maintainability, but what specific improvements are you thinking of that will increase performance?
>>
>> If you can recommend some areas of improvement that become possible as a result of this change, then I would suggest we run some experiments before we make any change.
>>
>> It is a capital mistake to theorize before one has data, etc...
>>
>> A 15% performance drop is not something to be ignored, I would think.
>>
>> Parth
>>
>> On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Re Performance:
>> >
>> > I think the main question is what portion of people's data is actually marked as non-nullable in Parquet files? (We already treat JSON, Avro, Kudu, and HBase (except the row key) as nullable. We do treat CSV as non-nullable (array), but I think these workloads start with a conversion to Parquet.) Early on, we typically benchmarked Drill using required fields in Parquet. At the time, we actually hacked the Pig code to get something to even generate this format. (I believe, to this day, Pig only generates nullable fields in Parquet.) After some time, we recognized that basically every tool was producing Parquet files that were nullable, and ultimately moved the benchmark infrastructure to using nullable types to do a better job of representing real-world workloads.
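
A minimal sketch of the batch-level detection Julian asks about above, in plain Java with hypothetical names rather than Drill's actual vector API: scan a batch's validity bitmap once, and only fall back to per-value null handling when nulls were actually observed.

    // Hypothetical illustration: decide per record batch whether a nullable
    // column can be processed on a "no nulls observed" fast path.
    public final class BatchNullObservation {

      // Count set bits in a validity bitmap (1 = value present, 0 = null).
      static int countNonNull(byte[] validityBitmap, int valueCount) {
        int nonNull = 0;
        for (int i = 0; i < valueCount; i++) {
          if ((validityBitmap[i >> 3] & (1 << (i & 7))) != 0) {
            nonNull++;
          }
        }
        return nonNull;
      }

      // A batch whose observed null count equals its value count can skip
      // per-value null checks entirely.
      static boolean canUseNoNullFastPath(byte[] validityBitmap, int valueCount) {
        return countNonNull(validityBitmap, valueCount) == valueCount;
      }
    }
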
>> > Based on my (fuzzy) recollection, working with nullable types had a 10-15% performance impact versus working on required types, so I think there is a performance impact, but I think the population of users who have non-nullable Parquet files is small. If I recall, I believe Impala also creates nullable Parquet files. Not sure what Spark does. I believe Hive has also made this change recently or is doing it (deprecating non-nulls in their internals).
>> >
>> > If we move forward with this, I would expect there initially would be a decrease in performance when data is held as non-nullable, given we previously observed this. However, I believe the reduction in code complexity would lead us to improve other things more quickly. Which leads me to...
>> >
>> > Re: Why
>> >
>> > Drill suffers from code complexity. This hurts forward progress. One example is the fact that we have to generate all nullable permutations of functions. (For example, if we have three arguments, we have to generate 8 separate functions to cover the combinations of argument nullability.) This leads to vastly more reliance on compile-time templating, which is a maintenance headache. Additionally, it makes the runtime code generation more complicated and error prone. Testing is also more expensive because we now have twice as many paths to both validate and maintain. Realistically, we should try to move to more columnar algorithms, which would provide a bigger lift than this declared-schema nullability optimization. This is because many workloads have rare nulls, so we can actually optimize better at the batch level. Creating three code paths (nullable observed non-null, nullable observed null, and non-null) makes this substantially more complicated. We want to invest in this area, but the code complexity of nullable versus required makes this task harder and less likely to happen. So, in essence, I'm arguing that it is a small short-term loss that leads to better code quality and ultimately faster performance.
>> >
>> > Do others have real-world observations on the frequency of required fields in Parquet files?
>> >
>> > thanks,
>> > Jacques
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>> >
>> > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <[email protected]> wrote:
>> >
>> > > I'm not entirely convinced that this would have no performance impact. Do we have any experiments?
>> > >
>> > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > My suggestion is we use explicit observation at the batch level. If there are no nulls, we can optimize this batch. This would ultimately improve over our current situation, where most Parquet and all JSON data is nullable, so we don't optimize. I'd estimate that the vast majority of Drill's workloads are marked nullable whether they are or not. So what we're really suggesting is deleting a bunch of code which is rarely in the execution path.
>> > > >
>> > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" <[email protected]> wrote:
>> > > >
>> > > > > I was thinking about it more after sending the previous concerns. Agree, this is an execution-side change... but some details need to be worked out.
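
As a plain-Java illustration of the permutation problem described above (not Drill's generated code): with distinct required and nullable types, a two-argument function already needs one variant per nullability combination, and three arguments would need eight; with nullable-only inputs, a single variant suffices.

    // Illustration only: the same "add" logic repeated once per nullability
    // combination of its two inputs (null modelled here as a boxed Integer).
    final class AddVariants {

      // required + required
      static int addReqReq(int a, int b) { return a + b; }

      // required + nullable
      static Integer addReqNullable(int a, Integer b) { return b == null ? null : a + b; }

      // nullable + required
      static Integer addNullableReq(Integer a, int b) { return a == null ? null : a + b; }

      // nullable + nullable: the only variant needed if required types go away
      static Integer addNullableNullable(Integer a, Integer b) {
        return (a == null || b == null) ? null : a + b;
      }
    }
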
>> > > > > If the planner indicates to the executor that a column is non-nullable (e.g., a primary key), the run-time generated code is more efficient since it does not have to check the null bit. Are you thinking we would use the existing nullable vector and add some additional metadata (at a record batch level rather than a record level) to indicate non-nullability?
>> > > > >
>> > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <[email protected]> wrote:
>> > > > >
>> > > > > > Hey Aman, I believe both Steven and I were suggesting removal only from execution, not planning. It seems like your concerns are all related to planning. It seems like the real tradeoffs in execution are nominal.
>> > > > > >
>> > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <[email protected]> wrote:
>> > > > > >
>> > > > > > > While it is true that there is code complexity due to the required type, what would we be trading off? Some important considerations:
>> > > > > > > - We don't currently have null-count statistics, which would need to be implemented for various data sources.
>> > > > > > > - Primary keys in the RDBMS sources (or row keys in HBase) are always non-null, and although today we may not be doing optimizations to leverage that, one could easily add a rule that converts WHERE primary_key IS NULL to a FALSE filter.
>> > > > > > >
>> > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hi Jacques,
>> > > > > > > > Marginally related to this, I made a small change in PR-372 (DRILL-4184) to support variable widths for decimal quantities in Parquet. I found the (decimal) vectoring code to be very difficult to understand (probably because it's overly complex, but also because I'm new to Drill code in general), so I made a small, surgical change in my pull request to support keeping track of variable widths (lengths) and null booleans within the existing fixed-width decimal vectoring scheme. Can my changes be reviewed/accepted, and then we discuss how to fix this properly long-term?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Dave Oshinsky
>> > > > > > > >
>> > > > > > > > -----Original Message-----
>> > > > > > > > From: Jacques Nadeau [mailto:[email protected]]
>> > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
>> > > > > > > > To: dev
>> > > > > > > > Subject: Re: [DISCUSS] Remove required type
>> > > > > > > >
>> > > > > > > > Definitely in support of this. The required type is a huge maintenance and code complexity nightmare that provides little to no benefit. As you point out, we can do better performance optimizations through null-count observation since most sources are nullable anyway.
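
A rough sketch of the null-bit cost Aman raises, again plain Java with made-up names: the per-value validity test is what a declared non-nullable column avoids, and a batch-level "no nulls observed" flag could recover the same branch-free loop without a required type.

    // Illustration only: two inner loops for summing an int column.
    final class SumLoops {

      // Nullable column: every value pays for the validity-bit test.
      static long sumNullable(byte[] validity, int[] values, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
          if ((validity[i >> 3] & (1 << (i & 7))) != 0) {
            sum += values[i];
          }
        }
        return sum;
      }

      // Non-null, whether declared in the schema or observed for this batch:
      // the null check disappears from the hot loop.
      static long sumNoNulls(int[] values, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
          sum += values[i];
        }
        return sum;
      }

      // Batch-level dispatch standing in for schema-level "required" information.
      static long sum(byte[] validity, int[] values, int count, boolean batchHasNoNulls) {
        return batchHasNoNulls ? sumNoNulls(values, count)
                               : sumNullable(validity, values, count);
      }
    }
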
>> > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > I have been thinking about this for a while now, and I feel it would be a good idea to remove the Required vector types from Drill and only use the Nullable version of vectors. I think this will greatly simplify the code.
>> > > > > > > > >
>> > > > > > > > > It will also simplify the creation of UDFs. As is, if a function has custom null handling (i.e. INTERNAL), the function has to be separately implemented for each permutation of nullability of the inputs. But if Drill data types are always nullable, this wouldn't be a problem.
>> > > > > > > > >
>> > > > > > > > > I don't think there would be much impact on performance. In practice, I think the required type is used very rarely. And there are other ways we can optimize for when a column is known to have no nulls.
>> > > > > > > > >
>> > > > > > > > > Thoughts?
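
For reference, a sketch of the UDF shape Steven describes, written against Drill's UDF annotations as commonly documented (exact names and attributes should be checked against the codebase): with INTERNAL null handling, a body like this would have to be duplicated for every required/nullable combination of the input holders, whereas nullable-only inputs need just this one.

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.NullableIntHolder;

    // Sketch only: an "add" UDF with explicit (INTERNAL) null handling over
    // nullable inputs; isSet == 0 marks a null value in the holder.
    @FunctionTemplate(name = "example_add",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.INTERNAL)
    public class ExampleAdd implements DrillSimpleFunc {

      @Param NullableIntHolder left;
      @Param NullableIntHolder right;
      @Output NullableIntHolder out;

      public void setup() { }

      public void eval() {
        if (left.isSet == 0 || right.isSet == 0) {
          out.isSet = 0;                        // propagate null explicitly
        } else {
          out.isSet = 1;
          out.value = left.value + right.value;
        }
      }
    }
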
