If the function is strict (i.e. it produces a null result if any of the arguments are null), then the user needs to provide only one implementation: the one where all arguments are not null. Drill would check for null arguments before calling it.
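(As an illustrative sketch only: this strict behavior is what Drill's NULL_IF_NULL handling expresses today. The function name and logic below are hypothetical.)

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.IntHolder;

    // Strict UDF: only the all-arguments-non-null case is written.
    // NULL_IF_NULL tells Drill to return null, without calling eval(),
    // whenever any argument is null.
    @FunctionTemplate(name = "add_one_example", scope = FunctionScope.SIMPLE,
        nulls = NullHandling.NULL_IF_NULL)
    public class AddOneExample implements DrillSimpleFunc {
      @Param IntHolder in;   // guaranteed non-null inside eval()
      @Output IntHolder out;

      public void setup() { }

      public void eval() {
        out.value = in.value + 1;
      }
    }

(With nulls = NullHandling.NULL_IF_NULL, eval() is written only against non-null holders; the engine short-circuits a null result when any argument is null.)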
Most of the SQL built-in functions are strict. Aren't most user functions also?

On Wed, Mar 23, 2016 at 1:48 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:

> If the main complexity comes from UDFs, i.e., the user has to implement each permutation of nullability, I feel it might be possible to allow users to provide only one implementation (the nullable-input one), and have Drill change the function resolution logic and run-time code-gen logic to make that work. That is, treat required input as nullable input for this UDF. This would restrict the performance impact to queries involving this UDF, not to queries in general.

> Users also have a choice of trade-off when they implement a UDF: multiple implementations mean better performance; a single implementation means a performance penalty.

> On Wed, Mar 23, 2016 at 11:58 AM, Parth Chandra <par...@apache.org> wrote:

>> Hmm. I may not have expressed my thoughts clearly.

>> What I was suggesting was that 'non-null' data exists in all data sets. (I have at least two data sets from users with Drill in production (sorry, cannot share the data) that have required fields in Parquet files.) The fields may not be marked as such in the metadata, or the data source may not have any such metadata, but if we can identify the type as non-null, then we can (and should) take advantage of it.

>> If we are already taking advantage of it, then we should not make any changes without understanding the tradeoffs.

>> So in the spirit of understanding that, I'd like to ask two questions:
>> 1) Where specifically are you suggesting code complexity will decrease? You mentioned UDFs. Where else do you see that the code is more complex?
>> 2) Do we have any experiments to show whether columnar processing benefits from eliminating required fields?

>> On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau <jacq...@dremio.com> wrote:

>>> I agree that we should focus on real benefits versus theories.

>>> Reduction in code complexity is a real benefit. The performance benefit of having required types is theoretical. Dot drill files don't exist, so they should have little bearing on this conversation.

>>> We rarely generate required data. Most tools never generate it. The reason the question is about actual deployments is that they would be a real factor to counterbalance the drive for code simplification, rather than something theoretical. A theoretical future performance regression shouldn't stop code improvement. If it did, we wouldn't make any progress.

>>> What about your own internal benchmark tests? If removing required types doesn't impact them, doesn't that mean this hasn't been a point of focus?

>>> On Mar 22, 2016 8:36 PM, "Parth Chandra" <par...@apache.org> wrote:

>>>> I don't know if the main question is whether people have Parquet (or other) files which have required fields or not. With something like a dot drill file, a user can supply a schema or format for data that does not carry schema, and we can certainly use the same mechanism to indicate knowledge of nullability. The question is whether we can take advantage of knowing whether data is null or not to get better performance.

>>>> Any argument that applies to taking advantage of non-nullability at the batch level applies to taking advantage of non-nullability at the schema level.
>>>> I'm not entirely convinced that the reduction of code complexity ultimately leads to a performance gain. Sure, it improves maintainability, but what specific improvements are you thinking of that will increase performance?

>>>> If you can recommend some areas of improvement that become possible as a result of this change, then I would suggest we run some experiments before we make any change.

>>>> It is a capital mistake to theorize before one has data, etc...

>>>> A 15% performance drop is not something to be ignored, I would think.

>>>> Parth

>>>> On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <jacq...@dremio.com> wrote:

>>>>> Re Performance:

>>>>> I think the main question is what portion of people's data is actually marked as non-nullable in Parquet files. (We already treat json, avro, kudu, and hbase (except the row key) as nullable. We do treat csv as non-nullable (array), but I think those workloads start with a conversion to Parquet.) Early on, we typically benchmarked Drill using required fields in Parquet. At the time, we actually hacked the Pig code to get something to even generate this format. (I believe, to this day, Pig only generates nullable fields in Parquet.) After some time, we recognized that basically every tool was producing Parquet files that were nullable, and we ultimately moved the benchmark infrastructure to using nullable types to do a better job of representing real-world workloads.

>>>>> Based on my (fuzzy) recollection, working with nullable types had a 10-15% performance impact versus working on required types, so I think there is a performance impact, but I think the population of users who have non-nullable Parquet files is small. If I recall, I believe Impala also creates nullable Parquet files. Not sure what Spark does. I believe Hive has also made this change recently or is doing it (deprecating non-nulls in their internals).

>>>>> If we move forward with this, I would expect an initial decrease in performance when data is held as non-nullable, given we previously observed this. However, I believe the reduction in code complexity would lead us to improve other things more quickly. Which leads me to...

>>>>> Re: Why

>>>>> Drill suffers from code complexity. This hurts forward progress. One example is the fact that we have to generate all nullable permutations of functions. (For example, if a function has three arguments, we have to generate 8 separate functions to cover the combinations of argument nullabilities.) This leads to vastly more reliance on compile-time templating, which is a maintenance headache. Additionally, it makes the runtime code generation more complicated and error-prone. Testing is also more expensive because we now have twice as many paths to both validate and maintain. Realistically, we should try to move to more columnar algorithms, which would provide a bigger lift than this declared-schema nullability optimization. This is because many workloads have rare nulls, so we can actually optimize better at the batch level. Creating three code paths (nullable observed non-null, nullable observed null, and non-null) makes this substantially more complicated.
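(To make the batch-level idea concrete, here is an illustrative sketch in plain Java, with hypothetical names rather than actual Drill vector APIs: an operator checks an observed null count once per batch and picks a branch-free inner loop when the batch has no nulls.)

    // Sum one int column of a record batch. Validity is a per-value flag;
    // nullCount is observed once when the batch is created.
    static long sumColumn(int[] values, boolean[] isSet, int nullCount) {
      long sum = 0;
      if (nullCount == 0) {
        // Nullable but observed non-null: tight loop, no per-value check.
        for (int i = 0; i < values.length; i++) {
          sum += values[i];
        }
      } else {
        // Nullable and observed null: consult the validity flag per value.
        for (int i = 0; i < values.length; i++) {
          if (isSet[i]) {
            sum += values[i];
          }
        }
      }
      return sum;
    }

(A declared required type would add a third copy of each such inner loop, which is the complication referred to above.)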
>>>>> We want to invest in this area, but the code complexity of nullable versus required makes this task harder and less likely to happen. So, in essence, I'm arguing that this is a small short-term loss that leads to better code quality and ultimately faster performance.

>>>>> Do others have real-world observations on the frequency of required fields in Parquet files?

>>>>> thanks,
>>>>> Jacques

>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio

>>>>> On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <par...@apache.org> wrote:

>>>>>> I'm not entirely convinced that this would have no performance impact. Do we have any experiments?

>>>>>> On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <jacq...@dremio.com> wrote:

>>>>>>> My suggestion is we use explicit observation at the batch level. If there are no nulls, we can optimize this batch. This would ultimately improve over our current situation, where most Parquet and all JSON data is nullable, so we don't optimize. I'd estimate that the vast majority of Drill's workloads are marked nullable whether they are or not. So what we're really suggesting is deleting a bunch of code which is rarely in the execution path.

>>>>>>> On Mar 22, 2016 1:22 PM, "Aman Sinha" <amansi...@apache.org> wrote:

>>>>>>>> I was thinking about it more after sending the previous concerns. Agreed, this is an execution-side change... but some details need to be worked out. If the planner indicates to the executor that a column is non-nullable (e.g. a primary key), the run-time generated code is more efficient since it does not have to check the null bit. Are you thinking we would use the existing nullable vector and add some additional metadata (at a record batch level rather than record level) to indicate non-nullability?

>>>>>>>> On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <jacq...@dremio.com> wrote:

>>>>>>>>> Hey Aman, I believe both Steven and I were suggesting removal only from execution, not planning. It seems like your concerns are all related to planning; the real tradeoffs in execution seem nominal.

>>>>>>>>> On Mar 22, 2016 9:03 AM, "Aman Sinha" <amansi...@apache.org> wrote:

>>>>>>>>>> While it is true that there is code complexity due to the required type, what would we be trading off? Some important considerations:
>>>>>>>>>> - We don't currently have null-count statistics, which would need to be implemented for various data sources.
>>>>>>>>>> - Primary keys in RDBMS sources (or row keys in HBase) are always non-null, and although today we may not be doing optimizations to leverage that, one could easily add a rule that converts WHERE primary_key IS NULL to a FALSE filter.

>>>>>>>>>> On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <doshin...@commvault.com> wrote:

>>>>>>>>>>> Hi Jacques,
>>>>>>>>>>> Marginally related to this, I made a small change in PR-372 (DRILL-4184) to support variable widths for decimal quantities in Parquet. I found the (decimal) vectoring code very difficult to understand (probably because it's overly complex, but also because I'm new to Drill code in general), so I made a small, surgical change in my pull request to keep track of variable widths (lengths) and null booleans within the existing fixed-width decimal vectoring scheme. Can my changes be reviewed/accepted, and then we can discuss how to fix this properly long-term?

>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dave Oshinsky

>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Jacques Nadeau [mailto:jacq...@dremio.com]
>>>>>>>>>>> Sent: Monday, March 21, 2016 11:43 PM
>>>>>>>>>>> To: dev
>>>>>>>>>>> Subject: Re: [DISCUSS] Remove required type

>>>>>>>>>>> Definitely in support of this. The required type is a huge maintenance and code-complexity nightmare that provides little to no benefit. As you point out, we can do better performance optimizations through null-count observation, since most sources are nullable anyway.

>>>>>>>>>>> On Mar 21, 2016 7:41 PM, "Steven Phillips" <ste...@dremio.com> wrote:

>>>>>>>>>>>> I have been thinking about this for a while now, and I feel it would be a good idea to remove the Required vector types from Drill and use only the Nullable version of vectors. I think this will greatly simplify the code.

>>>>>>>>>>>> It will also simplify the creation of UDFs. As is, if a function has custom null handling (i.e. INTERNAL), the function has to be implemented separately for each permutation of nullability of the inputs. But if Drill data types are always nullable, this wouldn't be a problem.
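(For concreteness, a minimal sketch of such an INTERNAL-null-handling UDF, with a hypothetical name and semantics: this class is the all-nullable variant; today, sibling classes taking required holders must also be written for the other input permutations, while with only nullable types this single class would suffice.)

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.NullableIntHolder;

    // Custom null handling (INTERNAL): null inputs are treated as zero
    // instead of producing a null result. This class covers the
    // nullable/nullable permutation only; today, sibling classes taking
    // required IntHolder arguments are needed for the other permutations.
    @FunctionTemplate(name = "add_null_as_zero", scope = FunctionScope.SIMPLE,
        nulls = NullHandling.INTERNAL)
    public class AddNullAsZero implements DrillSimpleFunc {
      @Param NullableIntHolder left;
      @Param NullableIntHolder right;
      @Output NullableIntHolder out;

      public void setup() { }

      public void eval() {
        out.isSet = 1;
        out.value = (left.isSet == 1 ? left.value : 0)
                  + (right.isSet == 1 ? right.value : 0);
      }
    }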
>>>>>>>>>>>> I don't think there would be much impact on performance. In practice, I think the required type is used very rarely. And there are other ways we can optimize for the case when a column is known to have no nulls.

>>>>>>>>>>>> Thoughts?