Jacques,

Doesn't Drill detect the type of each column within each batch? If so, does it (or could it) also detect that a particular column is not null (again, within the batch)?

You may not generate not-null data, but a lot of data is not-null. Let's not be too hasty to dismiss this as a theoretical argument. If there are performance benefits to treating columns as not-null, it might be worthwhile deducing that they are not-null.

Julian

On Wed, Mar 23, 2016 at 8:36 AM, Jacques Nadeau <[email protected]> wrote:

> I agree that we should focus on real benefits versus theories.
>
> Reduction in code complexity is a real benefit. Performance benefits from having required types are theoretical. Dot drill files don't exist, so they should have little bearing on this conversation.
>
> We rarely generate required data. Most tools never generate it. The reason the question is about actual deployments is that it would be a real factor to counterbalance the drive for code simplification, rather than something theoretical. A theoretical future performance regression shouldn't stop code improvement. If it did, we wouldn't make any progress.
>
> What about your own internal benchmark tests? If removing required types doesn't impact them, doesn't that mean this hasn't been a point of focus?
>
> On Mar 22, 2016 8:36 PM, "Parth Chandra" <[email protected]> wrote:
>
>> I don't know if the main question is whether people have Parquet (or other) files which have required fields or not. With something like a dot drill file, a user can supply a schema or format for data that does not carry a schema, and we can certainly use the same to indicate knowledge of nullability. The question is whether we can take advantage of knowing whether data is null or not to get better performance.
>>
>> Any argument that applies to taking advantage of non-nullability at the batch level applies to taking advantage of non-nullability at the schema level.
>>
>> I'm not entirely convinced that the reduction of code complexity will ultimately lead to a performance gain. Sure, it improves maintainability, but what specific improvements are you thinking of that will increase performance?
>>
>> If you can recommend some areas of improvement that become possible as a result of this change, then I would suggest we run some experiments before we make any change.
>>
>> It is a capital mistake to theorize before one has data, etc...
>>
>> A 15% performance drop is not something to be ignored, I would think.
>>
>> Parth
>>
>> On Tue, Mar 22, 2016 at 5:40 PM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Re Performance:
>> >
>> > I think the main question is what portion of people's data is actually marked as non-nullable in Parquet files? (We already treat JSON, Avro, Kudu, and HBase (except the row key) as nullable. We do treat CSV as non-nullable (array), but I think these workloads start with a conversion to Parquet.) Early on, we typically benchmarked Drill using required fields in Parquet. At the time, we actually hacked the Pig code to get something to even generate this format. (I believe, to this day, Pig only generates nullable fields in Parquet.) After some time, we recognized that basically every tool was producing Parquet files that were nullable, and ultimately moved the benchmark infrastructure to using nullable types to do a better job of representing real-world workloads.
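
A minimal sketch of the batch-level detection Julian asks about above, in plain Java with hypothetical names rather than Drill's actual vector API: scan a batch's validity bitmap once, and only fall back to per-value null handling when nulls were actually observed.

    // Hypothetical illustration: decide per record batch whether a nullable
    // column can be processed on a "no nulls observed" fast path.
    public final class BatchNullObservation {

      // Count set bits in a validity bitmap (1 = value present, 0 = null).
      static int countNonNull(byte[] validityBitmap, int valueCount) {
        int nonNull = 0;
        for (int i = 0; i < valueCount; i++) {
          if ((validityBitmap[i >> 3] & (1 << (i & 7))) != 0) {
            nonNull++;
          }
        }
        return nonNull;
      }

      // A batch whose observed null count equals its value count can skip
      // per-value null checks entirely.
      static boolean canUseNoNullFastPath(byte[] validityBitmap, int valueCount) {
        return countNonNull(validityBitmap, valueCount) == valueCount;
      }
    }
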
>> > Based on my (fuzzy) recollection, working with nullable types had a 10-15% performance impact versus working on required types, so I think there is a performance impact, but I think the population of users who have non-nullable Parquet files is small. If I recall, I believe Impala also creates nullable Parquet files. Not sure what Spark does. I believe Hive has also made this change recently or is doing it (deprecating non-nulls in their internals).
>> >
>> > If we move forward with this, I would expect there initially would be a decrease in performance when data is held as non-nullable, given we previously observed this. However, I believe the reduction in code complexity would lead us to improve other things more quickly. Which leads me to...
>> >
>> > Re: Why
>> >
>> > Drill suffers from code complexity. This hurts forward progress. One example is the fact that we have to generate all nullable permutations of functions. (For example, if we have three arguments, we have to generate 8 separate functions to cover the combinations of argument nullability.) This leads to vastly more reliance on compile-time templating, which is a maintenance headache. Additionally, it makes the runtime code generation more complicated and error prone. Testing is also more expensive because we now have twice as many paths to both validate and maintain. Realistically, we should try to move to more columnar algorithms, which would provide a bigger lift than this declared-schema nullability optimization. This is because many workloads have rare nulls, so we can actually optimize better at the batch level. Creating three code paths (nullable observed non-null, nullable observed null, and non-null) makes this substantially more complicated. We want to invest in this area, but the code complexity of nullable versus required makes this task harder and less likely to happen. So, in essence, I'm arguing that it is a small short-term loss that leads to better code quality and ultimately faster performance.
>> >
>> > Do others have real-world observations on the frequency of required fields in Parquet files?
>> >
>> > thanks,
>> > Jacques
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>> >
>> > On Tue, Mar 22, 2016 at 3:08 PM, Parth Chandra <[email protected]> wrote:
>> >
>> > > I'm not entirely convinced that this would have no performance impact. Do we have any experiments?
>> > >
>> > > On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > My suggestion is we use explicit observation at the batch level. If there are no nulls, we can optimize this batch. This would ultimately improve over our current situation, where most Parquet and all JSON data is nullable, so we don't optimize. I'd estimate that the vast majority of Drill's workloads are marked nullable whether they are or not. So what we're really suggesting is deleting a bunch of code which is rarely in the execution path.
>> > > >
>> > > > On Mar 22, 2016 1:22 PM, "Aman Sinha" <[email protected]> wrote:
>> > > >
>> > > > > I was thinking about it more after sending the previous concerns. Agree, this is an execution-side change... but some details need to be worked out.
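
As a plain-Java illustration of the permutation problem described above (not Drill's generated code): with distinct required and nullable types, a two-argument function already needs one variant per nullability combination, and three arguments would need eight; with nullable-only inputs, a single variant suffices.

    // Illustration only: the same "add" logic repeated once per nullability
    // combination of its two inputs (null modelled here as a boxed Integer).
    final class AddVariants {

      // required + required
      static int addReqReq(int a, int b) { return a + b; }

      // required + nullable
      static Integer addReqNullable(int a, Integer b) { return b == null ? null : a + b; }

      // nullable + required
      static Integer addNullableReq(Integer a, int b) { return a == null ? null : a + b; }

      // nullable + nullable: the only variant needed if required types go away
      static Integer addNullableNullable(Integer a, Integer b) {
        return (a == null || b == null) ? null : a + b;
      }
    }
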
>> > > > > If the planner indicates to the executor that a column is non-nullable (e.g., a primary key), the run-time generated code is more efficient since it does not have to check the null bit. Are you thinking we would use the existing nullable vector and add some additional metadata (at a record batch level rather than a record level) to indicate non-nullability?
>> > > > >
>> > > > > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <[email protected]> wrote:
>> > > > >
>> > > > > > Hey Aman, I believe both Steven and I were suggesting removal only from execution, not planning. It seems like your concerns are all related to planning. It seems like the real tradeoffs in execution are nominal.
>> > > > > >
>> > > > > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <[email protected]> wrote:
>> > > > > >
>> > > > > > > While it is true that there is code complexity due to the required type, what would we be trading off? Some important considerations:
>> > > > > > > - We don't currently have null-count statistics, which would need to be implemented for various data sources.
>> > > > > > > - Primary keys in the RDBMS sources (or row keys in HBase) are always non-null, and although today we may not be doing optimizations to leverage that, one could easily add a rule that converts WHERE primary_key IS NULL to a FALSE filter.
>> > > > > > >
>> > > > > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hi Jacques,
>> > > > > > > > Marginally related to this, I made a small change in PR-372 (DRILL-4184) to support variable widths for decimal quantities in Parquet. I found the (decimal) vectoring code to be very difficult to understand (probably because it's overly complex, but also because I'm new to Drill code in general), so I made a small, surgical change in my pull request to support keeping track of variable widths (lengths) and null booleans within the existing fixed-width decimal vectoring scheme. Can my changes be reviewed/accepted, and then we discuss how to fix this properly long-term?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Dave Oshinsky
>> > > > > > > >
>> > > > > > > > -----Original Message-----
>> > > > > > > > From: Jacques Nadeau [mailto:[email protected]]
>> > > > > > > > Sent: Monday, March 21, 2016 11:43 PM
>> > > > > > > > To: dev
>> > > > > > > > Subject: Re: [DISCUSS] Remove required type
>> > > > > > > >
>> > > > > > > > Definitely in support of this. The required type is a huge maintenance and code complexity nightmare that provides little to no benefit. As you point out, we can do better performance optimizations through null-count observation since most sources are nullable anyway.
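
A rough sketch of the null-bit cost Aman raises, again plain Java with made-up names: the per-value validity test is what a declared non-nullable column avoids, and a batch-level "no nulls observed" flag could recover the same branch-free loop without a required type.

    // Illustration only: two inner loops for summing an int column.
    final class SumLoops {

      // Nullable column: every value pays for the validity-bit test.
      static long sumNullable(byte[] validity, int[] values, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
          if ((validity[i >> 3] & (1 << (i & 7))) != 0) {
            sum += values[i];
          }
        }
        return sum;
      }

      // Non-null, whether declared in the schema or observed for this batch:
      // the null check disappears from the hot loop.
      static long sumNoNulls(int[] values, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
          sum += values[i];
        }
        return sum;
      }

      // Batch-level dispatch standing in for schema-level "required" information.
      static long sum(byte[] validity, int[] values, int count, boolean batchHasNoNulls) {
        return batchHasNoNulls ? sumNoNulls(values, count)
                               : sumNullable(validity, values, count);
      }
    }
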
>> > > > > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > I have been thinking about this for a while now, and I feel it would be a good idea to remove the Required vector types from Drill and only use the Nullable version of vectors. I think this will greatly simplify the code.
>> > > > > > > > >
>> > > > > > > > > It will also simplify the creation of UDFs. As is, if a function has custom null handling (i.e. INTERNAL), the function has to be separately implemented for each permutation of nullability of the inputs. But if Drill data types are always nullable, this wouldn't be a problem.
>> > > > > > > > >
>> > > > > > > > > I don't think there would be much impact on performance. In practice, I think the required type is used very rarely. And there are other ways we can optimize for when a column is known to have no nulls.
>> > > > > > > > >
>> > > > > > > > > Thoughts?
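
For reference, a sketch of the UDF shape Steven describes, written against Drill's UDF annotations as commonly documented (exact names and attributes should be checked against the codebase): with INTERNAL null handling, a body like this would have to be duplicated for every required/nullable combination of the input holders, whereas nullable-only inputs need just this one.

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.NullableIntHolder;

    // Sketch only: an "add" UDF with explicit (INTERNAL) null handling over
    // nullable inputs; isSet == 0 marks a null value in the holder.
    @FunctionTemplate(name = "example_add",
        scope = FunctionTemplate.FunctionScope.SIMPLE,
        nulls = FunctionTemplate.NullHandling.INTERNAL)
    public class ExampleAdd implements DrillSimpleFunc {

      @Param NullableIntHolder left;
      @Param NullableIntHolder right;
      @Output NullableIntHolder out;

      public void setup() { }

      public void eval() {
        if (left.isSet == 0 || right.isSet == 0) {
          out.isSet = 0;                        // propagate null explicitly
        } else {
          out.isSet = 1;
          out.value = left.value + right.value;
        }
      }
    }
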
