While it is true that the required type adds code complexity, what would we be trading off by removing it? Some important considerations:

- We don't currently have null count statistics; these would need to be implemented for the various data sources.
- Primary keys in the RDBMS sources (or row keys in HBase) are always non-null. Although we may not be doing optimizations today that leverage that, one could easily add a rule that converts WHERE primary_key IS NULL to a FALSE filter.
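
To make the primary-key point concrete, here is a minimal sketch of what such a rewrite rule could look like. It is only an illustration under stated assumptions: the Filter, IsNull, and Constant types and the simplify method are hypothetical stand-ins, not Drill or Calcite classes, and the set of non-null columns is assumed to come from source metadata.

```java
import java.util.Set;

// Hypothetical, self-contained sketch; none of these types exist in Drill or Calcite.
public class IsNullSimplifier {

    /** Minimal stand-in for a filter expression. */
    interface Filter {}

    /** Represents the predicate "column IS NULL". */
    static final class IsNull implements Filter {
        final String column;

        IsNull(String column) {
            this.column = column;
        }
    }

    /** Represents a constant boolean filter; FALSE matches no rows. */
    static final class Constant implements Filter {
        final boolean value;

        Constant(boolean value) {
            this.value = value;
        }
    }

    /**
     * Rewrites "col IS NULL" to FALSE when col is known to be non-null
     * (for example, a primary key). Any other filter is returned unchanged.
     * The nonNullColumns set is an assumption: it stands in for whatever
     * metadata the source would expose.
     */
    static Filter simplify(Filter filter, Set<String> nonNullColumns) {
        if (filter instanceof IsNull) {
            IsNull isNull = (IsNull) filter;
            if (nonNullColumns.contains(isNull.column)) {
                return new Constant(false);
            }
        }
        return filter;
    }
}
```

The sketch relies on the planner knowing that a column is declared non-null, which is the information the required type carries today.
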
On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <[email protected]> wrote:

> Hi Jacques,
> Marginally related to this, I made a small change in PR-372 (DRILL-4184)
> to support variable widths for decimal quantities in Parquet. I found the
> (decimal) vectoring code to be very difficult to understand (probably
> because it's overly complex, but also because I'm new to Drill code in
> general), so I made a small, surgical change in my pull request to support
> keeping track of variable widths (lengths) and null booleans within the
> existing fixed width decimal vectoring scheme. Can my changes be
> reviewed/accepted, and then we discuss how to fix properly long-term?
>
> Thanks,
> Dave Oshinsky
>
> -----Original Message-----
> From: Jacques Nadeau [mailto:[email protected]]
> Sent: Monday, March 21, 2016 11:43 PM
> To: dev
> Subject: Re: [DISCUSS] Remove required type
>
> Definitely in support of this. The required type is a huge maintenance and
> code complexity nightmare that provides little to no benefit. As you point
> out, we can do better performance optimizations through null count
> observation since most sources are nullable anyway.
> On Mar 21, 2016 7:41 PM, "Steven Phillips" <[email protected]> wrote:
>
> > I have been thinking about this for a while now, and I feel it would
> > be a good idea to remove the Required vector types from Drill, and
> > only use the Nullable version of vectors. I think this will greatly
> > simplify the code.
> > It will also simplify the creation of UDFs. As is, if a function has
> > custom null handling (i.e. INTERNAL), the function has to be
> > separately implemented for each permutation of nullability of the
> > inputs. But if Drill data types are always nullable, this wouldn't be a
> > problem.
> >
> > I don't think there would be much impact on performance. In practice,
> > I think the required type is used very rarely. And there are other
> > ways we can optimize for when a column is known to have no nulls.
> >
> > Thoughts?
