Hi Chengxiang!

I would have to dig into this again, it was a while back. I think (vaguely) the reason why this worked is that in the end (at the root of the tree that forms the logical expression), a result of UNKNOWN is treated like FALSE.

Take, for example, a predicate like "WHERE t.a > 10 && t.b == 'pending'". If one boolean atom is UNKNOWN and the other is TRUE, the whole term becomes UNKNOWN and the row is filtered out (as if the predicate were false) - the result of the query contains no rows for which the predicate evaluates to UNKNOWN.
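Roughly, as a minimal sketch (plain Scala, not Flink code; UNKNOWN is modeled as None and all names are made up just for illustration), the evaluation behaves like this:

// Three-valued logic sketch: UNKNOWN = None. At the root of the predicate
// tree, only a definite TRUE keeps the row, so UNKNOWN acts like FALSE.
object ThreeValuedLogic {
  type TriBool = Option[Boolean] // Some(true), Some(false), or None (UNKNOWN)

  def and(l: TriBool, r: TriBool): TriBool = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None // any UNKNOWN operand without a FALSE
  }

  def gt(a: Option[Int], b: Int): TriBool = a.map(_ > b)             // NULL operand => UNKNOWN
  def strEq(a: Option[String], b: String): TriBool = a.map(_ == b)   // NULL operand => UNKNOWN

  case class Rec(a: Option[Int], b: Option[String]) // stand-in for a nullable row, not Flink's Row

  def main(args: Array[String]): Unit = {
    val rows = Seq(Rec(Some(42), Some("pending")), Rec(None, Some("pending")), Rec(Some(42), None))
    // WHERE t.a > 10 && t.b == 'pending': UNKNOWN at the root behaves like FALSE
    val kept = rows.filter(r => and(gt(r.a, 10), strEq(r.b, "pending")).contains(true))
    println(kept) // only Rec(Some(42),Some(pending)) survives
  }
}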
Stephan

On Tue, Dec 1, 2015 at 4:09 AM, Li, Chengxiang <chengxiang...@intel.com> wrote:

> Stephan,
> For the 3rd topic, you mentioned that "If the boolean expressions are monotonous (have no NOT), then the UNKNOWN value can be the same as FALSE". As UNKNOWN means it could be TRUE as well, is it proper to handle it just as FALSE?
>
> Aljoscha,
> I agree with you. A Table can only be transformed from a Tuple/case class DataSet now, and Tuples/case classes do not allow null field values, so reading files from the data source into a Row DataSet is necessary for NULL value handling.
>
> -----Original Message-----
> From: Aljoscha Krettek [mailto:aljos...@apache.org]
> Sent: Friday, November 27, 2015 6:41 PM
> To: dev@flink.apache.org
> Subject: Re: The null in Flink
>
> Oh, this is probably the Jira for what I mentioned:
> https://issues.apache.org/jira/browse/FLINK-2988
>
>
> On 27 Nov 2015, at 11:02, Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> > Hi,
> > just some information. The Table API code generator already has preliminary support for generating code that is NULL-aware. So, for example, if you have an expression like 1 + NULL, the result would also be NULL.
> >
> > I think one of the missing pieces is a way to get data that contains null values into the system. For example, right now the expected way to read CSV files is via tuples, and they don't support null values. I think we need a way to directly read CSV files into a Row DataSet (or Table).
> >
> > Cheers,
> > Aljoscha
> >
> >> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
> >>
> >> Hi!
> >>
> >> Thanks for the good discussion! Here are some thoughts from my side:
> >>
> >> 1)
> >> I would agree with Chengxiang that it helps to have as much NULL handling in the Table API as possible, since most SQL constructs will be permitted there as well.
> >>
> >> 2)
> >> A question that I have is whether we actually want to follow the SQL standard exactly. There is a lot of criticism of NULL in the SQL standard, and there have been many good proposals for more meaningful semantics (for example, differentiating between the meanings "value missing", "value unknown", "value not applicable", etc.).
> >>
> >> Going with the SQL way is easiest and makes adding SQL on top of the Table API much easier. Also, there is only one type of NULL, meaning that null values can be encoded efficiently in bitmaps. Furthermore, since Table API users have the power of a programming language at hand (rather than the limited set of SQL operators), they should be able to easily define their own constants for special meanings like "value not applicable".
> >>
> >> Just curious if anyone has experience with some of the other null-semantics proposals that have been around.
> >>
> >> 3)
> >> One comment concerning the three-valued logic for boolean expressions:
> >>
> >> A while back, I worked on a SQL engine, and we were able to avoid implementing three-valued logic with a trick. If I recall correctly, it was like this:
> >>
> >> If the boolean expressions are monotonous (have no NOT), then the UNKNOWN value can be the same as FALSE. So the query planner had to rewrite all expression trees to have no NOT, which means pushing the NOT down into the leaf comparison operations (for example, push NOT into == to become !=). These leaf comparison operators needed to be NULL-aware to return FALSE on comparisons with NULL.
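As an illustration of the quoted trick (not code from any real engine; the expression types and operator names are made up), the NOT pushdown plus NULL-aware leaf comparisons could look roughly like this:

// Push NOT down to the leaves (De Morgan + negated comparison operators),
// then evaluate leaves NULL-aware so any comparison with NULL yields FALSE.
sealed trait Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr
case class Not(e: Expr) extends Expr
case class Cmp(col: String, op: String, v: Any) extends Expr // op: "==", "!=", ">", "<="

def negateOp(op: String): String = op match {
  case "==" => "!="
  case "!=" => "=="
  case ">"  => "<="
  case "<=" => ">"
  case other => sys.error(s"unsupported operator: $other")
}

// Rewrite the tree so no NOT remains.
def pushNot(e: Expr): Expr = e match {
  case Not(And(l, r))     => Or(pushNot(Not(l)), pushNot(Not(r)))
  case Not(Or(l, r))      => And(pushNot(Not(l)), pushNot(Not(r)))
  case Not(Not(inner))    => pushNot(inner)
  case Not(Cmp(c, op, v)) => Cmp(c, negateOp(op), v)
  case And(l, r)          => And(pushNot(l), pushNot(r))
  case Or(l, r)           => Or(pushNot(l), pushNot(r))
  case leaf               => leaf
}

// NULL-aware leaves: a NULL (or missing) operand makes the comparison FALSE,
// so plain two-valued AND/OR is enough above the leaves.
def eval(e: Expr, row: Map[String, Any]): Boolean = e match {
  case And(l, r) => eval(l, row) && eval(r, row)
  case Or(l, r)  => eval(l, row) || eval(r, row)
  case Cmp(c, op, v) => row.get(c).filter(_ != null) match {
    case None => false
    case Some(x) => op match {
      case "==" => x == v
      case "!=" => x != v
      case ">"  => x.asInstanceOf[Int] > v.asInstanceOf[Int]
      case "<=" => x.asInstanceOf[Int] <= v.asInstanceOf[Int]
    }
  }
  case Not(_) => sys.error("NOT should have been pushed down before evaluation")
}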
> >>
> >> Greetings,
> >> Stephan
> >>
> >>
> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <chengxiang...@intel.com> wrote:
> >>
> >>> Thanks, Timo.
> >>> We may put the NULL-related function support into the SQL API, but scalar expressions and boolean expressions are already supported in the Table API, and without NULL value handling, queries with scalar or boolean expressions would fail when they encounter a NULL value.
> >>>
> >>> Thanks
> >>> Chengxiang
> >>>
> >>> -----Original Message-----
> >>> From: Timo Walther [mailto:twal...@apache.org]
> >>> Sent: Wednesday, November 25, 2015 7:33 PM
> >>> To: dev@flink.apache.org
> >>> Subject: Re: The null in Flink
> >>>
> >>> Hi Chengxiang,
> >>>
> >>> I totally agree that the Table API should fully support NULL values. The Table API is a logical API and therefore we should be as close to ANSI SQL as possible. Rows need to be nullable in the near future.
> >>>
> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound too much like SQL magic. I think all other SQL magic (DBMS-specific corner cases) should be handled by the SQL API on top of the Table API.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>> On 25.11.2015 11:31, Li, Chengxiang wrote:
> >>>> Hi,
> >>>> On this mailing list there have been some discussions about null value handling in Flink, and I saw several related JIRAs as well (like FLINK-2203, FLINK-2210), but unfortunately they got reverted due to immature design, with no further action since then. I would like to pick this topic up here, as it is quite an important part of data analysis and many features depend on it. Hopefully, through a plenary discussion, we can come up with an acceptable solution and move forward. Stephan has explained very clearly how and why Flink handles "Null values in the Programming Language APIs", so I mainly talk about the second part, "Null values in the high-level (logical) APIs".
> >>>>
> >>>> 1. Why should Flink support NULL value handling in the Table API?
> >>>> i. Data sources may miss column values in many cases; without NULL value handling in the Table API, users need to write an extra ETL step to handle missing values manually.
> >>>> ii. Some Table API operators generate NULL values on their own, like Outer Join/Cube/Rollup/Grouping Set, and so on. NULL value handling in the Table API is a prerequisite for these features.
> >>>>
> >>>> 2. The semantics of NULL value handling in the Table API.
> >>>> Fortunately, there are already mature DBMS standards we can follow for NULL value handling; I list several semantics of NULL value handling here. Note that this may not cover all cases, and the semantics may vary between DBMSs, so it should be totally open to discussion.
> >>>> i. NULL comparison. In ascending order, NULL is smaller than any other value, and NULL == NULL returns false.
> >>>> ii. NULL in a GroupBy key: all NULL values are grouped into a single group.
> >>>> iii. NULL in aggregate columns: ignore NULLs in the aggregation function.
> >>>> iv. NULL in Join keys on both sides: refer to #i, NULL == NULL returns false, so there is no output for NULL Join keys.
> >>>> v. NULL in scalar expressions: an expression containing NULL (e.g. 1 + NULL) returns NULL.
> >>>> vi. NULL in boolean expressions: add an extra result, UNKNOWN; more on the semantics of boolean expressions in reference #1.
> >>>> vii. More related function support, like COALESCE, NVL, NANVL, and so on.
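To make the intended behavior of points ii-iv in the list above concrete, here is a small sketch in plain Scala (no Flink APIs; NULL is modeled as None purely for illustration, and all names are hypothetical):

// Sketch of the proposed semantics: NULL keys group together, aggregates
// ignore NULL values, and NULL join keys never match (NULL == NULL is not a match).
object NullSemanticsSketch {
  case class Rec(key: Option[String], value: Option[Int])

  def main(args: Array[String]): Unit = {
    val left = Seq(Rec(Some("a"), Some(1)), Rec(None, Some(2)), Rec(None, None), Rec(Some("b"), Some(3)))

    // ii. GroupBy: all NULL keys fall into a single group.
    val groups = left.groupBy(_.key)
    println(groups.keySet) // Set(Some(a), None, Some(b))

    // iii. Aggregation ignores NULL values (SUM over non-null values only).
    val sums = groups.map { case (k, recs) => k -> recs.flatMap(_.value).sum }
    println(sums)

    // iv. Join: NULL keys produce no output, not even against other NULL keys.
    val right = Seq(Rec(Some("a"), Some(10)), Rec(None, Some(20)))
    val joined = for {
      l <- left; r <- right
      if l.key.isDefined && l.key == r.key // NULL keys never match
    } yield (l, r)
    println(joined.size) // 1: only the Some("a") keys join
  }
}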
> >>>>
> >>>> 3. NULL value storage in the Table API.
> >>>> Just set the Row field value to null. To mark NULL values in serialized binary record data, one normally uses an extra flag per field to mark whether its value is NULL, which would change the data layout of the Row object. So any logic that accesses serialized Row data directly would have to be updated to match the new data layout, for example many methods in RowComparator.
> >>>>
> >>>> References:
> >>>> 1. Nulls: Nothing to worry about: http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html
> >>>> 2. Null-related functions: https://oracle-base.com/articles/misc/null-related-functions
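As a rough, non-authoritative sketch of the per-field null flag idea from point 3 above (this is not Flink's actual Row serialization format; the layout and names are made up for illustration):

import java.io.{ByteArrayOutputStream, DataOutputStream}

// Illustrative null-mask layout: one leading mask with a bit per NULL field,
// followed by the values of the non-null fields only.
object NullMaskSketch {
  def serialize(fields: Array[Integer]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    var mask = 0
    for (i <- fields.indices if fields(i) == null) mask |= (1 << i) // record which fields are NULL
    out.writeInt(mask)                                              // the mask changes the data layout
    for (f <- fields if f != null) out.writeInt(f.intValue())       // only non-null values are written
    out.flush()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Field 1 is NULL; the mask tells any code working directly on the binary
    // layout (e.g. a comparator) to skip it.
    println(serialize(Array[Integer](7, null, 42)).length) // 4 (mask) + 2 * 4 (values) = 12 bytes
  }
}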
> >>>>
> >>>> -----Original Message-----
> >>>> From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf Of Stephan Ewen
> >>>> Sent: Thursday, June 18, 2015 8:43 AM
> >>>> To: dev@flink.apache.org
> >>>> Subject: Re: The null in Flink
> >>>>
> >>>> Hi!
> >>>>
> >>>> I think we actually have two discussions here, both of them important:
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 1) Null values in the Programming Language APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> Fields in composite types may simply be null pointers.
> >>>>
> >>>> In object types:
> >>>> - primitive members are naturally non-nullable
> >>>> - all other members are nullable
> >>>>
> >>>> => If you want to avoid the overhead of nullability, go with primitive types.
> >>>>
> >>>> In Tuples and derived types (Scala case classes):
> >>>> - Fields are non-nullable.
> >>>>
> >>>> => The reason here is that we initially decided to keep tuples as a very fast data type. Because tuples cannot hold primitives in Java/Scala, we would not have a way to make fast non-nullable fields. The performance of nullable fields affects the key operations, especially on normalized keys.
> >>>> We can work around that with some effort, but have not done it so far.
> >>>>
> >>>> => In Scala, the Option type is a natural way of elegantly working around that.
> >>>>
> >>>>
> >>>> --------------------------------------------------------------
> >>>> 2) Null values in the high-level (logical) APIs
> >>>> --------------------------------------------------------------
> >>>>
> >>>> This is mainly what Ted was referring to, if I understood him correctly.
> >>>>
> >>>> Here, we need to figure out what form of null value semantics we want in the Table API and, later, in SQL.
> >>>>
> >>>> Besides deciding what semantics to follow in the logical APIs, we need to decide what these values convert to/from when switching between logical/physical APIs.
> >>>>
> >>>>
> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>
> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <m...@apache.org> wrote:
> >>>>>
> >>>>>> Just to give an idea of what null values could cause in Flink: DataSet.count() returns the number of all elements in a DataSet (null or not), while #834 would ignore null values and aggregate the DataSet without them.
> >>>>>
> >>>>> Compare R's na.action.
> >>>>>
> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
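For completeness, a minimal plain-Scala illustration of the distinction Max describes above (this is not the DataSet API, and the actual behavior proposed in #834 may differ):

// A raw count includes null elements; a null-ignoring aggregate drops them first.
object CountVsIgnoreNulls {
  def main(args: Array[String]): Unit = {
    val data: Seq[Integer] = Seq(1, null, 3, null)
    println(data.size)             // 4: counts every element, null or not
    println(data.count(_ != null)) // 2: aggregate that ignores null values
  }
}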