I don't think it is that simple since there are some types of things that we can't pushdown that will cause inconsistent results.
For example, assuming that all values of x are positive, the following two queries should return the same result select * from hbase where x = 5 select * from hbase where abs(x) = 5 However, if the field x is sometimes 'x' and sometimes 'X', we're going to different results between the first query and the second. That is why I think we need to guarantee that even when optimization rules fails, we have the same plan meaning. In essence, all plans should be valid. If you get to a place where a rule changes the data, then the original plan was effectively invalid. -- Jacques Nadeau CTO and Co-Founder, Dremio On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <[email protected]> wrote: > Project pushdown should always happen. If you see project pushdown > does not happen for your HBase query, then it's a bug. > > However, if you submit two physical plans, one with project pushdown, > another one without project pushdown, but they return different > results for HBase query. I'll not call this a bug. > > > > On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]> > wrote: > > Agree with Zelaine, plan changes/optimizations shouldn't change results. > > This is a bug. > > > > Drill is focused on being case-insensitive, case-preserving. Each storage > > plugin implements its own case sensitivity policy when working with > > columns/fields and should be documented. It isn't practical to make HBase > > case-insensitive so it should behave case sensitivity. DFS formats (as > > opposed to HBase) are entirely under Drill's control and thus target > > case-insensitive, case-preserving operation. > > > > -- > > Jacques Nadeau > > CTO and Co-Founder, Dremio > > > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]> > wrote: > > > >> Abhishek > >> > >> Great question. Here is what I understand regarding the case sensitive > >> policy. > >> > >> Drill's case sensitivity policy (case insensitive and case preserving) > >> applies to the execution engine in Drill; it does not enforce the case > >> sensitivity policy to all the storage plugin. A storage plugin could > >> decide and implement it's own policy. > >> > >> Why would the pushdown impact the case sensitivity when query HBase? > >> Without project pushdown, HBase storage plugin will return all the > >> data, and it's up to Drill's execution Project operator to apply the > >> case insensitive policy. With the project pushdown, Drill will pass > >> the list of column names to HBase storage plugin, and HBase decides to > >> apply it's case sensitivity policy when scan the data. > >> > >> Adding an option to make case sensitive storage plugin honor case > >> insensitive policy seems to be a good idea. The question is whether > >> the underneath storage (like HBase) will support such mode. > >> > >> > >> > >> > >> > >> > >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> > wrote: > >> > Abhishek, > >> > > >> > I guess you're arguing that Drill's current behavior of honoring the > case > >> > sensitive nature of the underlying data source (in this case, HBase > and > >> > MapR-DB) will be confusing for Drill users who are accustomed to > Drill's > >> > case insensitive behavior. > >> > > >> > I can see arguments both ways. > >> > > >> > But the part I think is confusing is that the behavior differs > depending > >> on > >> > whether or not projections and filters are pushed down to the data > >> source. > >> > If the push down is done, then the behavior is case sensitive > >> > (corresponding to the data source). But if pushdown doesn't happen, > then > >> > the behavior is case insensitive. That difference seems inconsistent > and > >> > undesirable -- unless you argue that there are instances where you > would > >> > want one behavior vs the other. But it seems like that should be > >> > orthogonal and separate from whether pushdowns are applied. > >> > > >> > -- Zelaine > >> > > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]> > >> wrote: > >> > > >> >> Hello all, > >> >> > >> >> As I understand, Drill by design is case-insensitive, w.r.t column > names > >> >> within a table or file [1]. While this provides great flexibility and > >> works > >> >> well with many data-sources, there are issues when working with > >> >> case-sensitive data-sources such as HBase / MapR-DB. > >> >> > >> >> Consider the following JSON file: > >> >> > >> >> {"_id": "ID1", > >> >> *"Name"* : "ABC", > >> >> "Age" : "25", > >> >> "Phone" : null > >> >> } > >> >> {"_id": "ID2", > >> >> *"name"* : "PQR", > >> >> "Age" : "30", > >> >> "Phone" : "408-123-456" > >> >> } > >> >> {"_id": "ID3", > >> >> *"NAME"* : "XYZ", > >> >> "Phone" : "" > >> >> } > >> >> > >> >> Note that the case of the name field within the JSON file is of > >> mixed-case. > >> >> > >> >> From Drill, while querying the JSON file directly (or corresponding > >> content > >> >> in Parquet or Text formats), we get results which we as Drill users > have > >> >> come to expect: > >> >> > >> >> > select NAME from mfs.`/tmp/json/a.json`; > >> >> +-------+ > >> >> | NAME | > >> >> +-------+ > >> >> | ABC | > >> >> | PQR | > >> >> | XYZ | > >> >> +-------+ > >> >> > >> >> > >> >> However, while querying a case-sensitive datasource (*with pushdown > >> >> enabled*) > >> >> the following results are returned. The case provided in the query > text > >> is > >> >> honored and would determine the results. This could come as a *slight > >> >> surprise to certain Drill users* exploring/migrating to new Databases > >> >> (using new Storage / Format plugins within Drill) > >> >> > >> >> > select *Name* from mfs.`/tmp/json/a`; > >> >> +-------+ > >> >> | Name | > >> >> +-------+ > >> >> | ABC | > >> >> +-------+ > >> >> > >> >> > select *name* from mfs.`/tmp/json/a`; > >> >> +-------+ > >> >> | name | > >> >> +-------+ > >> >> | PQR | > >> >> +-------+ > >> >> > >> >> > select *NAME* from mfs.`/tmp/json/a`; > >> >> +-------+ > >> >> | NAME | > >> >> +-------+ > >> >> | XYZ | > >> >> +-------+ > >> >> > >> >> > >> >> > select *nAME* from mfs.`/tmp/json/a`; > >> >> +-------+ > >> >> | nAME | > >> >> +-------+ > >> >> +-------+ > >> >> No rows selected > >> >> > >> >> There is no easy way to get all matching rows (irrespective of the > case > >> of > >> >> the column name). In the above example, the first row matching the > >> provided > >> >> case is returned. > >> >> > >> >> > >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`; > >> >> +-------+--------+--------+ > >> >> | Name | name0 | NAME1 | > >> >> +-------+--------+--------+ > >> >> | ABC | ABC | ABC | > >> >> +-------+--------+--------+ > >> >> > >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`; > >> >> +-------+--------+--------+ > >> >> | NAME | Name0 | name1 | > >> >> +-------+--------+--------+ > >> >> | XYZ | XYZ | XYZ | > >> >> +-------+--------+--------+ > >> >> > >> >> > >> >> If Pushdown features are disabled, the behavior seen above would > indeed > >> >> match JSON files. However, this could come at a cost of not fully > >> utilizing > >> >> the power of the underlying data-source, and could lead to > performance > >> >> issues. > >> >> > >> >> *In-consistent Results can happen when:* > >> >> > >> >> (1) Dataset has mixed-cases for fields. Example seen above. While > this > >> >> might not be very common, the concerns are still valid*, *since > >> substantial > >> >> Drill users are exploring Drill for ETL cases where Data is not > >> completely > >> >> sanitized. > >> >> > >> >> (2) Data is consistent w.r.t case, but the query text has > non-matching > >> >> case. While some could term this as user error, it could still cause > >> issues > >> >> when users, applications or the underlying datasources change. > >> >> > >> >> In both the above cases, Drill would silently perform the query and > >> return > >> >> results which could be either *none, partial, complete/correct or > >> entirely* > >> >> *wrong*. > >> >> > >> >> Some specific questions: > >> >> > >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive > >> Data-sources. > >> >> *For users who prefer the flexibility, how can Drill ensure that the > >> >> underlying data-source can return case-insensitive results. > >> >> > >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY > >> support > >> >> case-sensitive behavior for data-sources. Users coming from > >> case-sensitive > >> >> databases might want results matching the provided case. Example > using > >> the > >> >> above data: > >> >> > >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`; > >> >> +------+-------+--------+--------+ > >> >> > >> >> | _id | Name | name | NAME | > >> >> +------+-------+--------+--------+ > >> >> | ID1 | ABC | null | null | > >> >> +------+-------+--------+--------+ > >> >> | ID2 | null | PQR | null | > >> >> +------+-------+--------+--------+ > >> >> | ID3 | null | null | XYZ | > >> >> +------+-------+--------+--------+ > >> >> > >> >> > >> >> (3) How does Drill currently work with *MongoDB*, which i guess is a > >> >> case-sensitive database? Have these issues ever been discussed > >> previously? > >> >> > >> >> > >> >> Thanks in advance. I'd appreciate any helpful response. > >> >> > >> >> Regards, > >> >> Abhishek > >> >> > >> >> > >> >> [1] > https://drill.apache.org/docs/lexical-structure/#case-sensitivity > >> >> > >> >
