Agree with Zelaine, plan changes/optimizations shouldn't change results. This is a bug.
Drill is focused on being case-insensitive, case-preserving. Each storage plugin implements its own case sensitivity policy when working with columns/fields and should be documented. It isn't practical to make HBase case-insensitive so it should behave case sensitivity. DFS formats (as opposed to HBase) are entirely under Drill's control and thus target case-insensitive, case-preserving operation. -- Jacques Nadeau CTO and Co-Founder, Dremio On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote: > Abhishek > > Great question. Here is what I understand regarding the case sensitive > policy. > > Drill's case sensitivity policy (case insensitive and case preserving) > applies to the execution engine in Drill; it does not enforce the case > sensitivity policy to all the storage plugin. A storage plugin could > decide and implement it's own policy. > > Why would the pushdown impact the case sensitivity when query HBase? > Without project pushdown, HBase storage plugin will return all the > data, and it's up to Drill's execution Project operator to apply the > case insensitive policy. With the project pushdown, Drill will pass > the list of column names to HBase storage plugin, and HBase decides to > apply it's case sensitivity policy when scan the data. > > Adding an option to make case sensitive storage plugin honor case > insensitive policy seems to be a good idea. The question is whether > the underneath storage (like HBase) will support such mode. > > > > > > > On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <zf...@maprtech.com> wrote: > > Abhishek, > > > > I guess you're arguing that Drill's current behavior of honoring the case > > sensitive nature of the underlying data source (in this case, HBase and > > MapR-DB) will be confusing for Drill users who are accustomed to Drill's > > case insensitive behavior. > > > > I can see arguments both ways. > > > > But the part I think is confusing is that the behavior differs depending > on > > whether or not projections and filters are pushed down to the data > source. > > If the push down is done, then the behavior is case sensitive > > (corresponding to the data source). But if pushdown doesn't happen, then > > the behavior is case insensitive. That difference seems inconsistent and > > undesirable -- unless you argue that there are instances where you would > > want one behavior vs the other. But it seems like that should be > > orthogonal and separate from whether pushdowns are applied. > > > > -- Zelaine > > > > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <agir...@mapr.com> > wrote: > > > >> Hello all, > >> > >> As I understand, Drill by design is case-insensitive, w.r.t column names > >> within a table or file [1]. While this provides great flexibility and > works > >> well with many data-sources, there are issues when working with > >> case-sensitive data-sources such as HBase / MapR-DB. > >> > >> Consider the following JSON file: > >> > >> {"_id": "ID1", > >> *"Name"* : "ABC", > >> "Age" : "25", > >> "Phone" : null > >> } > >> {"_id": "ID2", > >> *"name"* : "PQR", > >> "Age" : "30", > >> "Phone" : "408-123-456" > >> } > >> {"_id": "ID3", > >> *"NAME"* : "XYZ", > >> "Phone" : "" > >> } > >> > >> Note that the case of the name field within the JSON file is of > mixed-case. > >> > >> From Drill, while querying the JSON file directly (or corresponding > content > >> in Parquet or Text formats), we get results which we as Drill users have > >> come to expect: > >> > >> > select NAME from mfs.`/tmp/json/a.json`; > >> +-------+ > >> | NAME | > >> +-------+ > >> | ABC | > >> | PQR | > >> | XYZ | > >> +-------+ > >> > >> > >> However, while querying a case-sensitive datasource (*with pushdown > >> enabled*) > >> the following results are returned. The case provided in the query text > is > >> honored and would determine the results. This could come as a *slight > >> surprise to certain Drill users* exploring/migrating to new Databases > >> (using new Storage / Format plugins within Drill) > >> > >> > select *Name* from mfs.`/tmp/json/a`; > >> +-------+ > >> | Name | > >> +-------+ > >> | ABC | > >> +-------+ > >> > >> > select *name* from mfs.`/tmp/json/a`; > >> +-------+ > >> | name | > >> +-------+ > >> | PQR | > >> +-------+ > >> > >> > select *NAME* from mfs.`/tmp/json/a`; > >> +-------+ > >> | NAME | > >> +-------+ > >> | XYZ | > >> +-------+ > >> > >> > >> > select *nAME* from mfs.`/tmp/json/a`; > >> +-------+ > >> | nAME | > >> +-------+ > >> +-------+ > >> No rows selected > >> > >> There is no easy way to get all matching rows (irrespective of the case > of > >> the column name). In the above example, the first row matching the > provided > >> case is returned. > >> > >> > >> > select *Name, name, NAME* from mfs.`/tmp/json/a`; > >> +-------+--------+--------+ > >> | Name | name0 | NAME1 | > >> +-------+--------+--------+ > >> | ABC | ABC | ABC | > >> +-------+--------+--------+ > >> > >> > select *NAME, Name, name* from mfs.`/tmp/json/a`; > >> +-------+--------+--------+ > >> | NAME | Name0 | name1 | > >> +-------+--------+--------+ > >> | XYZ | XYZ | XYZ | > >> +-------+--------+--------+ > >> > >> > >> If Pushdown features are disabled, the behavior seen above would indeed > >> match JSON files. However, this could come at a cost of not fully > utilizing > >> the power of the underlying data-source, and could lead to performance > >> issues. > >> > >> *In-consistent Results can happen when:* > >> > >> (1) Dataset has mixed-cases for fields. Example seen above. While this > >> might not be very common, the concerns are still valid*, *since > substantial > >> Drill users are exploring Drill for ETL cases where Data is not > completely > >> sanitized. > >> > >> (2) Data is consistent w.r.t case, but the query text has non-matching > >> case. While some could term this as user error, it could still cause > issues > >> when users, applications or the underlying datasources change. > >> > >> In both the above cases, Drill would silently perform the query and > return > >> results which could be either *none, partial, complete/correct or > entirely* > >> *wrong*. > >> > >> Some specific questions: > >> > >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive > Data-sources. > >> *For users who prefer the flexibility, how can Drill ensure that the > >> underlying data-source can return case-insensitive results. > >> > >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY > support > >> case-sensitive behavior for data-sources. Users coming from > case-sensitive > >> databases might want results matching the provided case. Example using > the > >> above data: > >> > >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`; > >> +------+-------+--------+--------+ > >> > >> | _id | Name | name | NAME | > >> +------+-------+--------+--------+ > >> | ID1 | ABC | null | null | > >> +------+-------+--------+--------+ > >> | ID2 | null | PQR | null | > >> +------+-------+--------+--------+ > >> | ID3 | null | null | XYZ | > >> +------+-------+--------+--------+ > >> > >> > >> (3) How does Drill currently work with *MongoDB*, which i guess is a > >> case-sensitive database? Have these issues ever been discussed > previously? > >> > >> > >> Thanks in advance. I'd appreciate any helpful response. > >> > >> Regards, > >> Abhishek > >> > >> > >> [1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity > >> >