Project pushdown should always happen. If you see project pushdown does not happen for your HBase query, then it's a bug.
However, if you submit two physical plans, one with project pushdown, another one without project pushdown, but they return different results for HBase query. I'll not call this a bug. On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]> wrote: > Agree with Zelaine, plan changes/optimizations shouldn't change results. > This is a bug. > > Drill is focused on being case-insensitive, case-preserving. Each storage > plugin implements its own case sensitivity policy when working with > columns/fields and should be documented. It isn't practical to make HBase > case-insensitive so it should behave case sensitivity. DFS formats (as > opposed to HBase) are entirely under Drill's control and thus target > case-insensitive, case-preserving operation. > > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]> wrote: > >> Abhishek >> >> Great question. Here is what I understand regarding the case sensitive >> policy. >> >> Drill's case sensitivity policy (case insensitive and case preserving) >> applies to the execution engine in Drill; it does not enforce the case >> sensitivity policy to all the storage plugin. A storage plugin could >> decide and implement it's own policy. >> >> Why would the pushdown impact the case sensitivity when query HBase? >> Without project pushdown, HBase storage plugin will return all the >> data, and it's up to Drill's execution Project operator to apply the >> case insensitive policy. With the project pushdown, Drill will pass >> the list of column names to HBase storage plugin, and HBase decides to >> apply it's case sensitivity policy when scan the data. >> >> Adding an option to make case sensitive storage plugin honor case >> insensitive policy seems to be a good idea. The question is whether >> the underneath storage (like HBase) will support such mode. >> >> >> >> >> >> >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> wrote: >> > Abhishek, >> > >> > I guess you're arguing that Drill's current behavior of honoring the case >> > sensitive nature of the underlying data source (in this case, HBase and >> > MapR-DB) will be confusing for Drill users who are accustomed to Drill's >> > case insensitive behavior. >> > >> > I can see arguments both ways. >> > >> > But the part I think is confusing is that the behavior differs depending >> on >> > whether or not projections and filters are pushed down to the data >> source. >> > If the push down is done, then the behavior is case sensitive >> > (corresponding to the data source). But if pushdown doesn't happen, then >> > the behavior is case insensitive. That difference seems inconsistent and >> > undesirable -- unless you argue that there are instances where you would >> > want one behavior vs the other. But it seems like that should be >> > orthogonal and separate from whether pushdowns are applied. >> > >> > -- Zelaine >> > >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]> >> wrote: >> > >> >> Hello all, >> >> >> >> As I understand, Drill by design is case-insensitive, w.r.t column names >> >> within a table or file [1]. While this provides great flexibility and >> works >> >> well with many data-sources, there are issues when working with >> >> case-sensitive data-sources such as HBase / MapR-DB. >> >> >> >> Consider the following JSON file: >> >> >> >> {"_id": "ID1", >> >> *"Name"* : "ABC", >> >> "Age" : "25", >> >> "Phone" : null >> >> } >> >> {"_id": "ID2", >> >> *"name"* : "PQR", >> >> "Age" : "30", >> >> "Phone" : "408-123-456" >> >> } >> >> {"_id": "ID3", >> >> *"NAME"* : "XYZ", >> >> "Phone" : "" >> >> } >> >> >> >> Note that the case of the name field within the JSON file is of >> mixed-case. >> >> >> >> From Drill, while querying the JSON file directly (or corresponding >> content >> >> in Parquet or Text formats), we get results which we as Drill users have >> >> come to expect: >> >> >> >> > select NAME from mfs.`/tmp/json/a.json`; >> >> +-------+ >> >> | NAME | >> >> +-------+ >> >> | ABC | >> >> | PQR | >> >> | XYZ | >> >> +-------+ >> >> >> >> >> >> However, while querying a case-sensitive datasource (*with pushdown >> >> enabled*) >> >> the following results are returned. The case provided in the query text >> is >> >> honored and would determine the results. This could come as a *slight >> >> surprise to certain Drill users* exploring/migrating to new Databases >> >> (using new Storage / Format plugins within Drill) >> >> >> >> > select *Name* from mfs.`/tmp/json/a`; >> >> +-------+ >> >> | Name | >> >> +-------+ >> >> | ABC | >> >> +-------+ >> >> >> >> > select *name* from mfs.`/tmp/json/a`; >> >> +-------+ >> >> | name | >> >> +-------+ >> >> | PQR | >> >> +-------+ >> >> >> >> > select *NAME* from mfs.`/tmp/json/a`; >> >> +-------+ >> >> | NAME | >> >> +-------+ >> >> | XYZ | >> >> +-------+ >> >> >> >> >> >> > select *nAME* from mfs.`/tmp/json/a`; >> >> +-------+ >> >> | nAME | >> >> +-------+ >> >> +-------+ >> >> No rows selected >> >> >> >> There is no easy way to get all matching rows (irrespective of the case >> of >> >> the column name). In the above example, the first row matching the >> provided >> >> case is returned. >> >> >> >> >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`; >> >> +-------+--------+--------+ >> >> | Name | name0 | NAME1 | >> >> +-------+--------+--------+ >> >> | ABC | ABC | ABC | >> >> +-------+--------+--------+ >> >> >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`; >> >> +-------+--------+--------+ >> >> | NAME | Name0 | name1 | >> >> +-------+--------+--------+ >> >> | XYZ | XYZ | XYZ | >> >> +-------+--------+--------+ >> >> >> >> >> >> If Pushdown features are disabled, the behavior seen above would indeed >> >> match JSON files. However, this could come at a cost of not fully >> utilizing >> >> the power of the underlying data-source, and could lead to performance >> >> issues. >> >> >> >> *In-consistent Results can happen when:* >> >> >> >> (1) Dataset has mixed-cases for fields. Example seen above. While this >> >> might not be very common, the concerns are still valid*, *since >> substantial >> >> Drill users are exploring Drill for ETL cases where Data is not >> completely >> >> sanitized. >> >> >> >> (2) Data is consistent w.r.t case, but the query text has non-matching >> >> case. While some could term this as user error, it could still cause >> issues >> >> when users, applications or the underlying datasources change. >> >> >> >> In both the above cases, Drill would silently perform the query and >> return >> >> results which could be either *none, partial, complete/correct or >> entirely* >> >> *wrong*. >> >> >> >> Some specific questions: >> >> >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive >> Data-sources. >> >> *For users who prefer the flexibility, how can Drill ensure that the >> >> underlying data-source can return case-insensitive results. >> >> >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY >> support >> >> case-sensitive behavior for data-sources. Users coming from >> case-sensitive >> >> databases might want results matching the provided case. Example using >> the >> >> above data: >> >> >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`; >> >> +------+-------+--------+--------+ >> >> >> >> | _id | Name | name | NAME | >> >> +------+-------+--------+--------+ >> >> | ID1 | ABC | null | null | >> >> +------+-------+--------+--------+ >> >> | ID2 | null | PQR | null | >> >> +------+-------+--------+--------+ >> >> | ID3 | null | null | XYZ | >> >> +------+-------+--------+--------+ >> >> >> >> >> >> (3) How does Drill currently work with *MongoDB*, which i guess is a >> >> case-sensitive database? Have these issues ever been discussed >> previously? >> >> >> >> >> >> Thanks in advance. I'd appreciate any helpful response. >> >> >> >> Regards, >> >> Abhishek >> >> >> >> >> >> [1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity >> >> >>
