Re: Working with Case-Sensitive Data-sources

Jinfeng Ni Mon, 14 Mar 2016 15:47:10 -0700

Project pushdown should always happen. If you see project pushdown
does not happen for your HBase query, then it's a bug.


However, if you submit two physical plans, one with project pushdown,
another one without project pushdown, but they return different
results for HBase query. I'll not call this a bug.



On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]> wrote:
> Agree with Zelaine, plan changes/optimizations shouldn't change results.
> This is a bug.
>
> Drill is focused on being case-insensitive, case-preserving. Each storage
> plugin implements its own case sensitivity policy when working with
> columns/fields and should be documented. It isn't practical to make HBase
> case-insensitive so it should behave case sensitivity. DFS formats (as
> opposed to HBase) are entirely under Drill's control and thus target
> case-insensitive, case-preserving operation.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]> wrote:
>
>> Abhishek
>>
>> Great question. Here is what I understand regarding the case sensitive
>> policy.
>>
>> Drill's case sensitivity policy (case insensitive and case preserving)
>> applies to the execution engine in Drill; it does not enforce the case
>> sensitivity policy to all the storage plugin. A storage plugin could
>> decide and implement it's own policy.
>>
>> Why would the pushdown impact the case sensitivity when query HBase?
>> Without project pushdown, HBase storage plugin will return all the
>> data, and it's up to Drill's execution Project operator to apply the
>> case insensitive policy.  With the project pushdown, Drill will pass
>> the list of column names to HBase storage plugin, and HBase decides to
>> apply it's case sensitivity policy when scan the data.
>>
>> Adding an option to make case sensitive storage plugin honor case
>> insensitive policy seems to be a good idea. The question is whether
>> the underneath storage (like HBase) will support such mode.
>>
>>
>>
>>
>>
>>
>> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]> wrote:
>> > Abhishek,
>> >
>> > I guess you're arguing that Drill's current behavior of honoring the case
>> > sensitive nature of the underlying data source (in this case, HBase and
>> > MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
>> > case insensitive behavior.
>> >
>> > I can see arguments both ways.
>> >
>> > But the part I think is confusing is that the behavior differs depending
>> on
>> > whether or not projections and filters are pushed down to the data
>> source.
>> > If the push down is done, then the behavior is case sensitive
>> > (corresponding to the data source).  But if pushdown doesn't happen, then
>> > the behavior is case insensitive.  That difference seems inconsistent and
>> > undesirable -- unless you argue that there are instances where you would
>> > want one behavior vs the other.  But it seems like that should be
>> > orthogonal and separate from whether pushdowns are applied.
>> >
>> > -- Zelaine
>> >
>> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]>
>> wrote:
>> >
>> >> Hello all,
>> >>
>> >> As I understand, Drill by design is case-insensitive, w.r.t column names
>> >> within a table or file [1]. While this provides great flexibility and
>> works
>> >> well with many data-sources, there are issues when working with
>> >> case-sensitive data-sources such as HBase / MapR-DB.
>> >>
>> >> Consider the following JSON file:
>> >>
>> >> {"_id": "ID1",
>> >>  *"Name"* : "ABC",
>> >>  "Age" : "25",
>> >>  "Phone" : null
>> >> }
>> >> {"_id": "ID2",
>> >>  *"name"* : "PQR",
>> >>  "Age" : "30",
>> >>  "Phone" : "408-123-456"
>> >> }
>> >> {"_id": "ID3",
>> >>  *"NAME"* : "XYZ",
>> >>  "Phone" : ""
>> >> }
>> >>
>> >> Note that the case of the name field within the JSON file is of
>> mixed-case.
>> >>
>> >> From Drill, while querying the JSON file directly (or corresponding
>> content
>> >> in Parquet or Text formats), we get results which we as Drill users have
>> >> come to expect:
>> >>
>> >> > select NAME from mfs.`/tmp/json/a.json`;
>> >> +-------+
>> >> | NAME  |
>> >> +-------+
>> >> | ABC   |
>> >> | PQR   |
>> >> | XYZ   |
>> >> +-------+
>> >>
>> >>
>> >> However, while querying a case-sensitive datasource (*with pushdown
>> >> enabled*)
>> >> the following results are returned. The case provided in the query text
>> is
>> >> honored and would determine the results. This could come as a *slight
>> >> surprise to certain Drill users* exploring/migrating to new Databases
>> >> (using new Storage / Format plugins within Drill)
>> >>
>> >> > select *Name* from mfs.`/tmp/json/a`;
>> >> +-------+
>> >> | Name  |
>> >> +-------+
>> >> | ABC   |
>> >> +-------+
>> >>
>> >> > select *name* from mfs.`/tmp/json/a`;
>> >> +-------+
>> >> | name  |
>> >> +-------+
>> >> | PQR   |
>> >> +-------+
>> >>
>> >> > select *NAME* from mfs.`/tmp/json/a`;
>> >> +-------+
>> >> | NAME  |
>> >> +-------+
>> >> | XYZ   |
>> >> +-------+
>> >>
>> >>
>> >> > select *nAME* from mfs.`/tmp/json/a`;
>> >> +-------+
>> >> | nAME  |
>> >> +-------+
>> >> +-------+
>> >> No rows selected
>> >>
>> >> There is no easy way to get all matching rows (irrespective of the case
>> of
>> >> the column name). In the above example, the first row matching the
>> provided
>> >> case is returned.
>> >>
>> >>
>> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
>> >> +-------+--------+--------+
>> >> | Name  | name0  | NAME1  |
>> >> +-------+--------+--------+
>> >> | ABC   | ABC    | ABC    |
>> >> +-------+--------+--------+
>> >>
>> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
>> >> +-------+--------+--------+
>> >> | NAME  | Name0  | name1  |
>> >> +-------+--------+--------+
>> >> | XYZ   | XYZ    | XYZ    |
>> >> +-------+--------+--------+
>> >>
>> >>
>> >> If Pushdown features are disabled, the behavior seen above would indeed
>> >> match JSON files. However, this could come at a cost of not fully
>> utilizing
>> >> the power of the underlying data-source, and could lead to performance
>> >> issues.
>> >>
>> >> *In-consistent Results can happen when:*
>> >>
>> >> (1) Dataset has mixed-cases for fields. Example seen above. While this
>> >> might not be very common, the concerns are still valid*, *since
>> substantial
>> >> Drill users are exploring Drill for ETL cases where Data is not
>> completely
>> >> sanitized.
>> >>
>> >> (2) Data is consistent w.r.t case, but the query text has non-matching
>> >> case. While some could term this as user error, it could still cause
>> issues
>> >> when users, applications or the underlying datasources change.
>> >>
>> >> In both the above cases, Drill would silently perform the query and
>> return
>> >> results which could be either *none, partial, complete/correct or
>> entirely*
>> >> *wrong*.
>> >>
>> >> Some specific questions:
>> >>
>> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive
>> Data-sources.
>> >> *For users who prefer the flexibility, how can Drill ensure that the
>> >> underlying data-source can return case-insensitive results.
>> >>
>> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY
>> support
>> >> case-sensitive behavior for data-sources. Users coming from
>> case-sensitive
>> >> databases might want results matching the provided case. Example using
>> the
>> >> above data:
>> >>
>> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
>> >> +------+-------+--------+--------+
>> >>
>> >> | _id  | Name  | name   | NAME   |
>> >> +------+-------+--------+--------+
>> >> | ID1  | ABC   | null   | null   |
>> >> +------+-------+--------+--------+
>> >> | ID2  | null  | PQR    | null   |
>> >> +------+-------+--------+--------+
>> >> | ID3  | null  | null   | XYZ    |
>> >> +------+-------+--------+--------+
>> >>
>> >>
>> >> (3) How does Drill currently work with *MongoDB*, which i guess is a
>> >> case-sensitive database? Have these issues ever been discussed
>> previously?
>> >>
>> >>
>> >> Thanks in advance. I'd appreciate any helpful response.
>> >>
>> >> Regards,
>> >> Abhishek
>> >>
>> >>
>> >> [1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity
>> >>
>>

Re: Working with Case-Sensitive Data-sources

Reply via email to