Re: Working with Case-Sensitive Data-sources

Jacques Nadeau Mon, 14 Mar 2016 14:56:00 -0700

Agree with Zelaine, plan changes/optimizations shouldn't change results.
This is a bug.


Drill is focused on being case-insensitive, case-preserving. Each storage
plugin implements its own case sensitivity policy when working with
columns/fields and should be documented. It isn't practical to make HBase
case-insensitive so it should behave case sensitivity. DFS formats (as
opposed to HBase) are entirely under Drill's control and thus target
case-insensitive, case-preserving operation.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:

> Abhishek
>
> Great question. Here is what I understand regarding the case sensitive
> policy.
>
> Drill's case sensitivity policy (case insensitive and case preserving)
> applies to the execution engine in Drill; it does not enforce the case
> sensitivity policy to all the storage plugin. A storage plugin could
> decide and implement it's own policy.
>
> Why would the pushdown impact the case sensitivity when query HBase?
> Without project pushdown, HBase storage plugin will return all the
> data, and it's up to Drill's execution Project operator to apply the
> case insensitive policy.  With the project pushdown, Drill will pass
> the list of column names to HBase storage plugin, and HBase decides to
> apply it's case sensitivity policy when scan the data.
>
> Adding an option to make case sensitive storage plugin honor case
> insensitive policy seems to be a good idea. The question is whether
> the underneath storage (like HBase) will support such mode.
>
>
>
>
>
>
> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <zf...@maprtech.com> wrote:
> > Abhishek,
> >
> > I guess you're arguing that Drill's current behavior of honoring the case
> > sensitive nature of the underlying data source (in this case, HBase and
> > MapR-DB) will be confusing for Drill users who are accustomed to  Drill's
> > case insensitive behavior.
> >
> > I can see arguments both ways.
> >
> > But the part I think is confusing is that the behavior differs depending
> on
> > whether or not projections and filters are pushed down to the data
> source.
> > If the push down is done, then the behavior is case sensitive
> > (corresponding to the data source).  But if pushdown doesn't happen, then
> > the behavior is case insensitive.  That difference seems inconsistent and
> > undesirable -- unless you argue that there are instances where you would
> > want one behavior vs the other.  But it seems like that should be
> > orthogonal and separate from whether pushdowns are applied.
> >
> > -- Zelaine
> >
> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <agir...@mapr.com>
> wrote:
> >
> >> Hello all,
> >>
> >> As I understand, Drill by design is case-insensitive, w.r.t column names
> >> within a table or file [1]. While this provides great flexibility and
> works
> >> well with many data-sources, there are issues when working with
> >> case-sensitive data-sources such as HBase / MapR-DB.
> >>
> >> Consider the following JSON file:
> >>
> >> {"_id": "ID1",
> >>  *"Name"* : "ABC",
> >>  "Age" : "25",
> >>  "Phone" : null
> >> }
> >> {"_id": "ID2",
> >>  *"name"* : "PQR",
> >>  "Age" : "30",
> >>  "Phone" : "408-123-456"
> >> }
> >> {"_id": "ID3",
> >>  *"NAME"* : "XYZ",
> >>  "Phone" : ""
> >> }
> >>
> >> Note that the case of the name field within the JSON file is of
> mixed-case.
> >>
> >> From Drill, while querying the JSON file directly (or corresponding
> content
> >> in Parquet or Text formats), we get results which we as Drill users have
> >> come to expect:
> >>
> >> > select NAME from mfs.`/tmp/json/a.json`;
> >> +-------+
> >> | NAME  |
> >> +-------+
> >> | ABC   |
> >> | PQR   |
> >> | XYZ   |
> >> +-------+
> >>
> >>
> >> However, while querying a case-sensitive datasource (*with pushdown
> >> enabled*)
> >> the following results are returned. The case provided in the query text
> is
> >> honored and would determine the results. This could come as a *slight
> >> surprise to certain Drill users* exploring/migrating to new Databases
> >> (using new Storage / Format plugins within Drill)
> >>
> >> > select *Name* from mfs.`/tmp/json/a`;
> >> +-------+
> >> | Name  |
> >> +-------+
> >> | ABC   |
> >> +-------+
> >>
> >> > select *name* from mfs.`/tmp/json/a`;
> >> +-------+
> >> | name  |
> >> +-------+
> >> | PQR   |
> >> +-------+
> >>
> >> > select *NAME* from mfs.`/tmp/json/a`;
> >> +-------+
> >> | NAME  |
> >> +-------+
> >> | XYZ   |
> >> +-------+
> >>
> >>
> >> > select *nAME* from mfs.`/tmp/json/a`;
> >> +-------+
> >> | nAME  |
> >> +-------+
> >> +-------+
> >> No rows selected
> >>
> >> There is no easy way to get all matching rows (irrespective of the case
> of
> >> the column name). In the above example, the first row matching the
> provided
> >> case is returned.
> >>
> >>
> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
> >> +-------+--------+--------+
> >> | Name  | name0  | NAME1  |
> >> +-------+--------+--------+
> >> | ABC   | ABC    | ABC    |
> >> +-------+--------+--------+
> >>
> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
> >> +-------+--------+--------+
> >> | NAME  | Name0  | name1  |
> >> +-------+--------+--------+
> >> | XYZ   | XYZ    | XYZ    |
> >> +-------+--------+--------+
> >>
> >>
> >> If Pushdown features are disabled, the behavior seen above would indeed
> >> match JSON files. However, this could come at a cost of not fully
> utilizing
> >> the power of the underlying data-source, and could lead to performance
> >> issues.
> >>
> >> *In-consistent Results can happen when:*
> >>
> >> (1) Dataset has mixed-cases for fields. Example seen above. While this
> >> might not be very common, the concerns are still valid*, *since
> substantial
> >> Drill users are exploring Drill for ETL cases where Data is not
> completely
> >> sanitized.
> >>
> >> (2) Data is consistent w.r.t case, but the query text has non-matching
> >> case. While some could term this as user error, it could still cause
> issues
> >> when users, applications or the underlying datasources change.
> >>
> >> In both the above cases, Drill would silently perform the query and
> return
> >> results which could be either *none, partial, complete/correct or
> entirely*
> >> *wrong*.
> >>
> >> Some specific questions:
> >>
> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive
> Data-sources.
> >> *For users who prefer the flexibility, how can Drill ensure that the
> >> underlying data-source can return case-insensitive results.
> >>
> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY
> support
> >> case-sensitive behavior for data-sources. Users coming from
> case-sensitive
> >> databases might want results matching the provided case. Example using
> the
> >> above data:
> >>
> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
> >> +------+-------+--------+--------+
> >>
> >> | _id  | Name  | name   | NAME   |
> >> +------+-------+--------+--------+
> >> | ID1  | ABC   | null   | null   |
> >> +------+-------+--------+--------+
> >> | ID2  | null  | PQR    | null   |
> >> +------+-------+--------+--------+
> >> | ID3  | null  | null   | XYZ    |
> >> +------+-------+--------+--------+
> >>
> >>
> >> (3) How does Drill currently work with *MongoDB*, which i guess is a
> >> case-sensitive database? Have these issues ever been discussed
> previously?
> >>
> >>
> >> Thanks in advance. I'd appreciate any helpful response.
> >>
> >> Regards,
> >> Abhishek
> >>
> >>
> >> [1] https://drill.apache.org/docs/lexical-structure/#case-sensitivity
> >>
>

Re: Working with Case-Sensitive Data-sources

Reply via email to