Re: Working with Case-Sensitive Data-sources

Jacques Nadeau Mon, 14 Mar 2016 16:28:55 -0700

I don't think it is that simple since there are some types of things that
we can't pushdown that will cause inconsistent results.


For example, assuming that all values of x are positive, the following two
queries should return the same result

select * from hbase where x = 5
select * from hbase where abs(x) = 5

However, if the field x is sometimes 'x' and sometimes 'X', we're going to
different results between the first query and the second. That is why I
think we need to guarantee that even when optimization rules fails, we have
the same plan meaning. In essence, all plans should be valid. If you get to
a place where a rule changes the data, then the original plan was
effectively invalid.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Mar 14, 2016 at 3:46 PM, Jinfeng Ni <[email protected]> wrote:

> Project pushdown should always happen. If you see project pushdown
> does not happen for your HBase query, then it's a bug.
>
> However, if you submit two physical plans, one with project pushdown,
> another one without project pushdown, but they return different
> results for HBase query. I'll not call this a bug.
>
>
>
> On Mon, Mar 14, 2016 at 2:54 PM, Jacques Nadeau <[email protected]>
> wrote:
> > Agree with Zelaine, plan changes/optimizations shouldn't change results.
> > This is a bug.
> >
> > Drill is focused on being case-insensitive, case-preserving. Each storage
> > plugin implements its own case sensitivity policy when working with
> > columns/fields and should be documented. It isn't practical to make HBase
> > case-insensitive so it should behave case sensitivity. DFS formats (as
> > opposed to HBase) are entirely under Drill's control and thus target
> > case-insensitive, case-preserving operation.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Mon, Mar 14, 2016 at 2:43 PM, Jinfeng Ni <[email protected]>
> wrote:
> >
> >> Abhishek
> >>
> >> Great question. Here is what I understand regarding the case sensitive
> >> policy.
> >>
> >> Drill's case sensitivity policy (case insensitive and case preserving)
> >> applies to the execution engine in Drill; it does not enforce the case
> >> sensitivity policy to all the storage plugin. A storage plugin could
> >> decide and implement it's own policy.
> >>
> >> Why would the pushdown impact the case sensitivity when query HBase?
> >> Without project pushdown, HBase storage plugin will return all the
> >> data, and it's up to Drill's execution Project operator to apply the
> >> case insensitive policy.  With the project pushdown, Drill will pass
> >> the list of column names to HBase storage plugin, and HBase decides to
> >> apply it's case sensitivity policy when scan the data.
> >>
> >> Adding an option to make case sensitive storage plugin honor case
> >> insensitive policy seems to be a good idea. The question is whether
> >> the underneath storage (like HBase) will support such mode.
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Mar 14, 2016 at 2:09 PM, Zelaine Fong <[email protected]>
> wrote:
> >> > Abhishek,
> >> >
> >> > I guess you're arguing that Drill's current behavior of honoring the
> case
> >> > sensitive nature of the underlying data source (in this case, HBase
> and
> >> > MapR-DB) will be confusing for Drill users who are accustomed to
> Drill's
> >> > case insensitive behavior.
> >> >
> >> > I can see arguments both ways.
> >> >
> >> > But the part I think is confusing is that the behavior differs
> depending
> >> on
> >> > whether or not projections and filters are pushed down to the data
> >> source.
> >> > If the push down is done, then the behavior is case sensitive
> >> > (corresponding to the data source).  But if pushdown doesn't happen,
> then
> >> > the behavior is case insensitive.  That difference seems inconsistent
> and
> >> > undesirable -- unless you argue that there are instances where you
> would
> >> > want one behavior vs the other.  But it seems like that should be
> >> > orthogonal and separate from whether pushdowns are applied.
> >> >
> >> > -- Zelaine
> >> >
> >> > On Mon, Mar 14, 2016 at 1:40 AM, Abhishek Girish <[email protected]>
> >> wrote:
> >> >
> >> >> Hello all,
> >> >>
> >> >> As I understand, Drill by design is case-insensitive, w.r.t column
> names
> >> >> within a table or file [1]. While this provides great flexibility and
> >> works
> >> >> well with many data-sources, there are issues when working with
> >> >> case-sensitive data-sources such as HBase / MapR-DB.
> >> >>
> >> >> Consider the following JSON file:
> >> >>
> >> >> {"_id": "ID1",
> >> >>  *"Name"* : "ABC",
> >> >>  "Age" : "25",
> >> >>  "Phone" : null
> >> >> }
> >> >> {"_id": "ID2",
> >> >>  *"name"* : "PQR",
> >> >>  "Age" : "30",
> >> >>  "Phone" : "408-123-456"
> >> >> }
> >> >> {"_id": "ID3",
> >> >>  *"NAME"* : "XYZ",
> >> >>  "Phone" : ""
> >> >> }
> >> >>
> >> >> Note that the case of the name field within the JSON file is of
> >> mixed-case.
> >> >>
> >> >> From Drill, while querying the JSON file directly (or corresponding
> >> content
> >> >> in Parquet or Text formats), we get results which we as Drill users
> have
> >> >> come to expect:
> >> >>
> >> >> > select NAME from mfs.`/tmp/json/a.json`;
> >> >> +-------+
> >> >> | NAME  |
> >> >> +-------+
> >> >> | ABC   |
> >> >> | PQR   |
> >> >> | XYZ   |
> >> >> +-------+
> >> >>
> >> >>
> >> >> However, while querying a case-sensitive datasource (*with pushdown
> >> >> enabled*)
> >> >> the following results are returned. The case provided in the query
> text
> >> is
> >> >> honored and would determine the results. This could come as a *slight
> >> >> surprise to certain Drill users* exploring/migrating to new Databases
> >> >> (using new Storage / Format plugins within Drill)
> >> >>
> >> >> > select *Name* from mfs.`/tmp/json/a`;
> >> >> +-------+
> >> >> | Name  |
> >> >> +-------+
> >> >> | ABC   |
> >> >> +-------+
> >> >>
> >> >> > select *name* from mfs.`/tmp/json/a`;
> >> >> +-------+
> >> >> | name  |
> >> >> +-------+
> >> >> | PQR   |
> >> >> +-------+
> >> >>
> >> >> > select *NAME* from mfs.`/tmp/json/a`;
> >> >> +-------+
> >> >> | NAME  |
> >> >> +-------+
> >> >> | XYZ   |
> >> >> +-------+
> >> >>
> >> >>
> >> >> > select *nAME* from mfs.`/tmp/json/a`;
> >> >> +-------+
> >> >> | nAME  |
> >> >> +-------+
> >> >> +-------+
> >> >> No rows selected
> >> >>
> >> >> There is no easy way to get all matching rows (irrespective of the
> case
> >> of
> >> >> the column name). In the above example, the first row matching the
> >> provided
> >> >> case is returned.
> >> >>
> >> >>
> >> >> > select *Name, name, NAME* from mfs.`/tmp/json/a`;
> >> >> +-------+--------+--------+
> >> >> | Name  | name0  | NAME1  |
> >> >> +-------+--------+--------+
> >> >> | ABC   | ABC    | ABC    |
> >> >> +-------+--------+--------+
> >> >>
> >> >> > select *NAME, Name, name* from mfs.`/tmp/json/a`;
> >> >> +-------+--------+--------+
> >> >> | NAME  | Name0  | name1  |
> >> >> +-------+--------+--------+
> >> >> | XYZ   | XYZ    | XYZ    |
> >> >> +-------+--------+--------+
> >> >>
> >> >>
> >> >> If Pushdown features are disabled, the behavior seen above would
> indeed
> >> >> match JSON files. However, this could come at a cost of not fully
> >> utilizing
> >> >> the power of the underlying data-source, and could lead to
> performance
> >> >> issues.
> >> >>
> >> >> *In-consistent Results can happen when:*
> >> >>
> >> >> (1) Dataset has mixed-cases for fields. Example seen above. While
> this
> >> >> might not be very common, the concerns are still valid*, *since
> >> substantial
> >> >> Drill users are exploring Drill for ETL cases where Data is not
> >> completely
> >> >> sanitized.
> >> >>
> >> >> (2) Data is consistent w.r.t case, but the query text has
> non-matching
> >> >> case. While some could term this as user error, it could still cause
> >> issues
> >> >> when users, applications or the underlying datasources change.
> >> >>
> >> >> In both the above cases, Drill would silently perform the query and
> >> return
> >> >> results which could be either *none, partial, complete/correct or
> >> entirely*
> >> >> *wrong*.
> >> >>
> >> >> Some specific questions:
> >> >>
> >> >> (1) *Supporting Case-In-sensitive Behavior for Case-Sensitive
> >> Data-sources.
> >> >> *For users who prefer the flexibility, how can Drill ensure that the
> >> >> underlying data-source can return case-insensitive results.
> >> >>
> >> >> (2) *Supporting Case-Sensitive Behavior. *How can Drill OPTIONALLY
> >> support
> >> >> case-sensitive behavior for data-sources. Users coming from
> >> case-sensitive
> >> >> databases might want results matching the provided case. Example
> using
> >> the
> >> >> above data:
> >> >>
> >> >> > select _id, *Name, name, NAME* from mfs.`/tmp/json/a`;
> >> >> +------+-------+--------+--------+
> >> >>
> >> >> | _id  | Name  | name   | NAME   |
> >> >> +------+-------+--------+--------+
> >> >> | ID1  | ABC   | null   | null   |
> >> >> +------+-------+--------+--------+
> >> >> | ID2  | null  | PQR    | null   |
> >> >> +------+-------+--------+--------+
> >> >> | ID3  | null  | null   | XYZ    |
> >> >> +------+-------+--------+--------+
> >> >>
> >> >>
> >> >> (3) How does Drill currently work with *MongoDB*, which i guess is a
> >> >> case-sensitive database? Have these issues ever been discussed
> >> previously?
> >> >>
> >> >>
> >> >> Thanks in advance. I'd appreciate any helpful response.
> >> >>
> >> >> Regards,
> >> >> Abhishek
> >> >>
> >> >>
> >> >> [1]
> https://drill.apache.org/docs/lexical-structure/#case-sensitivity
> >> >>
> >>
>

Re: Working with Case-Sensitive Data-sources

Reply via email to