Re: A question about a potential bug in Druid Joins

2021-06-24 Thread Jason Chen
Hey Gian,

Thank you very much for your quick reply. The explanation is very clear. I 
can confirm that the query works with `api_client_id + 0`:

> WITH DIM AS (
>   SELECT (api_client_id + 0) AS api_client_id, title
>   FROM sales_model_api_clients_dimension AS API_CLIENTS
>   WHERE API_CLIENTS.shop_id = 25248974
> ),
> FACTS AS (
>   SELECT api_client_id, COUNT(*) as api_client_count
>   FROM sales_facts AS ORDERS
>   WHERE ORDERS.__time >= TIMESTAMP '2021-06-17 00:00:00' AND ORDERS.__time < 
> TIMESTAMP '2021-06-18 00:00:00' AND ORDERS.shop_id = 25248974
>   GROUP BY 1
> )
>
> SELECT DIM.api_client_id, DIM.title, FACTS.api_client_count
> FROM FACTS
> LEFT JOIN DIM ON FACTS.api_client_id = DIM.api_client_id



The right-hand-side Scan subquery now has a virtual column that carries the 
"outputType":


      "right":{
         "type":"query",
         "query":{
            "queryType":"scan",
            "dataSource":{
               "type":"table",
               "name":"api_clients_inline_dimension_20210619"
            },
            "intervals":{
               "type":"intervals",
               "intervals":[
                  "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
               ]
            },
            "virtualColumns":[
               {
                  "type":"expression",
                  "name":"v0",
                  "expression":"(\"api_client_id\" + 0)",
                  "outputType":"LONG"
               }
            ],
            "resultFormat":"compactedList",
            "batchSize":20480,
            "order":"none",
            "filter":null,
            "columns":[
               "title",
               "v0"
            ],
            "legacy":false,
            "context":{
               "sqlOuterLimit":100,
               "sqlQueryId":"a878dfe1-94d4-4b7f-b0b7-d36290e763da"
            },
            "descending":false,
            "granularity":{
               "type":"all"
            }
         }
      },
      "rightPrefix":"j0.",
      "condition":"(\"d0\" == \"j0.v0\")",
      "joinType":"LEFT"
   },


For comparison, the original right-hand-side subquery has plain columns without 
any type information:

      "right":{
         "type":"query",
         "query":{
            "queryType":"scan",
            "dataSource":{
               "type":"table",
               "name":"api_clients_inline_dimension_20210619"
            },
            "intervals":{
               "type":"intervals",
               "intervals":[
                  "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
               ]
            },
            "virtualColumns":[

            ],
            "resultFormat":"compactedList",
            "batchSize":20480,
            "order":"none",
            "filter":null,
            "columns":[
               "api_client_id",
               "title"
            ],
            "legacy":false,
            "context":{
               "sqlOuterLimit":100,
               "sqlQueryId":"ea2c369e-c553-424f-b2ec-13d282ec315c"
            },
            "descending":false,
            "granularity":{
               "type":"all"
            }
         }
      },




Regards,
Jianbin Chen
On Jun 24, 2021, 5:08 PM -0400, dev@druid.apache.org, wrote:

Re: A question about a potential bug in Druid Joins

2021-06-24 Thread Gian Merlino
Hey Jason,

I suppose you're talking about this patch:
https://github.com/apache/druid/pull/10942

1) The patch applies to a situation where the left-hand column of a join
condition is long-typed, and the right-hand side is string-typed. I suspect
the DIM CTE is getting run as a Scan subquery, and the type information is
getting lost here:
https://github.com/apache/druid/blob/druid-0.21.1/processing/src/main/java/org/apache/druid/query/scan/ScanQueryQueryToolChest.java#L177-L178.
That means that when we build an index on the subquery, we'll do it using
string keys, since strings are used when the actual type is unknown. The
code also suggests a possible workaround. If you look a few lines up, you
can see that type info is only present for virtual columns. So you could
try doing "SELECT api_client_id + 0" instead of "SELECT api_client_id",
which should cause a virtual column to be generated and thereby preserve
the type info. I haven't tested this, though, so let us know if it works.
And like the comment says: in the future we'd like to be able to fill
in the real type info. [* see note below for details]
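The failure mode described above can be illustrated with a toy sketch (plain Python, not Druid code): when the subquery's column types are unknown, the join index is built with string keys, and a long-typed probe from the fact side then has no matching key type.

```python
# Toy analogy for the string-keyed join index built when types are unknown.
# The dimension-side rows end up indexed by *string* keys:
dim_rows = [("123", "App A"), ("456", "App B")]
index = {key: title for key, title in dim_rows}

# The fact side probes with long-typed keys. A direct long-typed lookup
# misses, because the key types do not match:
fact_key = 123
assert fact_key not in index           # long probe against string keys fails
assert str(fact_key) in index          # only a cast-to-string probe matches
```

This is only an analogy for the type mismatch; in Druid the mismatch surfaces as the `UnsupportedOperationException` from `makeLongProcessor`, since the patched cast logic is not yet in 0.21.1.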

2) I'm not sure when the next Druid release will be published. There's
usually one release every few months. The next release from master will
include this patch. That'll likely be the very next release, unless a patch
release is needed for some reason. (Patch releases are usually made from
the prior release branch instead of from master.)

[*] Some details on why we don't fill in type info here today. The
challenge is that the native query toolchests don't currently know the data
types that will actually be encountered. Usually that's fine, because we
only need to know the *result* types, and most native query outputs are
strongly typed. For example, the groupBy toolchest is able to introspect
all of its dimensions and aggregators to generate a fully typed result
signature. This isn't possible for plain Scan queries though. All it has is
the names of the input columns. So, we'd need to carry the type info
through from somewhere else; perhaps from the SQL layer.
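The behavior in the footnote can be sketched as follows (hypothetical helper names, not Druid's actual API): a Scan query only knows the *names* of its plain columns, so they default to STRING, while virtual columns carry a declared output type, which is why the `+ 0` workaround preserves LONG.

```python
# Hypothetical sketch of building a result signature for a Scan query.
# Plain columns have unknown types and default to "STRING"; virtual
# columns carry their declared outputType.
def scan_signature(columns, virtual_column_types):
    sig = {}
    for col in columns:
        # Only the column *name* is known for plain columns, so fall
        # back to STRING unless a virtual column declares a type.
        sig[col] = virtual_column_types.get(col, "STRING")
    return sig

# Original query: bare "api_client_id" column -> type info lost.
assert scan_signature(["api_client_id", "title"], {}) == {
    "api_client_id": "STRING", "title": "STRING"}

# Workaround: "api_client_id + 0" plans as virtual column v0 with
# outputType LONG, matching the JSON shown earlier in the thread.
assert scan_signature(["title", "v0"], {"v0": "LONG"}) == {
    "title": "STRING", "v0": "LONG"}
```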


On Thu, Jun 24, 2021 at 1:27 PM Jason Chen 
wrote:

> [quoted message trimmed]


Re: A question about a potential bug in Druid Joins

2021-06-24 Thread Ben Krug
To add details for any devs who might be able to answer this: the query
given does not throw an error if the GROUP BY is taken out. The grouped-by
field, api_client_id, is a long, but it seems as if grouping by it somehow
gets it cast to a string. Jason was asking why this happens, and where in
the code.

On Thu, Jun 24, 2021 at 1:27 PM Jason Chen 
wrote:

> [quoted message trimmed]


A question about a potential bug in Druid Joins

2021-06-24 Thread Jason Chen
Hello, Druid community,

Ben Krug from Imply pointed me to this mailing list for my question about Druid 
joins. We have the following Druid join query that may trigger a bug in Druid:
> WITH DIM AS (
>   SELECT api_client_id, title
>   FROM inline_dimension_api_clients_1 AS API_CLIENTS
> ),
> FACTS AS (
>   SELECT api_client_id, COUNT(*) as api_client_count
>   FROM inline_data AS ORDERS
>   WHERE ORDERS.__time >= TIMESTAMP '2021-06-10 00:00:00' AND ORDERS.__time < 
> TIMESTAMP '2021-06-18 00:00:00' AND ORDERS.shop_id = 25248974
>   GROUP BY 1
> )
> SELECT DIM.title, FACTS.api_client_id, FACTS.api_client_count
> FROM FACTS
> LEFT JOIN DIM ON FACTS.api_client_id = DIM.api_client_id

So the “api_client_id” field is `long`-typed in both the “inline_data” and 
“inline_dimension_api_clients_1” datasources. However, when doing a join, the 
makeLongProcessor method is called and throws an 
“UnsupportedOperationException" because "index.keyType()" is string in MapIndex.

Then I found that Gian Merlino has a PR that fixes the issue. I have validated 
that this fix works for our case in my local Druid cluster. The fix is not 
included in Druid v0.21.1.

I have the following questions:

1. Why is the index key type `string` rather than `long` for my subquery? Is it 
implicitly transformed to `string` for a performance benefit?
2. When will you publish a new Druid release? Will the fix be part of the next 
release?


Thank you
Jason Chen



Jason (Jianbin) Chen
Senior Data Developer
p: +1 2066608351 | e: jason.c...@shopify.com
a: 234 Laurier Ave W Ottawa, ON K1N 5X8