Hi everyone,

When accessing Druid through Calcite JDBC and Druid adapter from Spark, I have 
been experiencing strange results. 

Druid data schema is defined as:

    {
      "type": "custom",
      "name": “xxx",
      "factory": "org.apache.calcite.adapter.druid.DruidSchemaFactory",
      "operand": {
        "url": "http://<host>:8082",
        "coordinatorUrl": "http://<host>:8081"
      },
      "tables": [
        {
          "name": “yyy",
          "factory": "org.apache.calcite.adapter.druid.DruidTableFactory",
          "operand": {
            "dataSource": “<DruidDataSource>t",
            "interval": "2016-11-08T00:00:00.000Z/2016-12-31T00:00:00.000Z",
            "timestampColumn": "timestamp",
            "dimensions": [
              "commentorId", 
              "hashTagId",
              “value"
            ],
          "metrics": [
                                {
                                  "type" : "count",
                                  "name" : "count"
                                }
                   ]
          }
        }
      ]

with a  JDBC client, queries like these work fine:
1. select * from yay where hashTagId=“hashTag_01”
2. select badcount.hashTagId, badcount.bad,totalcount.total, 
badcount.bad/totalcount.total*100 as bad_pct 
        from 
                (select hashTagId, cast(count(*) as double) as bad from yyy 
where value='bad' group by hashTagId) as badcount
                    join 
                                (select hashTagId, cast(count(*) as double) as 
total from yyy group by hashTagId) as totalcount
                            on 
                                        
(badcount.hashTagId=totalcount.hashTagId)

However, in spark 2.0, it is strange
1. df_yyy = spark.read.format(“jdbc”).option(“url”,jdbc:calcite:model=<path to 
schema json>;caseSensitive=false”)…
2. df_yyy.show() —— works fine, returns all records
3. df_yyy.filter($”hashTagId”=“hashTag_01”).count() — returns the correct 
number of records
4. df_yyy.filter($”hashTagId”=“hashTag_01”).show() — returns empty result set
5. df_yyy.join(<another mysql based data frame>, <condition>).show() —— returns 
empty result set (any joins returns empty result set)

I am suspecting there are conflicts between  how spark parses SQL and how 
Calcite JDBC does. are there special properties to set as of JDBC string to 
make it work with Spark? Is there a Calcite JDBC log file that I can dig 
through? I did some googling and don’t see similar usage with 
spark/calcite/druid, is this the right way accessing druid from spark? (may be 
this is a question better for spark/druid community…)

Thanks.
Herman.

Reply via email to