Hi everyone,
When accessing Druid from Spark through Calcite JDBC and the Druid adapter, I have
been seeing strange results.
The Druid data schema is defined as:
{
  "type": "custom",
  "name": "xxx",
  "factory": "org.apache.calcite.adapter.druid.DruidSchemaFactory",
  "operand": {
    "url": "http://<host>:8082",
    "coordinatorUrl": "http://<host>:8081"
  },
  "tables": [
    {
      "name": "yyy",
      "factory": "org.apache.calcite.adapter.druid.DruidTableFactory",
      "operand": {
        "dataSource": "<DruidDataSource>",
        "interval": "2016-11-08T00:00:00.000Z/2016-12-31T00:00:00.000Z",
        "timestampColumn": "timestamp",
        "dimensions": [
          "commentorId",
          "hashTagId",
          "value"
        ],
        "metrics": [
          {
            "type": "count",
            "name": "count"
          }
        ]
      }
    }
  ]
}
With a JDBC client, queries like these work fine:
1. select * from yyy where hashTagId='hashTag_01'
2. select badcount.hashTagId, badcount.bad, totalcount.total,
          badcount.bad/totalcount.total*100 as bad_pct
   from
     (select hashTagId, cast(count(*) as double) as bad from yyy
      where value='bad' group by hashTagId) as badcount
   join
     (select hashTagId, cast(count(*) as double) as total
      from yyy group by hashTagId) as totalcount
   on (badcount.hashTagId=totalcount.hashTagId)
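For reference, the JDBC-client path can be sketched from Python like this (a sketch only; the jaydebeapi package, the jar names, and the model path are assumptions/placeholders, not the exact client I used):

```python
# Sketch: querying Calcite JDBC from Python via jaydebeapi (assumption:
# the Calcite jars are on the local filesystem; paths are placeholders).
sql = "select * from yyy where hashTagId = 'hashTag_01'"

# import jaydebeapi
# conn = jaydebeapi.connect(
#     "org.apache.calcite.jdbc.Driver",
#     "jdbc:calcite:model=<path to schema json>;caseSensitive=false",
#     jars=["calcite-core.jar", "calcite-druid.jar"])
# cur = conn.cursor()
# cur.execute(sql)   # returns the expected rows, unlike the Spark path below
# print(cur.fetchall())
```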
However, in Spark 2.0, the behavior is strange:
1. df_yyy = spark.read.format("jdbc").option("url", "jdbc:calcite:model=<path to
schema json>;caseSensitive=false")...
2. df_yyy.show() -- works fine, returns all records
3. df_yyy.filter($"hashTagId" === "hashTag_01").count() -- returns the correct
number of records
4. df_yyy.filter($"hashTagId" === "hashTag_01").show() -- returns an empty result set
5. df_yyy.join(<another mysql based data frame>, <condition>).show() -- returns
an empty result set (any join returns an empty result set)
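One possible workaround (a sketch, not verified against this setup): Spark's JDBC reader accepts a parenthesized subquery in place of a table name through the dbtable option, so the filter can be handed to Calcite as SQL instead of being translated by Spark's pushdown:

```python
# Sketch of a workaround: push the filter into a subquery that Calcite
# evaluates, bypassing Spark's predicate translation. The model path is
# a placeholder.
url = "jdbc:calcite:model=<path to schema json>;caseSensitive=false"
subquery = "(select * from yyy where hashTagId = 'hashTag_01') t"

# df = (spark.read.format("jdbc")
#       .option("url", url)
#       .option("dbtable", subquery)
#       .load())
# df.show()  # should match what the plain JDBC client returns
```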
I suspect there is a conflict between the SQL that Spark generates and what
Calcite JDBC expects. Are there special properties to set in the JDBC connection
string to make it work with Spark? Is there a Calcite JDBC log file that I can dig
through? My googling turned up no similar usage of Spark/Calcite/Druid; is this
the right way to access Druid from Spark? (Maybe this question is better suited
for the Spark/Druid community...)
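On the logging question: as far as I know, Calcite logs through slf4j rather than writing a dedicated log file, so raising the log level for its loggers in whichever backend is on the classpath should surface the queries it generates (a sketch, assuming a log4j 1.x properties file; the logger names follow Calcite's package names):

```properties
# Sketch, assuming a log4j 1.x backend on the classpath:
# surface planner activity and the queries the adapter generates.
log4j.logger.org.apache.calcite=DEBUG
```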
Thanks.
Herman.