Nick Muerdter created KYLIN-1527:
------------------------------------
Summary: Columns with all NULL values can't be queried
Key: KYLIN-1527
URL: https://issues.apache.org/jira/browse/KYLIN-1527
Project: Kylin
Issue Type: Bug
Affects Versions: v1.5.0
Reporter: Nick Muerdter
I'm running into some issues in Kylin 1.5 when trying to query columns that
contain only NULL values. Here's a quick description of my setup and what I'm
seeing:
- I have a Hive ORC table with a "denied_reason" column.
- The data is partitioned by day.
- Before 2015-02-09, the values for "denied_reason" are all NULL.
- Beginning on 2015-02-09, the values of "denied_reason" are a combination of
NULLs and various string values.
If I build a new data cube with the initial build going until 2015-02-08
(during which time the "denied_reason" column will always be NULL), and then I
try to query the resulting cube, any query involving the "denied_reason" column
will fail with this error:
{code}
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
> Error while executing SQL "SELECT COUNT(*) FROM default.logs WHERE
> denied_reason IS NULL LIMIT 50000": Dictionary for DEFAULT.LOGS.DENIED_REASON
> is not found
{code}
If I then re-build the existing cube to include additional days after
2015-02-09 (during which time the "denied_reason" column has values), the query
will still fail with the same "Dictionary for DEFAULT.LOGS.DENIED_REASON is not
found" error.
If I attempt to merge the segments from before 2015-02-09 with the segments
afterwards, I get this error during the segment "Merge Cuboid Data" phase:
{code}
Error: java.lang.NullPointerException at
com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187) at
com.google.common.cache.LocalCache.get(LocalCache.java:3964) at
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
at
org.apache.kylin.dict.DictionaryManager.getDictionaryInfo(DictionaryManager.java:114)
at
org.apache.kylin.dict.DictionaryManager.getDictionary(DictionaryManager.java:108)
at
org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:183)
at
org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:62)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784) at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:415) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{code}
If I purge the cube and perform the initial build with dates spanning both
before and after 2015-02-09 (so during this time the "denied_reason" column
would both have days containing all NULL values, as well as real values), then
my queries work as expected:
{code}
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
> 2285375
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NOT NULL;
> 35936
{code}
So it seems like the issue is that if the data cube initially gets built with a
column only containing NULL values, then that column becomes unqueryable, and
the segments can't be merged. However, if the cube is initially built with at
least some data in each column, then all the NULL data works as expected.
I hope that makes sense, but let me know if anything isn't clear. And if it
helps, I'm pretty sure I had done similar builds with NULL data in Kylin 1.2
without issues, so this might be a new issue somewhere between Kylin 1.2 and
1.5 (however, I'm not totally sure about that, so let me know if it would help
for me to get my Kylin 1.2 instance spun up again to verify).
(And I was attempting to come up with a smaller, reproducible case to help
demonstrate this issue, but I actually ended up encountering slightly different
NULL issues when trying to build that test case. Basically, I was getting this
error during the "Build Phase" of the initial cube build if my simpler example
cube was being built against a date range with only NULL data:
https://gist.github.com/GUI/1ddf3624fa1bd10e53cf I wasn't sure if that related
to this same NULL issue I outlined above, so let me know if you'd like for me
to provide more detail or open a separate issue for that.)
Thanks you!
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)