Nick Muerdter created KYLIN-1527:
------------------------------------

             Summary: Columns with all NULL values can't be queried
                 Key: KYLIN-1527
                 URL: https://issues.apache.org/jira/browse/KYLIN-1527
             Project: Kylin
          Issue Type: Bug
    Affects Versions: v1.5.0
            Reporter: Nick Muerdter


I'm running into some issues in Kylin 1.5 when trying to query columns that 
contain only NULL values. Here's a quick description of my setup and what I'm 
seeing:

- I have a Hive ORC table with a "denied_reason" column.
- The data is partitioned by day.
- Before 2015-02-09, the values for "denied_reason" are all NULL.
- Beginning on 2015-02-09, the values of "denied_reason" are a combination of 
NULLs and various string values.

If I build a new data cube with the initial build going until 2015-02-08 
(during which time the "denied_reason" column will always be NULL), and then I 
try to query the resulting cube, any query involving the "denied_reason" column 
will fail with this error:

{code}
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
> Error while executing SQL "SELECT COUNT(*) FROM default.logs WHERE 
> denied_reason IS NULL LIMIT 50000": Dictionary for DEFAULT.LOGS.DENIED_REASON 
> is not found
{code}

If I then re-build the existing cube to include additional days after 
2015-02-09 (during which time the "denied_reason" column has values), the query 
will still fail with the same "Dictionary for DEFAULT.LOGS.DENIED_REASON is not 
found" error.

If I attempt to merge the segments from before 2015-02-09 with the segments 
afterwards, I get this error during the segment "Merge Cuboid Data" phase:

{code}
Error: java.lang.NullPointerException at 
com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187) at 
com.google.common.cache.LocalCache.get(LocalCache.java:3964) at 
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969) at 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829) 
at 
org.apache.kylin.dict.DictionaryManager.getDictionaryInfo(DictionaryManager.java:114)
 at 
org.apache.kylin.dict.DictionaryManager.getDictionary(DictionaryManager.java:108)
 at 
org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:183)
 at 
org.apache.kylin.engine.mr.steps.MergeCuboidMapper.map(MergeCuboidMapper.java:62)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784) at 
org.apache.hadoop.mapred.MapTask.run(MapTask.java:341) at 
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at 
java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{code}

If I purge the cube and perform the initial build with dates spanning both 
before and after 2015-02-09 (so during this time the "denied_reason" column 
would both have days containing all NULL values, as well as real values), then 
my queries work as expected:

{code}
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NULL;
> 2285375
SELECT COUNT(*) FROM default.logs WHERE denied_reason IS NOT NULL;
> 35936
{code}

So it seems like the issue is that if the data cube initially gets built with a 
column only containing NULL values, then that column becomes unqueryable, and 
the segments can't be merged. However, if the cube is initially built with at 
least some data in each column, then all the NULL data works as expected.

I hope that makes sense, but let me know if anything isn't clear. And if it 
helps, I'm pretty sure I had done similar builds with NULL data in Kylin 1.2 
without issues, so this might be a new issue somewhere between Kylin 1.2 and 
1.5 (however, I'm not totally sure about that, so let me know if it would help 
for me to get my Kylin 1.2 instance spun up again to verify).

(And I was attempting to come up with a smaller, reproducible case to help 
demonstrate this issue, but I actually ended up encountering slightly different 
NULL issues when trying to build that test case. Basically, I was getting this 
error during the "Build Phase" of the initial cube build if my simpler example 
cube was being built against a date range with only NULL data: 
https://gist.github.com/GUI/1ddf3624fa1bd10e53cf I wasn't sure if that related 
to this same NULL issue I outlined above, so let me know if you'd like for me 
to provide more detail or open a separate issue for that.)

Thanks you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to