Richard Calaba created KYLIN-1508:
-------------------------------------
Summary: NPE in Build Cube Step 3 when LOOKUP table is a Hive view: java.lang.NullPointerException at org.apache.kylin.source.hive.HiveTable.getSignature
Key: KYLIN-1508
URL: https://issues.apache.org/jira/browse/KYLIN-1508
Project: Kylin
Issue Type: Bug
Components: Job Engine
Affects Versions: v1.2, v1.3.0, v1.5.0, all
Environment: CentOS 6.7, Hortonworks Hadoop 2.2.4.2 - sandbox
Reporter: Richard Calaba
Assignee: Dong Li
Affected releases: The error occurs in the 1.2 release (downloaded as a binary
release from the Kylin home page). The same error also occurs in the latest
2.1 snapshot (compiled today, 3/12/16, from the Kylin Git repository using
branch 2.x-staging). Tests were done on a Hortonworks HDP 2.2.4 sandbox, so I
assume all current branches are affected.
Error Summary: Whereas it is possible to successfully build a cube whose FACT
table refers to a Hive view, it does not seem possible to build a cube whose
LOOKUP table refers to a Hive view. This appears to be a bug in the metadata
processing in Step 3 of the build process, which fails with the following
exception (release 1.2):
java.io.IOException: java.lang.NullPointerException
    at org.apache.kylin.dict.lookup.HiveTable.getSignature(HiveTable.java:72)
    at org.apache.kylin.dict.lookup.SnapshotTable.<init>(SnapshotTable.java:62)
    at org.apache.kylin.dict.lookup.SnapshotManager.buildSnapshot(SnapshotManager.java:85)
    at org.apache.kylin.cube.CubeManager.buildSnapshotTable(CubeManager.java:205)
    at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:60)
    at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:41)
    at org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:52)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:62)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:51)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
    at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:130)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.kylin.common.util.HadoopUtil.fixWindowsPath(HadoopUtil.java:76)
    at org.apache.kylin.common.util.HadoopUtil.makeURI(HadoopUtil.java:68)
    at org.apache.kylin.common.util.HadoopUtil.getFileSystem(HadoopUtil.java:63)
    at org.apache.kylin.dict.lookup.FileTable.getSizeAndLastModified(FileTable.java:85)
    at org.apache.kylin.dict.lookup.HiveTable.getSignature(HiveTable.java:57)
    ... 16 more
Error in the latest branch (2.1 snapshot):
java.io.IOException: java.lang.NullPointerException
    at org.apache.kylin.source.hive.HiveTable.getSignature(HiveTable.java:71)
    at org.apache.kylin.dict.lookup.SnapshotTable.<init>(SnapshotTable.java:64)
    at org.apache.kylin.dict.lookup.SnapshotManager.buildSnapshot(SnapshotManager.java:89)
    at org.apache.kylin.cube.CubeManager.buildSnapshotTable(CubeManager.java:208)
    at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:59)
    at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
    at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(CreateDictionaryJob.java:56)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.kylin.engine.mr.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:60)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
    at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
    at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:113)
    at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:124)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
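Reading the two traces together, the "Caused by" chain (HiveTable.getSignature -> FileTable.getSizeAndLastModified -> HadoopUtil.getFileSystem/makeURI/fixWindowsPath) suggests a possible explanation, which is only my assumption and not confirmed from the Kylin source: a Hive view has no physical HDFS location, so the table path handed down to HadoopUtil is null, and the first string operation on it throws the NPE. A minimal stand-in sketch (class and method names are hypothetical analogues, not the real Kylin code):

```java
// Hypothetical stand-in for the failing call chain in the trace above.
// Assumption: a Hive view exposes no warehouse location, so the path
// that reaches the HadoopUtil-style helper is null.
public class NullPathDemo {

    // Simplified analogue of HadoopUtil.fixWindowsPath: any method call
    // on a null path reference throws NullPointerException.
    static String fixWindowsPath(String path) {
        return path.replace('\\', '/'); // NPE here when path == null
    }

    // Simplified analogue of the getSignature path: reports whether
    // signature computation survives for a given table location.
    static String signatureOf(String tableLocation) {
        try {
            fixWindowsPath(tableLocation);
            return "ok";
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        // A managed table has a warehouse path; a view has none (null).
        System.out.println(signatureOf("/apps/hive/warehouse/lookup_table"));
        System.out.println(signatureOf(null));
    }
}
```

This would also explain why a view works as the FACT table (the fact table is read through an intermediate Hive job, not snapshotted) but fails as a LOOKUP table, where Kylin builds a snapshot directly from the table's files.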
Importance of resolution of this bug: I believe the ability to use Hive views
is essential and should work for both FACT tables/views and LOOKUP/DIMENSION
tables/views. First, it enables more efficient OLAP staging by not forcing
users to materialize temporary views used only for OLAP modelling. Second, and
most importantly, views with left outer and inner joins can help overcome
Kylin's current limitation of not being able to support snowflake-like models.
Further info - reproduce scenario: I created 2 Hive tables (FACT and LOOKUP)
using CREATE TABLE AS SELECT … statements in Hive. Then I created 2 Hive views
using CREATE VIEW AS SELECT … with exactly the same SQL statements (using left
outer and inner joins). All 4 tables/views correctly produce data in Hive. If
I build a simple cube in Kylin using the 2 tables, everything works fine. If I
use the Hive view for the FACT table and the regular Hive table for the LOOKUP,
it also works. The error occurs if I try to use the Hive view as the LOOKUP
table and define a dimension from it.
Just for completeness - below is the full Hive QL for creating the
tables/views. It operates on tables created by the internal tests run while
building Kylin 1.2, but I believe the exact view definition should not matter:
I did a couple of additional tests creating very simple views with only
"SELECT * FROM one_table", and the error was still the same.
Hive Query Editor : Create Fact as View
CREATE VIEW view_fact_tab_enhanced
AS SELECT *
FROM DEFAULT.TEST_KYLIN_FACT AS TEST_KYLIN_FACT LEFT OUTER JOIN
EDW.TEST_SELLER_TYPE_DIM AS TEST_SELLER_TYPE_DIM
ON TEST_KYLIN_FACT.SLR_SEGMENT_CD = TEST_SELLER_TYPE_DIM.SELLER_TYPE_CD
Hive Query Editor : Create Fact as Table
CREATE TABLE table_fact_tab_enhanced
AS SELECT *
FROM DEFAULT.TEST_KYLIN_FACT AS TEST_KYLIN_FACT LEFT OUTER JOIN
EDW.TEST_SELLER_TYPE_DIM AS TEST_SELLER_TYPE_DIM
ON TEST_KYLIN_FACT.SLR_SEGMENT_CD = TEST_SELLER_TYPE_DIM.SELLER_TYPE_CD
Hive Query Editor : Create LOOKUP as View
CREATE VIEW view_dim_tab_self_join AS
SELECT distinct A.leaf_categ_id, B.leaf_categ_name, A.site_id
FROM DEFAULT.TEST_CATEGORY_GROUPINGS AS A LEFT OUTER JOIN
DEFAULT.TEST_CATEGORY_GROUPINGS AS B
ON A.leaf_categ_id = B.leaf_categ_id AND A.site_id = B.site_id ORDER BY
A.leaf_categ_id, B.leaf_categ_name, A.site_id
Hive Query Editor : Create LOOKUP as Table
CREATE TABLE table_dim_tab_self_join AS
SELECT distinct A.leaf_categ_id, B.leaf_categ_name, A.site_id
FROM DEFAULT.TEST_CATEGORY_GROUPINGS AS A LEFT OUTER JOIN
DEFAULT.TEST_CATEGORY_GROUPINGS AS B
ON A.leaf_categ_id = B.leaf_categ_id AND A.site_id = B.site_id ORDER BY
A.leaf_categ_id, B.leaf_categ_name, A.site_id
Additional curiosity - release 1.2 seems to handle the final size of the
generated cube better. Assuming I made no mistake in the cube definitions (I
used the same ones in both the 1.2 release and the 2.1-SNAPSHOT codeline), I
get a smaller cube in release 1.2 than in 2.1-SNAPSHOT, even though the number
of rows is exactly the same in both cases.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)