I am trying to create a simple cube with a fact table and 3 dimensions.

I have read the various slide decks and wiki pages, but I found the
documentation is not very specific about how to manage hierarchies.

Let's take this simple example:

Fact table: productID, storeID, logDate, numbOfSell, etc.

Date lookup table: logDate, week, month, quarter, etc.

I specified a left join on logDate. When doing this, I find it unclear
which table is considered the left table and which the right table. I
assumed the fact table was the left table and the lookup table the right
table, but looking at it now I think that might be a mistake (I am only
interested in dates for which there are rows in the fact table).
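For reference, here is roughly how I declared the join in the model
descriptor. The table names are placeholders for my actual Hive tables,
and the field names are based on my reading of the sample descriptors
shipped with Kylin, so treat this as a sketch rather than a verified
config:

```json
{
  "fact_table": "DEFAULT.FACT_TABLE",
  "lookups": [
    {
      "table": "DEFAULT.DATE_TABLE",
      "join": {
        "type": "left",
        "primary_key": ["LOG_DATE"],
        "foreign_key": ["LOG_DATE"]
      }
    }
  ]
}
```

My assumption was that the fact table is the left side here, so the left
join would keep every fact row whether or not the lookup has a match.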

If I use the auto-generator it creates a derived dimension, which I don't
think is what I need.

So I created a hierarchy, but again it is not clearly indicated whether I
should create ["quarter", "month", "week", "log_date"] or ["logDate",
"week", "month", "quarter"].

Also, should I include log_date in the hierarchy? Intuitively I left it
out, since it is already the join column, but then the cube was created
without it and I cannot query by date: Kylin says "log_date" is not found
in the date table (it exists in the Hive table but not in the built
cube). If I do include it in the hierarchy, the cube build fails with
this error:

java.lang.NullPointerException: Column DEFAULT.DATE_TABLE.LOG_DATE does not exist in row key desc
        at org.apache.kylin.cube.model.RowKeyDesc.getColDesc(RowKeyDesc.java:158)
        at org.apache.kylin.cube.model.RowKeyDesc.getDictionary(RowKeyDesc.java:152)
        at org.apache.kylin.cube.model.RowKeyDesc.isUseDictionary(RowKeyDesc.java:163)
        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:51)
        at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
        at org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:53)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:132)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

result code:2
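From the message, it looks like every dimension column may need to appear
in the "rowkey" section of the cube descriptor. If I understand the
sample descriptors correctly, that would mean something like the
following (field names copied from the samples, so this is a guess on my
part):

```json
"rowkey": {
  "rowkey_columns": [
    { "column": "QUARTER", "length": 0, "dictionary": "true", "mandatory": false },
    { "column": "MONTH", "length": 0, "dictionary": "true", "mandatory": false },
    { "column": "WEEK", "length": 0, "dictionary": "true", "mandatory": false },
    { "column": "LOG_DATE", "length": 0, "dictionary": "true", "mandatory": false }
  ]
}
```

But I don't know whether the web UI is supposed to generate this
automatically when a column is added to a hierarchy.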


I think it would be useful to improve the documentation to explain this
more clearly, beyond just the basic steps: building a cube even over a
short time range takes a while, so learning by trial and error is very
time-consuming.

Same question for derived dimensions: should I include ["storeID",
"storeName"] or just ["storeName"]? The second option seems to work for me.

Thanks
