So with 0.7.2 the cube builds, and I can see some improvement:
"select * from SAMPLE_DIM" now returns all the fields, i.e.:
dim1, dim2, dim3, etc., SAMPLE_ID
and I can see all the values for each field.
However, the join between the fact table and the lookup table still does not
work; it returns:
Can't find any realization.
And if I do "select SAMPLE_ID from SAMPLE_DIM group by SAMPLE_ID" it also
returns:
Can't find any realization.
If I do "select SAMPLE_ID from FACT_TABLE group by SAMPLE_ID" then I get
the list of all SAMPLE_ID as expected.
If I do "select dim1 from SAMPLE_DIM group by dim1" I also get the list of
all dim1 as expected.
The same exact query works perfectly on Hive (although it takes a long time
to be processed of course).
Am I doing something wrong?
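
For reference, here is a minimal sketch of the query shapes described above, run against an in-memory SQLite database as a stand-in for Hive. The rows, the `measure` column, and the exact join query are assumptions made for illustration (the join query itself is not quoted in this message). It only shows what a plain SQL engine returns for these queries, not how Kylin resolves them.

```python
import sqlite3

# In-memory stand-in for the Hive tables; 'measure' and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FACT_TABLE (SAMPLE_ID INTEGER, measure INTEGER);
    CREATE TABLE SAMPLE_DIM (SAMPLE_ID INTEGER, dim1 TEXT, dim2 TEXT, dim3 TEXT);
    INSERT INTO FACT_TABLE VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO SAMPLE_DIM VALUES (1, 'a', 'x', 'p'), (2, 'b', 'y', 'q');
""")


def run(sql):
    """Execute one query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# The three single-table queries from the message (ORDER BY added for stable output).
ids_from_dim = run("select SAMPLE_ID from SAMPLE_DIM group by SAMPLE_ID order by SAMPLE_ID")
ids_from_fact = run("select SAMPLE_ID from FACT_TABLE group by SAMPLE_ID order by SAMPLE_ID")
dim1_values = run("select dim1 from SAMPLE_DIM group by dim1 order by dim1")

# A hypothetical fact-to-lookup join on SAMPLE_ID, aggregated by a lookup column.
joined = run("""
    select d.dim1, sum(f.measure)
    from FACT_TABLE f
    left join SAMPLE_DIM d on f.SAMPLE_ID = d.SAMPLE_ID
    group by d.dim1
    order by d.dim1
""")

print(ids_from_dim, ids_from_fact, dim1_values, joined)
```

On a plain SQL engine all four queries return rows, which matches what I see on Hive.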
On Wed, Jul 29, 2015 at 1:35 PM, alex schufo <[email protected]> wrote:
> Ok I guess this is https://issues.apache.org/jira/browse/KYLIN-831, right?
>
> I upgraded today to 0.7.2 and hope it solves the problem then.
>
> Regards
>
> On Tue, Jul 28, 2015 at 5:52 PM, alex schufo <[email protected]> wrote:
>
>> I still don't understand this.
>>
>> I have a simple fact table and a simple SAMPLE_DIM lookup table. They are
>> joined on SAMPLE_ID.
>>
>> If I do as you say and include all the columns of SAMPLE_DIM as a
>> hierarchy, without including SAMPLE_ID, then the cube builds
>> successfully but I cannot query with the hierarchy. Any join results in
>> this error:
>>
>> Column 'SAMPLE_ID' not found in table 'SAMPLE_DIM'
>>
>> Indeed if I do a select * from 'SAMPLE_DIM' I can see all the hierarchy
>> but not the SAMPLE_ID used to join with the fact table.
>>
>> If I include the SAMPLE_ID in the hierarchy definition then the cube
>> build fails on step 3 with:
>>
>> java.lang.NullPointerException: Column DEFAULT.FACT_TABLE.SAMPLE_ID does not exist in row key desc
>> at org.apache.kylin.cube.model.RowKeyDesc.getColDesc(RowKeyDesc.java:158)
>> at org.apache.kylin.cube.model.RowKeyDesc.getDictionary(RowKeyDesc.java:152)
>> at org.apache.kylin.cube.model.RowKeyDesc.isUseDictionary(RowKeyDesc.java:163)
>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:51)
>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
>> at org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:53)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>> at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
>> at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
>> at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:132)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>> (the SAMPLE_ID *does* exist in the FACT_TABLE)
>>
>> The only scenario in which I could make it work is when I also create a
>> derived dimension SAMPLE_ID / something else; then somehow the SAMPLE_ID
>> is included and can be queried.
>>
>> Any help with that?
>>
>>
>> On Fri, Jun 19, 2015 at 1:37 PM, alex schufo <[email protected]>
>> wrote:
>>
>>> Thanks for the answer,
>>>
>>> Indeed I had a look at these slides before and it's great to understand
>>> the high level concepts but I ended up spending quite some time when
>>> designing my dimensions with the issues mentioned below.
>>>
>>> On Fri, Jun 19, 2015 at 11:23 AM, jason zhong <[email protected]>
>>> wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> We have a slide deck to help you understand how to build a cube. I don't
>>>> know whether you have read it? It will help you understand derived and
>>>> hierarchy dimensions.
>>>>
>>>> http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin
>>>>
>>>> For your case about hierarchy, log_date should not be included in the
>>>> hierarchy. This is a bug that you helped find; we will follow up on it.
>>>>
>>>> Also, more documentation and UI enhancements will be done to help users
>>>> build cubes easily.
>>>>
>>>> Thanks!!
>>>>
>>>> On Fri, Jun 12, 2015 at 5:07 PM, alex schufo <[email protected]>
>>>> wrote:
>>>>
>>>> > I am trying to create a simple cube with a fact table and 3
>>>> > dimensions.
>>>> >
>>>> > I have read the different slideshares and wiki pages, but I found
>>>> > that the documentation is not very specific on how to manage
>>>> > hierarchies.
>>>> >
>>>> > Let's take this simple example :
>>>> >
>>>> > Fact table: productID, storeID, logDate, numbOfSell, etc.
>>>> >
>>>> > Date lookup table : logDate, week, month, quarter, etc.
>>>> >
>>>> > I specified a left join on logDate. When I specify this, I find it
>>>> > not very clear which one is considered to be the left table and which
>>>> > one is considered to be the right table. I assumed the fact table was
>>>> > the left table and the lookup table the right table; looking at it now,
>>>> > I think that might be a mistake (I am just interested in dates for
>>>> > which there are results in the fact table).
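
A small sketch of the left/right question, using an in-memory SQLite database rather than Kylin. It assumes the fact table is written on the left of the join, as the paragraph above assumed; the table names `fact` and `date_lookup` and all rows are made up for illustration. With the fact table on the left, a left join keeps every fact row, and lookup dates with no fact rows do not appear in the result. With the lookup table on the left, those empty dates show up with a NULL total.

```python
import sqlite3

# Hypothetical tables modelled on the example in this message.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (productID INTEGER, logDate TEXT, numbOfSell INTEGER);
    CREATE TABLE date_lookup (logDate TEXT, month TEXT);
    INSERT INTO fact VALUES
        (1, '2015-06-01', 3), (2, '2015-06-01', 4), (1, '2015-07-15', 2);
    INSERT INTO date_lookup VALUES
        ('2015-06-01', '2015-06'), ('2015-06-02', '2015-06'), ('2015-07-15', '2015-07');
""")

# Fact table on the left: only dates that occur in the fact table are returned.
fact_left = conn.execute("""
    SELECT d.logDate, SUM(f.numbOfSell)
    FROM fact f LEFT JOIN date_lookup d ON f.logDate = d.logDate
    GROUP BY d.logDate ORDER BY d.logDate
""").fetchall()

# Lookup table on the left: every lookup date is returned, even with no sales.
lookup_left = conn.execute("""
    SELECT d.logDate, SUM(f.numbOfSell)
    FROM date_lookup d LEFT JOIN fact f ON f.logDate = d.logDate
    GROUP BY d.logDate ORDER BY d.logDate
""").fetchall()

print(fact_left)
print(lookup_left)
```

Under these assumptions, treating the fact table as the left table already restricts the result to dates that have fact rows.
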
>>>> >
>>>> > If I use the auto generator it creates a derived dimension; I don't
>>>> > think that's what I need.
>>>> >
>>>> > So I created a hierarchy, but again, to me it's not clearly indicated if I
>>>> > should create ["quarter", "month", "week", "log_date"] or ["logDate",
>>>> > "week", "month", "quarter"]?
>>>> >
>>>> > Also, should I include log_date in the hierarchy? To me it was more
>>>> > intuitive not to include it because it's already used for the join, but
>>>> > it created the cube without it and I cannot query by date: it says that
>>>> > "log_date" is not found in the date table (it is in the Hive table but
>>>> > not in the built cube). If I include it in the hierarchy, the cube
>>>> > build fails with this error:
>>>> >
>>>> > java.lang.NullPointerException: Column DEFAULT.DATE_TABLE.LOG_DATE does not exist in row key desc
>>>> > at org.apache.kylin.cube.model.RowKeyDesc.getColDesc(RowKeyDesc.java:158)
>>>> > at org.apache.kylin.cube.model.RowKeyDesc.getDictionary(RowKeyDesc.java:152)
>>>> > at org.apache.kylin.cube.model.RowKeyDesc.isUseDictionary(RowKeyDesc.java:163)
>>>> > at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:51)
>>>> > at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
>>>> > at org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:53)
>>>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>> > at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
>>>> > at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
>>>> > at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
>>>> > at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
>>>> > at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:132)
>>>> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> > at java.lang.Thread.run(Thread.java:744)
>>>> >
>>>> > result code:2
>>>> >
>>>> >
>>>> > I think it might be useful to improve the documentation to explain
>>>> > this more clearly, and not just the basic steps, because building a
>>>> > cube even on short time ranges takes some time, so learning by trial
>>>> > and error is very time consuming.
>>>> >
>>>> > Same thing for the derived dimensions: should I include ["storeID",
>>>> > "storeName"] or just ["storeName"]? The second option seems to work
>>>> > for me.
>>>> >
>>>> > Thanks
>>>> >
>>>>
>>>
>>>
>>
>