I handled several such cases in the eBay deployment, and usually it is type overflow. After changing the type in Hive (INT -> BIGINT, or FLOAT -> DOUBLE), you also need to sync the table schema to Kylin again. Just give it a try and let us know whether it solves your issue.
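For example, the change in HiveQL could look like the sketch below. I'm assuming the fact table and counter column are really named FACT_TABLE and MY_COUNTER as in the query quoted below, so adjust to your actual names; on newer Hive versions a partitioned table may also need the CASCADE keyword to push the change down to existing partitions.

    -- Widen the measure column so it is read as a 64-bit integer
    ALTER TABLE FACT_TABLE CHANGE COLUMN MY_COUNTER MY_COUNTER BIGINT;

    -- Analogous change for a floating-point measure (MY_RATIO is a made-up column name)
    -- ALTER TABLE FACT_TABLE CHANGE COLUMN MY_RATIO MY_RATIO DOUBLE;

After the ALTER TABLE, sync the table schema to Kylin again and refresh (rebuild) the affected cube segments so the new type is actually picked up.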
@yang, I do think Kylin should always use BIGINT or DOUBLE for the measure columns, instead of using the definition in Hive. Was there any concern on this?

On 8/6/15, 5:23 PM, "alex schufo" <[email protected]> wrote:

>This is a QA environment where I fully control the Hive tables and I loaded
>this dataset manually and didn't make any change since. The cube itself has
>been created yesterday so no I don't see any Hive update after the cube build.
>
>I didn't have the chance to modify the Hive table to use BIGINT as well but
>will do when I have time.
>
>On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <[email protected]> wrote:
>
>> Integer overflow is possible. I do remember some FIXME there.
>>
>> Incorrect SUM is quite impossible. Regression test reconciles SUM result
>> with H2 database every day. Maybe the hive table is updated after cube
>> build? Try refresh cube to ensure it captures latest hive data.
>>
>> On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <[email protected]> wrote:
>>
>> > I'll try to reproduce the issue in my own environment
>> >
>> > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <[email protected]> wrote:
>> >
>> > > That's fine, please change your hive schema first to make sure they
>> > > are same, then try again.
>> > >
>> > > Thanks.
>> > >
>> > > Best Regards!
>> > > ---------------------
>> > >
>> > > Luke Han
>> > >
>> > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <[email protected]> wrote:
>> > >
>> > > > I am using 0.7.2.
>> > > >
>> > > > The Hive column is int, does that mean I should try to modify the
>> > > > existing Hive table to move to Bigint as well to have the Kylin
>> > > > count as a Bigint?
>> > > >
>> > > > I probably won't be allowed to share this kind of data outside my
>> > > > organization unfortunately.
>> > > >
>> > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <[email protected]> wrote:
>> > > >
>> > > > > except for what Luke asked, can you also share which version are
>> > > > > you using? We'll look into this
>> > > > >
>> > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <[email protected]> wrote:
>> > > > >
>> > > > > > What's the data type in your hive table of that column?
>> > > > > > We recommend to use same as your measure definition.
>> > > > > >
>> > > > > > For data quality, we have approach to ensure it will 100%
>> > > > > > match with source data, there's H2 comparison in our testing
>> > > > > > case. But yes, let's double check the test case about huge
>> > > > > > number like your case
>> > > > > >
>> > > > > > Is it possible to share your data and model to us if not
>> > > > > > sensitive? It will be better to using that to do the verify
>> > > > > > and testing here.
>> > > > > >
>> > > > > > Thanks.
>> > > > > >
>> > > > > > Best Regards!
>> > > > > > ---------------------
>> > > > > >
>> > > > > > Luke Han
>> > > > > >
>> > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <[email protected]> wrote:
>> > > > > >
>> > > > > > > I have a simple cube with a Fact Table and a Lookup Table.
>> > > > > > > The Fact Table includes some counter. The Lookup Table
>> > > > > > > includes a region property which is the highest level of the
>> > > > > > > hierarchy and therefore can accumulate quite a lot of data
>> > > > > > > when summing the counter by region.
>> > > > > > >
>> > > > > > > I am doing a query like this:
>> > > > > > >
>> > > > > > > select sum(MY_COUNTER) as tot_count, REGION
>> > > > > > > from FACT_TABLE
>> > > > > > > join LOOKUP_TABLE
>> > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
>> > > > > > > where date = '2015-01-01'
>> > > > > > > group by REGION
>> > > > > > >
>> > > > > > > I get the following result:
>> > > > > > >
>> > > > > > > -917,164,421   EMEA    --> negative result!
>> > > > > > > 777,795,808    AMER
>> > > > > > > 514,879,134    APAC
>> > > > > > > 117,814,485    LATAM
>> > > > > > >
>> > > > > > > I ran the exact same query in Hive and got the following result:
>> > > > > > >
>> > > > > > > 3,381,591,228  EMEA    --> big number but smaller than the BIGINT limit
>> > > > > > > 778,718,563    AMER    --> 922,755 difference with the Kylin result
>> > > > > > > 520,253,610    APAC    --> 5,374,476 difference with the Kylin result
>> > > > > > > 117,913,857    LATAM   --> 99,372 difference with the Kylin result
>> > > > > > >
>> > > > > > > Based on that it seems that the limit is the int limit 2^31-1
>> > > > > > > (2,147,483,647) and that my negative result comes from an overflow.
>> > > > > > >
>> > > > > > > In my cube I defined this measure as a bigint:
>> > > > > > >
>> > > > > > > Expression   Param Type   Param Value   Return Type
>> > > > > > > SUM          column       AVAIL_COUNT   bigint
>> > > > > > >
>> > > > > > > The other thing that worries me a bit is that the other
>> > > > > > > numbers are not 100% accurate either (>100,000,000 difference!)
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Regards,
>> > > > >
>> > > > > *Bin Mahone | 马洪宾*
>> > > > > Apache Kylin: http://kylin.io
>> > > > > Github: https://github.com/binmahone
>> > > >
>> > >
>> >
>> > --
>> > Regards,
>> >
>> > *Bin Mahone | 马洪宾*
>> > Apache Kylin: http://kylin.io
>> > Github: https://github.com/binmahone
>>
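As a quick sanity check on the overflow explanation, using the figures quoted above: this is illustrative arithmetic only, and it assumes the EMEA sum wrapped around a 32-bit signed int exactly once, which is my guess rather than anything confirmed in the Kylin code.

    -- 2^31 - 1 = 2,147,483,647 is the INT limit; 2^32 = 4,294,967,296 is one full wrap.
    -- Undoing one wrap of the Kylin EMEA figure:
    --   -917,164,421 + 4,294,967,296 = 3,377,802,875
    -- which is within a few million of the Hive total of 3,381,591,228 for EMEA,
    -- i.e. the same order of discrepancy reported for the other regions.
    SELECT -917164421 + 4294967296 AS unwrapped_emea_sum;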
