This is a QA environment where I fully control the Hive tables. I loaded
this dataset manually and haven't made any changes since. The cube itself
was created yesterday, so no, I don't see any Hive update after the cube
build.

I haven't had a chance to modify the Hive table to use BIGINT yet, but I
will when I have time.
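
For reference, the change I have in mind is something like the following
sketch (FACT_TABLE and MY_COUNTER are placeholders taken from the query
quoted below, not the real names):

-- widen the counter column from INT to BIGINT (placeholder names)
alter table FACT_TABLE change column MY_COUNTER MY_COUNTER bigint;

After widening the column I would refresh the cube so the new type is
picked up.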

On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <[email protected]> wrote:

> Integer overflow is possible. I do remember some FIXME there.
>
> An incorrect SUM is nearly impossible: a regression test reconciles SUM
> results with an H2 database every day. Maybe the Hive table was updated
> after the cube build? Try refreshing the cube to ensure it captures the
> latest Hive data.
>
>
>
> On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <[email protected]> wrote:
>
> > I'll try to reproduce the issue in my own environment
> >
> > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <[email protected]> wrote:
> >
> > > That's fine. Please change your Hive schema first to make sure the
> > > types are the same, then try again.
> > >
> > > Thanks.
> > >
> > >
> > >
> > > Best Regards!
> > > ---------------------
> > >
> > > Luke Han
> > >
> > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <[email protected]> wrote:
> > >
> > > > I am using 0.7.2.
> > > >
> > > > The Hive column is INT. Does that mean I should modify the existing
> > > > Hive table to BIGINT as well, so that the Kylin count is a BIGINT?
> > > >
> > > > I probably won't be allowed to share this kind of data outside my
> > > > organization, unfortunately.
> > > >
> > > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <[email protected]> wrote:
> > > >
> > > > > Besides what Luke asked, can you also share which version you are
> > > > > using? We'll look into this.
> > > > >
> > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <[email protected]> wrote:
> > > > >
> > > > > > What's the data type of that column in your Hive table?
> > > > > > We recommend using the same type as in your measure definition.
> > > > > >
> > > > > > For data quality, we have an approach to ensure it matches the
> > > > > > source data 100%: there's an H2 comparison in our test cases.
> > > > > > But yes, let's double-check the test cases for huge numbers like
> > > > > > yours.
> > > > > >
> > > > > > Is it possible to share your data and model with us, if they are
> > > > > > not sensitive? It would be better to use them for verification
> > > > > > and testing here.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > > Best Regards!
> > > > > > ---------------------
> > > > > >
> > > > > > Luke Han
> > > > > >
> > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <[email protected]> wrote:
> > > > > >
> > > > > > > I have a simple cube with a Fact Table and a Lookup Table. The
> > > > > > > Fact Table includes a counter. The Lookup Table includes a
> > > > > > > region property, which is the highest level of the hierarchy
> > > > > > > and can therefore accumulate quite a lot of data when summing
> > > > > > > the counter by region.
> > > > > > >
> > > > > > > I am doing a query like this:
> > > > > > >
> > > > > > > select sum(MY_COUNTER) as tot_count, REGION
> > > > > > > from FACT_TABLE
> > > > > > > join LOOKUP_TABLE
> > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > > > > > where date = '2015-01-01'
> > > > > > > group by REGION
> > > > > > >
> > > > > > >
> > > > > > > I get the following result:
> > > > > > >
> > > > > > > -917,164,421 EMEA  --> negative result!
> > > > > > >  777,795,808 AMER
> > > > > > >  514,879,134 APAC
> > > > > > >  117,814,485 LATAM
> > > > > > >
> > > > > > > I ran the exact same query in Hive and got the following result:
> > > > > > >
> > > > > > > 3,381,591,228 EMEA  --> big number, but smaller than the BIGINT limit
> > > > > > >   778,718,563 AMER  --> 922,755 difference from the Kylin result
> > > > > > >   520,253,610 APAC  --> 5,374,476 difference from the Kylin result
> > > > > > >   117,913,857 LATAM --> 99,372 difference from the Kylin result
> > > > > > >
> > > > > > > Based on that, it seems the limit is the INT limit 2^31-1
> > > > > > > (2,147,483,647) and that my negative result comes from an
> > > > > > > overflow.
> > > > > > >
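> > > > > > > As a quick sanity check of the wrap-around arithmetic (my own
> > > > > > > back-of-the-envelope, assuming the sum is truncated to a signed
> > > > > > > 32-bit int): -917,164,421 + 2^32 = 3,377,802,875, which is
> > > > > > > within about 3.8 million of the 3,381,591,228 Hive reports, in
> > > > > > > line with the discrepancies on the other regions.
> > > > > > >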
> > > > > > > In my cube I defined this measure as a bigint:
> > > > > > >
> > > > > > > Expression    Param Type    Param Value    Return Type
> > > > > > > SUM           column        AVAIL_COUNT    bigint
> > > > > > >
> > > > > > > The other thing that worries me a bit is that the other numbers
> > > > > > > are not 100% accurate either (up to a 5,374,476 difference!).
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > *Bin Mahone | 马洪宾*
> > > > > Apache Kylin: http://kylin.io
> > > > > Github: https://github.com/binmahone
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>
