I handled several such cases in the eBay deployment, and usually it is type overflow. After changing the type in Hive (INT -> BIGINT, or FLOAT -> DOUBLE), you also need to sync the table schema to Kylin again. Just give it a try and let us know whether it solves your issue.
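For example, the change in HiveQL could look like the sketch below. I'm assuming the fact table and counter column are really named FACT_TABLE and MY_COUNTER as in the query quoted below, so adjust to your actual names; on newer Hive versions a partitioned table may also need the CASCADE keyword to push the change down to existing partitions.

    -- Widen the measure column so it is read as a 64-bit integer
    ALTER TABLE FACT_TABLE CHANGE COLUMN MY_COUNTER MY_COUNTER BIGINT;

    -- Analogous change for a floating-point measure (MY_RATIO is a made-up column name)
    -- ALTER TABLE FACT_TABLE CHANGE COLUMN MY_RATIO MY_RATIO DOUBLE;

After the ALTER TABLE, sync the table schema to Kylin again and refresh (rebuild) the affected cube segments so the new type is actually picked up.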
@yang, I do think Kylin should always use BIGINT or DOUBLE for the measure columns, instead of using the definition in Hive. Was there any concern on this?

On 8/6/15, 5:23 PM, "alex schufo" <[email protected]> wrote:

>This is a QA environment where I fully control the Hive tables and I loaded
>this dataset manually and didn't make any change since. The cube itself has
>been created yesterday so no I don't see any Hive update after the cube build.
>
>I didn't have the chance to modify the Hive table to use BIGINT as well but
>will do when I have time.
>
>On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <[email protected]> wrote:
>
>> Integer overflow is possible. I do remember some FIXME there.
>>
>> Incorrect SUM is quite impossible. Regression test reconciles SUM result
>> with H2 database every day. Maybe the hive table is updated after cube
>> build? Try refresh cube to ensure it captures latest hive data.
>>
>> On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <[email protected]> wrote:
>>
>> > I'll try to reproduce the issue in my own environment
>> >
>> > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <[email protected]> wrote:
>> >
>> > > That's fine, please change your hive schema first to make sure they
>> > > are same, then try again.
>> > >
>> > > Thanks.
>> > >
>> > > Best Regards!
>> > > ---------------------
>> > >
>> > > Luke Han
>> > >
>> > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <[email protected]> wrote:
>> > >
>> > > > I am using 0.7.2.
>> > > >
>> > > > The Hive column is int, does that mean I should try to modify the
>> > > > existing Hive table to move to Bigint as well to have the Kylin
>> > > > count as a Bigint?
>> > > >
>> > > > I probably won't be allowed to share this kind of data outside my
>> > > > organization unfortunately.
>> > > >
>> > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <[email protected]> wrote:
>> > > >
>> > > > > except for what Luke asked, can you also share which version are
>> > > > > you using? We'll look into this
>> > > > >
>> > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <[email protected]> wrote:
>> > > > >
>> > > > > > What's the data type in your hive table of that column?
>> > > > > > We recommend to use same as your measure definition.
>> > > > > >
>> > > > > > For data quality, we have approach to ensure it will 100%
>> > > > > > match with source data, there's H2 comparison in our testing
>> > > > > > case. But yes, let's double check the test case about huge
>> > > > > > number like your case
>> > > > > >
>> > > > > > Is it possible to share your data and model to us if not
>> > > > > > sensitive? It will be better to using that to do the verify
>> > > > > > and testing here.
>> > > > > >
>> > > > > > Thanks.
>> > > > > >
>> > > > > > Best Regards!
>> > > > > > ---------------------
>> > > > > >
>> > > > > > Luke Han
>> > > > > >
>> > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <[email protected]> wrote:
>> > > > > >
>> > > > > > > I have a simple cube with a Fact Table and a Lookup Table.
>> > > > > > > The Fact Table includes some counter. The Lookup Table
>> > > > > > > includes a region property which is the highest level of the
>> > > > > > > hierarchy and therefore can accumulate quite a lot of data
>> > > > > > > when summing the counter by region.
>> > > > > > >
>> > > > > > > I am doing a query like this:
>> > > > > > >
>> > > > > > > select sum(MY_COUNTER) as tot_count, REGION
>> > > > > > > from FACT_TABLE
>> > > > > > > join LOOKUP_TABLE
>> > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
>> > > > > > > where date = '2015-01-01'
>> > > > > > > group by REGION
>> > > > > > >
>> > > > > > > I get the following result:
>> > > > > > >
>> > > > > > > -917,164,421   EMEA    --> negative result!
>> > > > > > > 777,795,808    AMER
>> > > > > > > 514,879,134    APAC
>> > > > > > > 117,814,485    LATAM
>> > > > > > >
>> > > > > > > I ran the exact same query in Hive and got the following result:
>> > > > > > >
>> > > > > > > 3,381,591,228  EMEA    --> big number but smaller than the BIGINT limit
>> > > > > > > 778,718,563    AMER    --> 922,755 difference with the Kylin result
>> > > > > > > 520,253,610    APAC    --> 5,374,476 difference with the Kylin result
>> > > > > > > 117,913,857    LATAM   --> 99,372 difference with the Kylin result
>> > > > > > >
>> > > > > > > Based on that it seems that the limit is the int limit 2^31-1
>> > > > > > > (2,147,483,647) and that my negative result comes from an overflow.
>> > > > > > >
>> > > > > > > In my cube I defined this measure as a bigint:
>> > > > > > >
>> > > > > > > Expression   Param Type   Param Value   Return Type
>> > > > > > > SUM          column       AVAIL_COUNT   bigint
>> > > > > > >
>> > > > > > > The other thing that worries me a bit is that the other
>> > > > > > > numbers are not 100% accurate either (>100,000,000 difference!)
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Regards,
>> > > > >
>> > > > > *Bin Mahone | 马洪宾*
>> > > > > Apache Kylin: http://kylin.io
>> > > > > Github: https://github.com/binmahone
>> > > >
>> > >
>> >
>> > --
>> > Regards,
>> >
>> > *Bin Mahone | 马洪宾*
>> > Apache Kylin: http://kylin.io
>> > Github: https://github.com/binmahone
>>
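As a quick sanity check on the overflow explanation, using the figures quoted above: this is illustrative arithmetic only, and it assumes the EMEA sum wrapped around a 32-bit signed int exactly once, which is my guess rather than anything confirmed in the Kylin code.

    -- 2^31 - 1 = 2,147,483,647 is the INT limit; 2^32 = 4,294,967,296 is one full wrap.
    -- Undoing one wrap of the Kylin EMEA figure:
    --   -917,164,421 + 4,294,967,296 = 3,377,802,875
    -- which is within a few million of the Hive total of 3,381,591,228 for EMEA,
    -- i.e. the same order of discrepancy reported for the other regions.
    SELECT -917164421 + 4294967296 AS unwrapped_emea_sum;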
