> Not possible to export and share I'm afraid as it's sensitive information.

Understood. Anyway, we will try to reproduce this with big sums in our local
test data. Also keep an eye out for other reports of incorrect sums.

You may also try count(*) to compare Kylin and Hive. If that count is also
different, then most likely the two data sets are different.
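
For example, the same count can be run in both engines (a minimal sketch
that borrows the anonymized table and column names from the query quoted
below; adjust to your schema):

-- run this query in both Kylin and Hive;
-- if the per-region row counts differ, the cube and the Hive table
-- were not built from the same data set
select count(*) as row_cnt, REGION
from FACT_TABLE
join LOOKUP_TABLE
on FACT_TABLE.FK = LOOKUP_TABLE.PK
where date = '2015-01-01'
group by REGION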

On Thu, Aug 6, 2015 at 6:14 PM, alex schufo <[email protected]> wrote:

> The test table is in the billions of records and hundreds of GB. For this
> test query, however, I am restricting it to only 1 day of data, which
> makes it some millions of records.
>
> Not possible to export and share I'm afraid as it's sensitive information.
>
> On Thu, Aug 6, 2015 at 10:42 AM, Li Yang <[email protected]> wrote:
>
> > Interesting... I'm thinking about how to reproduce your problem so we
> > can help.
> >
> > How big is the test Hive table? Is it possible to export and share it
> > with us?
> >
> > On Thu, Aug 6, 2015 at 5:23 PM, alex schufo <[email protected]> wrote:
> >
> > > This is a QA environment where I fully control the Hive tables. I
> > > loaded this dataset manually and haven't made any changes since. The
> > > cube itself was created yesterday, so no, I don't see any Hive update
> > > after the cube build.
> > >
> > > I haven't had the chance to modify the Hive table to use BIGINT as
> > > well, but I will when I have time.
> > >
> > > On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <[email protected]> wrote:
> > >
> > > > Integer overflow is possible. I do remember some FIXME there.
> > > >
> > > > An incorrect SUM, however, is quite unlikely: a regression test
> > > > reconciles SUM results with an H2 database every day. Maybe the Hive
> > > > table was updated after the cube build? Try refreshing the cube to
> > > > ensure it captures the latest Hive data.
> > > >
> > > > On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <[email protected]> wrote:
> > > >
> > > > > I'll try to reproduce the issue in my own environment.
> > > > >
> > > > > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <[email protected]> wrote:
> > > > >
> > > > > > That's fine. Please change your Hive schema first to make sure
> > > > > > they are the same, then try again.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Best Regards!
> > > > > > ---------------------
> > > > > >
> > > > > > Luke Han
> > > > > >
> > > > > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <[email protected]> wrote:
> > > > > >
> > > > > > > I am using 0.7.2.
> > > > > > >
> > > > > > > The Hive column is int. Does that mean I should modify the
> > > > > > > existing Hive table to move to BIGINT as well, to have the
> > > > > > > Kylin sum computed as a bigint?
> > > > > > >
> > > > > > > I probably won't be allowed to share this kind of data outside
> > > > > > > my organization, unfortunately.
> > > > > > >
> > > > > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <[email protected]> wrote:
> > > > > > >
> > > > > > > > Apart from what Luke asked, can you also share which version
> > > > > > > > you are using? We'll look into this.
> > > > > > > >
> > > > > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > What's the data type of that column in your Hive table?
> > > > > > > > > We recommend using the same type as your measure
> > > > > > > > > definition.
> > > > > > > > >
> > > > > > > > > For data quality, we have an approach to ensure results
> > > > > > > > > 100% match the source data; there's an H2 comparison in
> > > > > > > > > our test cases. But yes, let's double-check the test case
> > > > > > > > > for huge numbers like yours.
> > > > > > > > >
> > > > > > > > > Is it possible to share your data and model with us if
> > > > > > > > > they are not sensitive? It would be better to use them for
> > > > > > > > > the verification and testing here.
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > >
> > > > > > > > > Best Regards!
> > > > > > > > > ---------------------
> > > > > > > > >
> > > > > > > > > Luke Han
> > > > > > > > >
> > > > > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > I have a simple cube with a Fact Table and a Lookup
> > > > > > > > > > Table. The Fact Table includes a counter column. The
> > > > > > > > > > Lookup Table includes a region property, which is the
> > > > > > > > > > highest level of the hierarchy and can therefore
> > > > > > > > > > accumulate quite a lot of data when summing the counter
> > > > > > > > > > by region.
> > > > > > > > > >
> > > > > > > > > > I am doing a query like this:
> > > > > > > > > >
> > > > > > > > > > select sum(MY_COUNTER) as tot_count, REGION
> > > > > > > > > > from FACT_TABLE
> > > > > > > > > > join LOOKUP_TABLE
> > > > > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > > > > > > > > where date = '2015-01-01'
> > > > > > > > > > group by REGION
> > > > > > > > > >
> > > > > > > > > > I get the following result:
> > > > > > > > > >
> > > > > > > > > > -917,164,421  EMEA   --> negative result!
> > > > > > > > > >  777,795,808  AMER
> > > > > > > > > >  514,879,134  APAC
> > > > > > > > > >  117,814,485  LATAM
> > > > > > > > > >
> > > > > > > > > > I ran the exact same query in Hive and got the following
> > > > > > > > > > result:
> > > > > > > > > >
> > > > > > > > > > 3,381,591,228  EMEA   --> big number, but smaller than the BIGINT limit
> > > > > > > > > >   778,718,563  AMER   --> 922,755 difference from the Kylin result
> > > > > > > > > >   520,253,610  APAC   --> 5,374,476 difference from the Kylin result
> > > > > > > > > >   117,913,857  LATAM  --> 99,372 difference from the Kylin result
> > > > > > > > > >
> > > > > > > > > > Based on that, it seems that the limit is the int limit
> > > > > > > > > > 2^31-1 (2,147,483,647) and that my negative result comes
> > > > > > > > > > from an overflow.
> > > > > > > > > >
> > > > > > > > > > In my cube I defined this measure as a bigint:
> > > > > > > > > >
> > > > > > > > > > Expression  Param Type  Param Value  Return Type
> > > > > > > > > > SUM         column      AVAIL_COUNT  bigint
> > > > > > > > > >
> > > > > > > > > > The other thing that worries me a bit is that the other
> > > > > > > > > > numbers are not 100% accurate either (up to a 5,374,476
> > > > > > > > > > difference!)
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > *Bin Mahone | 马洪宾*
> > > > > > > > Apache Kylin: http://kylin.io
> > > > > > > > Github: https://github.com/binmahone
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > *Bin Mahone | 马洪宾*
> > > > > Apache Kylin: http://kylin.io
> > > > > Github: https://github.com/binmahone
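
For what it's worth, the numbers in the quoted report are consistent with
the overflow theory. A sketch of the arithmetic, assuming the sum wraps
around as a 32-bit signed (two's-complement) int:

2^31 - 1 = 2,147,483,647    (max 32-bit signed int)
2^32     = 4,294,967,296    (wraparound modulus)

3,381,591,228 - 4,294,967,296 = -913,376,068

That is within about 3.8 million of the reported -917,164,421, and a
residual of that size is in line with the 0.1-5.4 million per-region
differences above. In other words, a 32-bit wraparound on top of a slightly
different data set would account for the EMEA figure.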

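As for the schema change discussed above (aligning the int Hive column with
the measure's bigint return type), a minimal HiveQL sketch, assuming the
anonymized names from the thread:

-- int to bigint is a widening, metadata-only change in Hive;
-- existing values should still read correctly, but verify on a
-- QA copy of the table first
alter table FACT_TABLE change column AVAIL_COUNT AVAIL_COUNT bigint;

After the change, the table would presumably need to be re-synced in Kylin
and the cube rebuilt before the new type takes effect.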