Dominique,

Have you tested a degenerate case where (maybe dumb) the CATEGORY set is in
the 10K, 100K, 1M range ? and compared the speed/memory results.

- Jon


On Mon, Jan 23, 2012 at 10:35 AM, Dominique Prunier <
[email protected]> wrote:

> Hey,
>
> Now that ibis::relic::keys is extremely fast, i was wondering if applying
> the attached patch wouldn't be a good idea. It replaces the selectString
> from text column with a dictionary based version. In basic testings, it
> seems to be much faster that the .sp based resolution.
>
> Since i'm new to both c++ and Fastbit, please make sure that it is not
> breaking anything (especially about locking and these kind of things) :)
>
> Thanks,
>
> -----Original Message-----
> From: [email protected] [mailto:
> [email protected]] On Behalf Of Dominique Prunier
> Sent: Thursday, January 19, 2012 2:12 PM
> To: K. John Wu; FastBit Users
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
>
> Cool, it seems that this sort it at least as expansive as the one on
> values, so that would give quite a boost to queries !
>
> Thanks,
>
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]]
> Sent: Thursday, January 19, 2012 2:08 PM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
>
> Hi, Dominique,
>
> You are right that in the spirit of SQL standard, the function sort
> for ibis::bundle1 and ibis::bundles does not need to sort the row ids
> in each segment/bundle.  It was there mostly because the first
> application this code was used demanded this particular feature.  It
> should be safe to leave the row ids in what ever order they are.  I
> will see what happens if I put a conditional macro around the section
> of code..
>
> John
>
>
> On 1/19/12 10:04 AM, Dominique Prunier wrote:
> > Erf, finally impressions can be wrong sometimes. The counting sort is
> twice as fast as the qsort with -O0 but qsort is 50% faster with -O2...
> > Anyway, can you explain me why in bundle1::sort we have to sort the RIDs
> of each segment ?
> >
> > Thanks,
> >
> > -----Original Message-----
> > From: [email protected] [mailto:
> [email protected]] On Behalf Of Dominique Prunier
> > Sent: Wednesday, January 18, 2012 6:24 PM
> > To: FastBit Users
> > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
> >
> > Hi,
> >
> > I implemented a very simple and naive counting sort algorithm for
> category columns and it ended up being twice as fast as the regular quick
> sort.
> > There is probably some fine tunning to do (quick sort might be faster
> for smaller arrays/distinct values) but it look promising.
> >
> > Thanks,
> >
> > -----Original Message-----
> > From: [email protected] [mailto:
> [email protected]] On Behalf Of Dominique Prunier
> > Sent: Wednesday, January 18, 2012 1:27 PM
> > To: FastBit Users
> > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
> >
> > This is working just fine now !! The createBundle now reports a time of
> 0.4s instead of the previous 8.7s !
> > In the 0.4s, most of the time is now consumed by the sort (0.35s), which
> i'm going to work on a little bit for category columns.
> > If i have something interesting, i'll obviously share it with you.
> >
> > Thanks,
> >
> > -----Original Message-----
> > From: K. John Wu [mailto:[email protected]]
> > Sent: Wednesday, January 18, 2012 11:52 AM
> > To: FastBit Users
> > Cc: Dominique Prunier
> > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
> >
> > Hi, Dominique,
> >
> > I have replaced the specific test that produced the warning message
> > with a different one that is based on the file size rather the element
> > size of the column.  Should work better this time around.
> >
> > Please give me a sample query if you still have problems with the
> > code.  Apparently, my guess of how you invoking the various functions
> > is not exact correct.
> >
> > Thanks.
> >
> > John
> >
> >
> > On 1/18/12 6:55 AM, Dominique Prunier wrote:
> >> Hey John,
> >>
> >> The .int file gets correctly generated and is perfectly equal to the
> one i generated by hand.
> >> However, the selection fails because the column is of type CATEGORY,
> returning a 0 size in elementSize() and warns in selectValuesT:
> >>
> >> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible
> types
> >>
> >> Thakns,
> >>
> >> -----Original Message-----
> >> From: K. John Wu [mailto:[email protected]]
> >> Sent: Wednesday, January 18, 2012 2:46 AM
> >> To: FastBit Users
> >> Cc: Dominique Prunier
> >> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
> >>
> >> Hi, Dominique,
> >>
> >> In this particular case, FastBit actually has some code that was
> >> commented out that generate an integer version of the categorical
> >> values.  Using these values should speed up the processing of strings
> >> in group by operations.  An update has been checked in as SVN 460.
> >> Please give it a try when you get a chance.
> >>
> >> Thanks.
> >>
> >> John
> >>
> >>
> >> On 1/17/12 7:12 PM, Dominique Prunier wrote:
> >>> Hey John,
> >>>
> >>> I can certainly provide more detail.
> >>>
> >>> I think my test case can be simplified by simply creating one
> partition with a single CATEGORY column with a lot of repetition, e.g. 1-2%
> distinct values or something (probably sort order has an importance too).
> >>>
> >>> What i'm doing is selecting this column (with or without a filter)
> through the query/result classes (which create a bundle internaly). The
> bundle will sort the hits using the integer representation of the string,
> which is the best decision according to me. However, geting the list of
> integer values from a hit mask seems to be a very expansive operation (more
> specifically, relic::keys). Since there is no trivial way to get the
> integer value of the column from a given row id, the keys method goes
> through every distinct values in the index and check which one matched the
> mask. This is what is hurting the perf badly compared to a plain uint
> column where the position of a value is known implicitely
> (sizeof(uint)*index of 1s).
> >>>
> >>> So i don't think there is a bad decision here rather than a missing
> data structure that allows faster mask->category int value resolution. For
> category columns, retrieving the string and then convert it back to int
> could be better but i don't think it would beat a uint column.
> >>>
> >>> My secondary test tend to prove all this by replacing my category
> column by a real uint column (using values from the dictionary). The query
> runs much faster because the cost or retrieving the uint value is very low
> compared to the cost of relic::keys.
> >
> >
> >
> >>>
> >>> Hope this is clearer.
> >>>
> >>> Thanks,
> >>>
> >>> ________________________________________
> >>> From: K. John Wu [[email protected]]
> >>> Sent: January-17-12 9:28 PM
> >>> To: FastBit Users
> >>> Cc: Dominique Prunier
> >>> Subject: Re: [FastBit-users] ibis::relic::keys performance and
> CATEGORY columns
> >>>
> >>> Hi, Dominique,
> >>>
> >>> To get the values satisfying a query, there is a choice between
> >>> whether to get the selected values from the bitmap index or from the
> >>> original data file.  Looks like in this case, decision was wrong.  I'd
> >>> be happy to take a more careful look into how this decision is made if
> >>> you can provide me with a bit more detail..
> >>>
> >>> John
> >>>
> >>>
> >>> On 1/17/12 5:06 PM, Dominique Prunier wrote:
> >>>> Hi,
> >>>>
> >>>>
> >>>>
> >>>> I was trying to troubleshoot some performance issues and i found some
> >>>> surprising results, maybe related to the fact that i’m not so familiar
> >>>> with FastBit.
> >>>>
> >>>>
> >>>>
> >>>> I have a partition containing 2,749,086 rows. I’m selecting a category
> >>>> column with a quite selective where clause (430,906 hits over the
> >>>> 2,749,086 rows, roughly 15%).
> >>>>
> >>>>
> >>>>
> >>>> After digging up a bit, i found that most of the time is spent at
> >>>> ibis::relic::keys, which was not really what i would have expected (so
> >>>> i added a timer to measure it):
> >>>>
> >>>>
> >>>>
> >>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
> >>>> elapsed seconds
> >>>>
> >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
> >>>> seconds, 8.72055 elapsed seconds
> >>>>
> >>>>
> >>>>
> >>>> I was wondering what could be done to make this function faster.
> >>>>
> >>>>
> >>>>
> >>>> My understanding is that currently, it goes through all the bitmaps
> >>>> for each distinct value (~37,000 for this column), bitwise-and it with
> >>>> the hit vector and collects values for each matching position.
> >>>>
> >>>> What would you thing about some sort of index, similar to an actual
> >>>> uint column where each string corresponding value would be stored at
> >>>> its position.
> >>>>
> >>>> This way, i think it would be possible to collect keys much faster, at
> >>>> speed similar to ibis::column::selectUInts.
> >>>>
> >>>>
> >>>>
> >>>> To evaluate the potential speedup, i build a UINT column from the
> >>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
> >>>> one is MUCH faster (>20x):
> >>>>
> >>>>
> >>>>
> >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
> >>>> seconds, 0.38984 elapsed seconds
> >>>>
> >>>>
> >>>>
> >>>> Do you think this is something big to implement ? Would it make sense
> >>>> for you to evaluate this ?
> >>>>
> >>>>
> >>>>
> >>>> I’m waiting for your comments on this.
> >>>>
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>>
> >>>>
> >>>> */Dominique Prunier/**//*
> >>>>
> >>>>  APG Lead Developper
> >>>>
> >>>> Logo-W4N-100dpi
> >>>>
> >>>>  4388, rue Saint-Denis
> >>>>
> >>>>  Bureau 309
> >>>>
> >>>>  Montreal (Quebec)  H2J 2L1
> >>>>
> >>>>  Tel. +1 514-842-6767 x310
> >>>>
> >>>>  Fax +1 514-842-3989
> >>>>
> >>>>  [email protected] <mailto:
> [email protected]>
> >>>>
> >>>>  www.watch4net.com <http://www.watch4net.com/>
> >>>>
> >>>> /  /
> >>>>
> >>>> /This message is for the designated recipient only and may contain
> >>>> privileged, proprietary, or otherwise private information. If you have
> >>>> received it in error, please notify the sender immediately and delete
> >>>> the original. Any other use of this electronic mail by you is
> prohibited.
> >>>>
> >>>> //Ce message est pour le récipiendaire désigné seulement et peut
> >>>> contenir des informations privilégiées, propriétaires ou autrement
> >>>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur
> >>>> immédiatement et effacez l'original. Toute autre utilisation de ce
> >>>> courrier électronique par vous est prohibée.///
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> FastBit-users mailing list
> >>>> [email protected]
> >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> >>> _______________________________________________
> >>> FastBit-users mailing list
> >>> [email protected]
> >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> >> _______________________________________________
> >> FastBit-users mailing list
> >> [email protected]
> >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> > _______________________________________________
> > FastBit-users mailing list
> > [email protected]
> > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> > _______________________________________________
> > FastBit-users mailing list
> > [email protected]
> > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> > _______________________________________________
> > FastBit-users mailing list
> > [email protected]
> > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>
>
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to