Dominique, Have you tested a degenerate case where (maybe dumb) the CATEGORY set is in the 10K, 100K, 1M range ? and compared the speed/memory results.
- Jon On Mon, Jan 23, 2012 at 10:35 AM, Dominique Prunier < [email protected]> wrote: > Hey, > > Now that ibis::relic::keys is extremely fast, i was wondering if applying > the attached patch wouldn't be a good idea. It replaces the selectString > from text column with a dictionary based version. In basic testings, it > seems to be much faster that the .sp based resolution. > > Since i'm new to both c++ and Fastbit, please make sure that it is not > breaking anything (especially about locking and these kind of things) :) > > Thanks, > > -----Original Message----- > From: [email protected] [mailto: > [email protected]] On Behalf Of Dominique Prunier > Sent: Thursday, January 19, 2012 2:12 PM > To: K. John Wu; FastBit Users > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > Cool, it seems that this sort it at least as expansive as the one on > values, so that would give quite a boost to queries ! > > Thanks, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Thursday, January 19, 2012 2:08 PM > To: FastBit Users > Cc: Dominique Prunier > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > Hi, Dominique, > > You are right that in the spirit of SQL standard, the function sort > for ibis::bundle1 and ibis::bundles does not need to sort the row ids > in each segment/bundle. It was there mostly because the first > application this code was used demanded this particular feature. It > should be safe to leave the row ids in what ever order they are. I > will see what happens if I put a conditional macro around the section > of code.. > > John > > > On 1/19/12 10:04 AM, Dominique Prunier wrote: > > Erf, finally impressions can be wrong sometimes. The counting sort is > twice as fast as the qsort with -O0 but qsort is 50% faster with -O2... > > Anyway, can you explain me why in bundle1::sort we have to sort the RIDs > of each segment ? > > > > Thanks, > > > > -----Original Message----- > > From: [email protected] [mailto: > [email protected]] On Behalf Of Dominique Prunier > > Sent: Wednesday, January 18, 2012 6:24 PM > > To: FastBit Users > > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > > > Hi, > > > > I implemented a very simple and naive counting sort algorithm for > category columns and it ended up being twice as fast as the regular quick > sort. > > There is probably some fine tunning to do (quick sort might be faster > for smaller arrays/distinct values) but it look promising. > > > > Thanks, > > > > -----Original Message----- > > From: [email protected] [mailto: > [email protected]] On Behalf Of Dominique Prunier > > Sent: Wednesday, January 18, 2012 1:27 PM > > To: FastBit Users > > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > > > This is working just fine now !! The createBundle now reports a time of > 0.4s instead of the previous 8.7s ! > > In the 0.4s, most of the time is now consumed by the sort (0.35s), which > i'm going to work on a little bit for category columns. > > If i have something interesting, i'll obviously share it with you. > > > > Thanks, > > > > -----Original Message----- > > From: K. John Wu [mailto:[email protected]] > > Sent: Wednesday, January 18, 2012 11:52 AM > > To: FastBit Users > > Cc: Dominique Prunier > > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > > > Hi, Dominique, > > > > I have replaced the specific test that produced the warning message > > with a different one that is based on the file size rather the element > > size of the column. Should work better this time around. > > > > Please give me a sample query if you still have problems with the > > code. Apparently, my guess of how you invoking the various functions > > is not exact correct. > > > > Thanks. > > > > John > > > > > > On 1/18/12 6:55 AM, Dominique Prunier wrote: > >> Hey John, > >> > >> The .int file gets correctly generated and is perfectly equal to the > one i generated by hand. > >> However, the selection fails because the column is of type CATEGORY, > returning a 0 size in elementSize() and warns in selectValuesT: > >> > >> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible > types > >> > >> Thakns, > >> > >> -----Original Message----- > >> From: K. John Wu [mailto:[email protected]] > >> Sent: Wednesday, January 18, 2012 2:46 AM > >> To: FastBit Users > >> Cc: Dominique Prunier > >> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > >> > >> Hi, Dominique, > >> > >> In this particular case, FastBit actually has some code that was > >> commented out that generate an integer version of the categorical > >> values. Using these values should speed up the processing of strings > >> in group by operations. An update has been checked in as SVN 460. > >> Please give it a try when you get a chance. > >> > >> Thanks. > >> > >> John > >> > >> > >> On 1/17/12 7:12 PM, Dominique Prunier wrote: > >>> Hey John, > >>> > >>> I can certainly provide more detail. > >>> > >>> I think my test case can be simplified by simply creating one > partition with a single CATEGORY column with a lot of repetition, e.g. 1-2% > distinct values or something (probably sort order has an importance too). > >>> > >>> What i'm doing is selecting this column (with or without a filter) > through the query/result classes (which create a bundle internaly). The > bundle will sort the hits using the integer representation of the string, > which is the best decision according to me. However, geting the list of > integer values from a hit mask seems to be a very expansive operation (more > specifically, relic::keys). Since there is no trivial way to get the > integer value of the column from a given row id, the keys method goes > through every distinct values in the index and check which one matched the > mask. This is what is hurting the perf badly compared to a plain uint > column where the position of a value is known implicitely > (sizeof(uint)*index of 1s). > >>> > >>> So i don't think there is a bad decision here rather than a missing > data structure that allows faster mask->category int value resolution. For > category columns, retrieving the string and then convert it back to int > could be better but i don't think it would beat a uint column. > >>> > >>> My secondary test tend to prove all this by replacing my category > column by a real uint column (using values from the dictionary). The query > runs much faster because the cost or retrieving the uint value is very low > compared to the cost of relic::keys. > > > > > > > >>> > >>> Hope this is clearer. > >>> > >>> Thanks, > >>> > >>> ________________________________________ > >>> From: K. John Wu [[email protected]] > >>> Sent: January-17-12 9:28 PM > >>> To: FastBit Users > >>> Cc: Dominique Prunier > >>> Subject: Re: [FastBit-users] ibis::relic::keys performance and > CATEGORY columns > >>> > >>> Hi, Dominique, > >>> > >>> To get the values satisfying a query, there is a choice between > >>> whether to get the selected values from the bitmap index or from the > >>> original data file. Looks like in this case, decision was wrong. I'd > >>> be happy to take a more careful look into how this decision is made if > >>> you can provide me with a bit more detail.. > >>> > >>> John > >>> > >>> > >>> On 1/17/12 5:06 PM, Dominique Prunier wrote: > >>>> Hi, > >>>> > >>>> > >>>> > >>>> I was trying to troubleshoot some performance issues and i found some > >>>> surprising results, maybe related to the fact that i’m not so familiar > >>>> with FastBit. > >>>> > >>>> > >>>> > >>>> I have a partition containing 2,749,086 rows. I’m selecting a category > >>>> column with a quite selective where clause (430,906 hits over the > >>>> 2,749,086 rows, roughly 15%). > >>>> > >>>> > >>>> > >>>> After digging up a bit, i found that most of the time is spent at > >>>> ibis::relic::keys, which was not really what i would have expected (so > >>>> i added a timer to measure it): > >>>> > >>>> > >>>> > >>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573 > >>>> elapsed seconds > >>>> > >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU > >>>> seconds, 8.72055 elapsed seconds > >>>> > >>>> > >>>> > >>>> I was wondering what could be done to make this function faster. > >>>> > >>>> > >>>> > >>>> My understanding is that currently, it goes through all the bitmaps > >>>> for each distinct value (~37,000 for this column), bitwise-and it with > >>>> the hit vector and collects values for each matching position. > >>>> > >>>> What would you thing about some sort of index, similar to an actual > >>>> uint column where each string corresponding value would be stored at > >>>> its position. > >>>> > >>>> This way, i think it would be possible to collect keys much faster, at > >>>> speed similar to ibis::column::selectUInts. > >>>> > >>>> > >>>> > >>>> To evaluate the potential speedup, i build a UINT column from the > >>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY > >>>> one is MUCH faster (>20x): > >>>> > >>>> > >>>> > >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU > >>>> seconds, 0.38984 elapsed seconds > >>>> > >>>> > >>>> > >>>> Do you think this is something big to implement ? Would it make sense > >>>> for you to evaluate this ? > >>>> > >>>> > >>>> > >>>> I’m waiting for your comments on this. > >>>> > >>>> > >>>> > >>>> Thanks, > >>>> > >>>> > >>>> > >>>> */Dominique Prunier/**//* > >>>> > >>>> APG Lead Developper > >>>> > >>>> Logo-W4N-100dpi > >>>> > >>>> 4388, rue Saint-Denis > >>>> > >>>> Bureau 309 > >>>> > >>>> Montreal (Quebec) H2J 2L1 > >>>> > >>>> Tel. +1 514-842-6767 x310 > >>>> > >>>> Fax +1 514-842-3989 > >>>> > >>>> [email protected] <mailto: > [email protected]> > >>>> > >>>> www.watch4net.com <http://www.watch4net.com/> > >>>> > >>>> / / > >>>> > >>>> /This message is for the designated recipient only and may contain > >>>> privileged, proprietary, or otherwise private information. If you have > >>>> received it in error, please notify the sender immediately and delete > >>>> the original. Any other use of this electronic mail by you is > prohibited. > >>>> > >>>> //Ce message est pour le récipiendaire désigné seulement et peut > >>>> contenir des informations privilégiées, propriétaires ou autrement > >>>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur > >>>> immédiatement et effacez l'original. Toute autre utilisation de ce > >>>> courrier électronique par vous est prohibée./// > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> FastBit-users mailing list > >>>> [email protected] > >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > >>> _______________________________________________ > >>> FastBit-users mailing list > >>> [email protected] > >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > >> _______________________________________________ > >> FastBit-users mailing list > >> [email protected] > >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > _______________________________________________ > > FastBit-users mailing list > > [email protected] > > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > _______________________________________________ > > FastBit-users mailing list > > [email protected] > > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > _______________________________________________ > > FastBit-users mailing list > > [email protected] > > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > >
_______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
