Hi, Dominique, You are right that in the spirit of SQL standard, the function sort for ibis::bundle1 and ibis::bundles does not need to sort the row ids in each segment/bundle. It was there mostly because the first application this code was used demanded this particular feature. It should be safe to leave the row ids in what ever order they are. I will see what happens if I put a conditional macro around the section of code..
John On 1/19/12 10:04 AM, Dominique Prunier wrote: > Erf, finally impressions can be wrong sometimes. The counting sort is twice > as fast as the qsort with -O0 but qsort is 50% faster with -O2... > Anyway, can you explain me why in bundle1::sort we have to sort the RIDs of > each segment ? > > Thanks, > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Dominique Prunier > Sent: Wednesday, January 18, 2012 6:24 PM > To: FastBit Users > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > Hi, > > I implemented a very simple and naive counting sort algorithm for category > columns and it ended up being twice as fast as the regular quick sort. > There is probably some fine tunning to do (quick sort might be faster for > smaller arrays/distinct values) but it look promising. > > Thanks, > > -----Original Message----- > From: [email protected] > [mailto:[email protected]] On Behalf Of Dominique Prunier > Sent: Wednesday, January 18, 2012 1:27 PM > To: FastBit Users > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > This is working just fine now !! The createBundle now reports a time of 0.4s > instead of the previous 8.7s ! > In the 0.4s, most of the time is now consumed by the sort (0.35s), which i'm > going to work on a little bit for category columns. > If i have something interesting, i'll obviously share it with you. > > Thanks, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Wednesday, January 18, 2012 11:52 AM > To: FastBit Users > Cc: Dominique Prunier > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > Hi, Dominique, > > I have replaced the specific test that produced the warning message > with a different one that is based on the file size rather the element > size of the column. Should work better this time around. > > Please give me a sample query if you still have problems with the > code. Apparently, my guess of how you invoking the various functions > is not exact correct. > > Thanks. > > John > > > On 1/18/12 6:55 AM, Dominique Prunier wrote: >> Hey John, >> >> The .int file gets correctly generated and is perfectly equal to the one i >> generated by hand. >> However, the selection fails because the column is of type CATEGORY, >> returning a 0 size in elementSize() and warns in selectValuesT: >> >> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types >> >> Thakns, >> >> -----Original Message----- >> From: K. John Wu [mailto:[email protected]] >> Sent: Wednesday, January 18, 2012 2:46 AM >> To: FastBit Users >> Cc: Dominique Prunier >> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY >> columns >> >> Hi, Dominique, >> >> In this particular case, FastBit actually has some code that was >> commented out that generate an integer version of the categorical >> values. Using these values should speed up the processing of strings >> in group by operations. An update has been checked in as SVN 460. >> Please give it a try when you get a chance. >> >> Thanks. >> >> John >> >> >> On 1/17/12 7:12 PM, Dominique Prunier wrote: >>> Hey John, >>> >>> I can certainly provide more detail. >>> >>> I think my test case can be simplified by simply creating one partition >>> with a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct >>> values or something (probably sort order has an importance too). >>> >>> What i'm doing is selecting this column (with or without a filter) through >>> the query/result classes (which create a bundle internaly). The bundle will >>> sort the hits using the integer representation of the string, which is the >>> best decision according to me. However, geting the list of integer values >>> from a hit mask seems to be a very expansive operation (more specifically, >>> relic::keys). Since there is no trivial way to get the integer value of the >>> column from a given row id, the keys method goes through every distinct >>> values in the index and check which one matched the mask. This is what is >>> hurting the perf badly compared to a plain uint column where the position >>> of a value is known implicitely (sizeof(uint)*index of 1s). >>> >>> So i don't think there is a bad decision here rather than a missing data >>> structure that allows faster mask->category int value resolution. For >>> category columns, retrieving the string and then convert it back to int >>> could be better but i don't think it would beat a uint column. >>> >>> My secondary test tend to prove all this by replacing my category column by >>> a real uint column (using values from the dictionary). The query runs much >>> faster because the cost or retrieving the uint value is very low compared >>> to the cost of relic::keys. > > > >>> >>> Hope this is clearer. >>> >>> Thanks, >>> >>> ________________________________________ >>> From: K. John Wu [[email protected]] >>> Sent: January-17-12 9:28 PM >>> To: FastBit Users >>> Cc: Dominique Prunier >>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY >>> columns >>> >>> Hi, Dominique, >>> >>> To get the values satisfying a query, there is a choice between >>> whether to get the selected values from the bitmap index or from the >>> original data file. Looks like in this case, decision was wrong. I'd >>> be happy to take a more careful look into how this decision is made if >>> you can provide me with a bit more detail.. >>> >>> John >>> >>> >>> On 1/17/12 5:06 PM, Dominique Prunier wrote: >>>> Hi, >>>> >>>> >>>> >>>> I was trying to troubleshoot some performance issues and i found some >>>> surprising results, maybe related to the fact that i’m not so familiar >>>> with FastBit. >>>> >>>> >>>> >>>> I have a partition containing 2,749,086 rows. I’m selecting a category >>>> column with a quite selective where clause (430,906 hits over the >>>> 2,749,086 rows, roughly 15%). >>>> >>>> >>>> >>>> After digging up a bit, i found that most of the time is spent at >>>> ibis::relic::keys, which was not really what i would have expected (so >>>> i added a timer to measure it): >>>> >>>> >>>> >>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573 >>>> elapsed seconds >>>> >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU >>>> seconds, 8.72055 elapsed seconds >>>> >>>> >>>> >>>> I was wondering what could be done to make this function faster. >>>> >>>> >>>> >>>> My understanding is that currently, it goes through all the bitmaps >>>> for each distinct value (~37,000 for this column), bitwise-and it with >>>> the hit vector and collects values for each matching position. >>>> >>>> What would you thing about some sort of index, similar to an actual >>>> uint column where each string corresponding value would be stored at >>>> its position. >>>> >>>> This way, i think it would be possible to collect keys much faster, at >>>> speed similar to ibis::column::selectUInts. >>>> >>>> >>>> >>>> To evaluate the potential speedup, i build a UINT column from the >>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY >>>> one is MUCH faster (>20x): >>>> >>>> >>>> >>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU >>>> seconds, 0.38984 elapsed seconds >>>> >>>> >>>> >>>> Do you think this is something big to implement ? Would it make sense >>>> for you to evaluate this ? >>>> >>>> >>>> >>>> I’m waiting for your comments on this. >>>> >>>> >>>> >>>> Thanks, >>>> >>>> >>>> >>>> */Dominique Prunier/**//* >>>> >>>> APG Lead Developper >>>> >>>> Logo-W4N-100dpi >>>> >>>> 4388, rue Saint-Denis >>>> >>>> Bureau 309 >>>> >>>> Montreal (Quebec) H2J 2L1 >>>> >>>> Tel. +1 514-842-6767 x310 >>>> >>>> Fax +1 514-842-3989 >>>> >>>> [email protected] <mailto:[email protected]> >>>> >>>> www.watch4net.com <http://www.watch4net.com/> >>>> >>>> / / >>>> >>>> /This message is for the designated recipient only and may contain >>>> privileged, proprietary, or otherwise private information. If you have >>>> received it in error, please notify the sender immediately and delete >>>> the original. Any other use of this electronic mail by you is prohibited. >>>> >>>> //Ce message est pour le récipiendaire désigné seulement et peut >>>> contenir des informations privilégiées, propriétaires ou autrement >>>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur >>>> immédiatement et effacez l'original. Toute autre utilisation de ce >>>> courrier électronique par vous est prohibée./// >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
