So far, I have tried a CATEGORY column with ~20 values and another with 100K values. Both were significantly faster (with the .int file), but I haven't analyzed the memory footprint of either. I'd expect it to be slightly higher, since I first load the list of keys, which is then converted to the list of values.
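As a rough illustration of that two-step resolution (a minimal sketch only, not the attached patch and not FastBit's actual API: `selectKeys`, `keysToValues`, and the plain `std::vector` stand-ins for the .int file and the dictionary are all hypothetical names chosen for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the .int file: codes[i] is the integer key of row i.
// Step 1: gather the keys of the selected rows (hits holds their row positions).
std::vector<std::uint32_t> selectKeys(const std::vector<std::uint32_t>& codes,
                                      const std::vector<std::size_t>& hits) {
    std::vector<std::uint32_t> keys;
    keys.reserve(hits.size());
    for (std::size_t row : hits) {
        assert(row < codes.size());  // a hit must point at an existing row
        keys.push_back(codes[row]);
    }
    return keys;
}

// Hypothetical stand-in for the dictionary: dict[k] is the string for key k.
// Step 2: convert each key to its string value through the dictionary.
std::vector<std::string> keysToValues(const std::vector<std::uint32_t>& keys,
                                      const std::vector<std::string>& dict) {
    std::vector<std::string> values;
    values.reserve(keys.size());
    for (std::uint32_t k : keys) {
        assert(k < dict.size());  // every key must be a valid dictionary index
        values.push_back(dict[k]);
    }
    return values;
}
```

In this sketch the key list and the value list are both alive at the same time, which is where I'd expect the slightly higher memory use to come from.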
Thanks,

-----Original Message-----
From: K. John Wu [mailto:[email protected]]
Sent: Monday, January 23, 2012 3:46 PM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns

Hi, Dominique,

Thanks for the suggestion.  Yes, this would be a good idea.  It might
still be necessary to call ibis::text::selectStrings in case the .int
file is not present.  Without that option, the selectUInts function
will go through a very slow route of recovering the integers from the
bitmap index.  I will make the modification and do some testing before
checking in the changes.

Thanks again.

John

PS: Gong Hei Fat Choi.  This is the first day of the year of the
dragon.  May you have a prosperous new year of the dragon.

On 1/23/12 10:35 AM, Dominique Prunier wrote:
> Hey,
>
> Now that ibis::relic::keys is extremely fast, I was wondering if applying the
> attached patch wouldn't be a good idea. It replaces the selectString from the
> text column with a dictionary-based version. In basic testing, it seems to
> be much faster than the .sp-based resolution.
>
> Since I'm new to both C++ and FastBit, please make sure that it is not
> breaking anything (especially regarding locking and that kind of thing) :)
>
> Thanks,
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Dominique Prunier
> Sent: Thursday, January 19, 2012 2:12 PM
> To: K. John Wu; FastBit Users
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
>
> Cool, it seems that this sort is at least as expensive as the one on values,
> so that would give quite a boost to queries!
>
> Thanks,
>
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]]
> Sent: Thursday, January 19, 2012 2:08 PM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
> columns
>
> Hi, Dominique,
>
> You are right that in the spirit of the SQL standard, the function sort
> for ibis::bundle1 and ibis::bundles does not need to sort the row ids
> in each segment/bundle.  It was there mostly because the first
> application this code was used for demanded this particular feature.  It
> should be safe to leave the row ids in whatever order they are.  I
> will see what happens if I put a conditional macro around the section
> of code.
>
> John
>
>
> On 1/19/12 10:04 AM, Dominique Prunier wrote:
>> Erf, in the end impressions can be wrong sometimes. The counting sort is twice
>> as fast as the qsort with -O0 but qsort is 50% faster with -O2...
>> Anyway, can you explain to me why in bundle1::sort we have to sort the RIDs of
>> each segment?
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>> Sent: Wednesday, January 18, 2012 6:24 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
>> columns
>>
>> Hi,
>>
>> I implemented a very simple and naive counting sort algorithm for category
>> columns and it ended up being twice as fast as the regular quick sort.
>> There is probably some fine tuning to do (quick sort might be faster for
>> smaller arrays/distinct values) but it looks promising.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>> Sent: Wednesday, January 18, 2012 1:27 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
>> columns
>>
>> This is working just fine now!! The createBundle now reports a time of 0.4s
>> instead of the previous 8.7s!
>> In the 0.4s, most of the time is now consumed by the sort (0.35s), which I'm
>> going to work on a little bit for category columns.
>> If I have something interesting, I'll obviously share it with you.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]]
>> Sent: Wednesday, January 18, 2012 11:52 AM
>> To: FastBit Users
>> Cc: Dominique Prunier
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
>> columns
>>
>> Hi, Dominique,
>>
>> I have replaced the specific test that produced the warning message
>> with a different one that is based on the file size rather than the element
>> size of the column.  Should work better this time around.
>>
>> Please give me a sample query if you still have problems with the
>> code.  Apparently, my guess of how you are invoking the various functions
>> is not exactly correct.
>>
>> Thanks.
>>
>> John
>>
>>
>> On 1/18/12 6:55 AM, Dominique Prunier wrote:
>>> Hey John,
>>>
>>> The .int file gets correctly generated and is identical to the one I
>>> generated by hand.
>>> However, the selection fails because the column is of type CATEGORY,
>>> so elementSize() returns 0 and selectValuesT warns:
>>>
>>> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types
>>>
>>> Thanks,
>>>
>>> -----Original Message-----
>>> From: K. John Wu [mailto:[email protected]]
>>> Sent: Wednesday, January 18, 2012 2:46 AM
>>> To: FastBit Users
>>> Cc: Dominique Prunier
>>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
>>> columns
>>>
>>> Hi, Dominique,
>>>
>>> In this particular case, FastBit actually has some code that was
>>> commented out that generates an integer version of the categorical
>>> values.  Using these values should speed up the processing of strings
>>> in group by operations.  An update has been checked in as SVN 460.
>>> Please give it a try when you get a chance.
>>>
>>> Thanks.
>>>
>>> John
>>>
>>>
>>> On 1/17/12 7:12 PM, Dominique Prunier wrote:
>>>> Hey John,
>>>>
>>>> I can certainly provide more detail.
>>>>
>>>> I think my test case can be simplified by simply creating one partition
>>>> with a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct
>>>> values or something (sort order probably matters too).
>>>>
>>>> What I'm doing is selecting this column (with or without a filter) through
>>>> the query/result classes (which create a bundle internally). The bundle
>>>> will sort the hits using the integer representation of the string, which
>>>> is the best decision in my opinion. However, getting the list of integer
>>>> values from a hit mask seems to be a very expensive operation (more
>>>> specifically, relic::keys). Since there is no trivial way to get the
>>>> integer value of the column from a given row id, the keys method goes
>>>> through every distinct value in the index and checks which one matched the
>>>> mask. This is what is hurting the perf badly compared to a plain uint
>>>> column where the position of a value is known implicitly
>>>> (sizeof(uint)*index of 1s).
>>>>
>>>> So I don't think there is a bad decision here, but rather a missing data
>>>> structure that would allow faster mask->category int value resolution. For
>>>> category columns, retrieving the string and then converting it back to int
>>>> could be better but I don't think it would beat a uint column.
>>>>
>>>> My secondary test tends to confirm all this by replacing my category column
>>>> with a real uint column (using values from the dictionary). The query runs
>>>> much faster because the cost of retrieving the uint value is very low
>>>> compared to the cost of relic::keys.
>>>>
>>>> Hope this is clearer.
>>>>
>>>> Thanks,
>>>>
>>>> ________________________________________
>>>> From: K. John Wu [[email protected]]
>>>> Sent: January-17-12 9:28 PM
>>>> To: FastBit Users
>>>> Cc: Dominique Prunier
>>>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY
>>>> columns
>>>>
>>>> Hi, Dominique,
>>>>
>>>> To get the values satisfying a query, there is a choice between
>>>> getting the selected values from the bitmap index or from the
>>>> original data file.  Looks like in this case, the decision was wrong.  I'd
>>>> be happy to take a more careful look into how this decision is made if
>>>> you can provide me with a bit more detail.
>>>>
>>>> John
>>>>
>>>>
>>>> On 1/17/12 5:06 PM, Dominique Prunier wrote:
>>>>> Hi,
>>>>>
>>>>> I was trying to troubleshoot some performance issues and I found some
>>>>> surprising results, maybe related to the fact that I’m not so familiar
>>>>> with FastBit.
>>>>>
>>>>> I have a partition containing 2,749,086 rows. I’m selecting a category
>>>>> column with a quite selective where clause (430,906 hits over the
>>>>> 2,749,086 rows, roughly 15%).
>>>>>
>>>>> After digging a bit, I found that most of the time is spent in
>>>>> ibis::relic::keys, which was not really what I would have expected (so
>>>>> I added a timer to measure it):
>>>>>
>>>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573 elapsed seconds
>>>>>
>>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU seconds, 8.72055 elapsed seconds
>>>>>
>>>>> I was wondering what could be done to make this function faster.
>>>>>
>>>>> My understanding is that currently, it goes through all the bitmaps
>>>>> for each distinct value (~37,000 for this column), bitwise-ands each with
>>>>> the hit vector and collects values for each matching position.
>>>>>
>>>>> What would you think about some sort of index, similar to an actual
>>>>> uint column, where the integer value corresponding to each string would
>>>>> be stored at its row position?
>>>>>
>>>>> This way, I think it would be possible to collect keys much faster, at
>>>>> a speed similar to ibis::column::selectUInts.
>>>>>
>>>>> To evaluate the potential speedup, I built a UINT column from the
>>>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
>>>>> one is MUCH faster (>20x):
>>>>>
>>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU seconds, 0.38984 elapsed seconds
>>>>>
>>>>> Do you think this is something big to implement? Would it make sense
>>>>> for you to evaluate this?
>>>>>
>>>>> I’m waiting for your comments on this.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dominique Prunier
>>>>> APG Lead Developer

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
