This is working just fine now !! The createBundle now reports a time of 0.4s instead of the previous 8.7s ! In the 0.4s, most of the time is now consumed by the sort (0.35s), which i'm going to work on a little bit for category columns. If i have something interesting, i'll obviously share it with you.
Thanks, -----Original Message----- From: K. John Wu [mailto:[email protected]] Sent: Wednesday, January 18, 2012 11:52 AM To: FastBit Users Cc: Dominique Prunier Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns Hi, Dominique, I have replaced the specific test that produced the warning message with a different one that is based on the file size rather the element size of the column. Should work better this time around. Please give me a sample query if you still have problems with the code. Apparently, my guess of how you invoking the various functions is not exact correct. Thanks. John On 1/18/12 6:55 AM, Dominique Prunier wrote: > Hey John, > > The .int file gets correctly generated and is perfectly equal to the one i > generated by hand. > However, the selection fails because the column is of type CATEGORY, > returning a 0 size in elementSize() and warns in selectValuesT: > > Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types > > Thakns, > > -----Original Message----- > From: K. John Wu [mailto:[email protected]] > Sent: Wednesday, January 18, 2012 2:46 AM > To: FastBit Users > Cc: Dominique Prunier > Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY > columns > > Hi, Dominique, > > In this particular case, FastBit actually has some code that was > commented out that generate an integer version of the categorical > values. Using these values should speed up the processing of strings > in group by operations. An update has been checked in as SVN 460. > Please give it a try when you get a chance. > > Thanks. > > John > > > On 1/17/12 7:12 PM, Dominique Prunier wrote: >> Hey John, >> >> I can certainly provide more detail. >> >> I think my test case can be simplified by simply creating one partition with >> a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct values >> or something (probably sort order has an importance too). >> >> What i'm doing is selecting this column (with or without a filter) through >> the query/result classes (which create a bundle internaly). The bundle will >> sort the hits using the integer representation of the string, which is the >> best decision according to me. However, geting the list of integer values >> from a hit mask seems to be a very expansive operation (more specifically, >> relic::keys). Since there is no trivial way to get the integer value of the >> column from a given row id, the keys method goes through every distinct >> values in the index and check which one matched the mask. This is what is >> hurting the perf badly compared to a plain uint column where the position of >> a value is known implicitely (sizeof(uint)*index of 1s). >> >> So i don't think there is a bad decision here rather than a missing data >> structure that allows faster mask->category int value resolution. For >> category columns, retrieving the string and then convert it back to int >> could be better but i don't think it would beat a uint column. >> >> My secondary test tend to prove all this by replacing my category column by >> a real uint column (using values from the dictionary). The query runs much >> faster because the cost or retrieving the uint value is very low compared to >> the cost of relic::keys. >> >> Hope this is clearer. >> >> Thanks, >> >> ________________________________________ >> From: K. John Wu [[email protected]] >> Sent: January-17-12 9:28 PM >> To: FastBit Users >> Cc: Dominique Prunier >> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY >> columns >> >> Hi, Dominique, >> >> To get the values satisfying a query, there is a choice between >> whether to get the selected values from the bitmap index or from the >> original data file. Looks like in this case, decision was wrong. I'd >> be happy to take a more careful look into how this decision is made if >> you can provide me with a bit more detail.. >> >> John >> >> >> On 1/17/12 5:06 PM, Dominique Prunier wrote: >>> Hi, >>> >>> >>> >>> I was trying to troubleshoot some performance issues and i found some >>> surprising results, maybe related to the fact that i’m not so familiar >>> with FastBit. >>> >>> >>> >>> I have a partition containing 2,749,086 rows. I’m selecting a category >>> column with a quite selective where clause (430,906 hits over the >>> 2,749,086 rows, roughly 15%). >>> >>> >>> >>> After digging up a bit, i found that most of the time is spent at >>> ibis::relic::keys, which was not really what i would have expected (so >>> i added a timer to measure it): >>> >>> >>> >>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573 >>> elapsed seconds >>> >>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU >>> seconds, 8.72055 elapsed seconds >>> >>> >>> >>> I was wondering what could be done to make this function faster. >>> >>> >>> >>> My understanding is that currently, it goes through all the bitmaps >>> for each distinct value (~37,000 for this column), bitwise-and it with >>> the hit vector and collects values for each matching position. >>> >>> What would you thing about some sort of index, similar to an actual >>> uint column where each string corresponding value would be stored at >>> its position. >>> >>> This way, i think it would be possible to collect keys much faster, at >>> speed similar to ibis::column::selectUInts. >>> >>> >>> >>> To evaluate the potential speedup, i build a UINT column from the >>> CATEGORY column, and selecting the UINT column instead of the CATEGORY >>> one is MUCH faster (>20x): >>> >>> >>> >>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU >>> seconds, 0.38984 elapsed seconds >>> >>> >>> >>> Do you think this is something big to implement ? Would it make sense >>> for you to evaluate this ? >>> >>> >>> >>> I’m waiting for your comments on this. >>> >>> >>> >>> Thanks, >>> >>> >>> >>> */Dominique Prunier/**//* >>> >>> APG Lead Developper >>> >>> Logo-W4N-100dpi >>> >>> 4388, rue Saint-Denis >>> >>> Bureau 309 >>> >>> Montreal (Quebec) H2J 2L1 >>> >>> Tel. +1 514-842-6767 x310 >>> >>> Fax +1 514-842-3989 >>> >>> [email protected] <mailto:[email protected]> >>> >>> www.watch4net.com <http://www.watch4net.com/> >>> >>> / / >>> >>> /This message is for the designated recipient only and may contain >>> privileged, proprietary, or otherwise private information. If you have >>> received it in error, please notify the sender immediately and delete >>> the original. Any other use of this electronic mail by you is prohibited. >>> >>> //Ce message est pour le récipiendaire désigné seulement et peut >>> contenir des informations privilégiées, propriétaires ou autrement >>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur >>> immédiatement et effacez l'original. Toute autre utilisation de ce >>> courrier électronique par vous est prohibée./// >>> >>> >>> >>> >>> >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
