Hi, Dominique,

Thanks for the suggestion.  Yes, this would be a good idea.  It might
still be necessary to call ibis::text::selectStrings in case the .int
file is not present.  Without that option, the selectUInts function
will go through a very slow route of recovering the integers from the
bitmap index.  I will make the modification and do some testing before
check in the changes.

Thanks again.

John

PS: Gong Hei Fat Choi.  This is the first day of the year of dragon.
May you have a prosperous new year of dragon.



On 1/23/12 10:35 AM, Dominique Prunier wrote:
> Hey,
> 
> Now that ibis::relic::keys is extremely fast, i was wondering if applying the 
> attached patch wouldn't be a good idea. It replaces the selectString from 
> text column with a dictionary based version. In basic testings, it seems to 
> be much faster that the .sp based resolution.
> 
> Since i'm new to both c++ and Fastbit, please make sure that it is not 
> breaking anything (especially about locking and these kind of things) :)
> 
> Thanks,
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Dominique Prunier
> Sent: Thursday, January 19, 2012 2:12 PM
> To: K. John Wu; FastBit Users
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Cool, it seems that this sort it at least as expansive as the one on values, 
> so that would give quite a boost to queries !
> 
> Thanks,
> 
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]] 
> Sent: Thursday, January 19, 2012 2:08 PM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Hi, Dominique,
> 
> You are right that in the spirit of SQL standard, the function sort
> for ibis::bundle1 and ibis::bundles does not need to sort the row ids
> in each segment/bundle.  It was there mostly because the first
> application this code was used demanded this particular feature.  It
> should be safe to leave the row ids in what ever order they are.  I
> will see what happens if I put a conditional macro around the section
> of code..
> 
> John
> 
> 
> On 1/19/12 10:04 AM, Dominique Prunier wrote:
>> Erf, finally impressions can be wrong sometimes. The counting sort is twice 
>> as fast as the qsort with -O0 but qsort is 50% faster with -O2...
>> Anyway, can you explain me why in bundle1::sort we have to sort the RIDs of 
>> each segment ?
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>> Sent: Wednesday, January 18, 2012 6:24 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>> columns
>>
>> Hi,
>>
>> I implemented a very simple and naive counting sort algorithm for category 
>> columns and it ended up being twice as fast as the regular quick sort.
>> There is probably some fine tunning to do (quick sort might be faster for 
>> smaller arrays/distinct values) but it look promising.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Dominique Prunier
>> Sent: Wednesday, January 18, 2012 1:27 PM
>> To: FastBit Users
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>> columns
>>
>> This is working just fine now !! The createBundle now reports a time of 0.4s 
>> instead of the previous 8.7s !
>> In the 0.4s, most of the time is now consumed by the sort (0.35s), which i'm 
>> going to work on a little bit for category columns.
>> If i have something interesting, i'll obviously share it with you.
>>
>> Thanks,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]] 
>> Sent: Wednesday, January 18, 2012 11:52 AM
>> To: FastBit Users
>> Cc: Dominique Prunier
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>> columns
>>
>> Hi, Dominique,
>>
>> I have replaced the specific test that produced the warning message
>> with a different one that is based on the file size rather the element
>> size of the column.  Should work better this time around.
>>
>> Please give me a sample query if you still have problems with the
>> code.  Apparently, my guess of how you invoking the various functions
>> is not exact correct.
>>
>> Thanks.
>>
>> John
>>
>>
>> On 1/18/12 6:55 AM, Dominique Prunier wrote:
>>> Hey John,
>>>
>>> The .int file gets correctly generated and is perfectly equal to the one i 
>>> generated by hand.
>>> However, the selection fails because the column is of type CATEGORY, 
>>> returning a 0 size in elementSize() and warns in selectValuesT:
>>>
>>> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types
>>>
>>> Thakns,
>>>
>>> -----Original Message-----
>>> From: K. John Wu [mailto:[email protected]] 
>>> Sent: Wednesday, January 18, 2012 2:46 AM
>>> To: FastBit Users
>>> Cc: Dominique Prunier
>>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>>> columns
>>>
>>> Hi, Dominique,
>>>
>>> In this particular case, FastBit actually has some code that was
>>> commented out that generate an integer version of the categorical
>>> values.  Using these values should speed up the processing of strings
>>> in group by operations.  An update has been checked in as SVN 460.
>>> Please give it a try when you get a chance.
>>>
>>> Thanks.
>>>
>>> John
>>>
>>>
>>> On 1/17/12 7:12 PM, Dominique Prunier wrote:
>>>> Hey John,
>>>>
>>>> I can certainly provide more detail. 
>>>>
>>>> I think my test case can be simplified by simply creating one partition 
>>>> with a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct 
>>>> values or something (probably sort order has an importance too).
>>>>
>>>> What i'm doing is selecting this column (with or without a filter) through 
>>>> the query/result classes (which create a bundle internaly). The bundle 
>>>> will sort the hits using the integer representation of the string, which 
>>>> is the best decision according to me. However, geting the list of integer 
>>>> values from a hit mask seems to be a very expansive operation (more 
>>>> specifically, relic::keys). Since there is no trivial way to get the 
>>>> integer value of the column from a given row id, the keys method goes 
>>>> through every distinct values in the index and check which one matched the 
>>>> mask. This is what is hurting the perf badly compared to a plain uint 
>>>> column where the position of a value is known implicitely 
>>>> (sizeof(uint)*index of 1s).
>>>>
>>>> So i don't think there is a bad decision here rather than a missing data 
>>>> structure that allows faster mask->category int value resolution. For 
>>>> category columns, retrieving the string and then convert it back to int 
>>>> could be better but i don't think it would beat a uint column.
>>>>
>>>> My secondary test tend to prove all this by replacing my category column 
>>>> by a real uint column (using values from the dictionary). The query runs 
>>>> much faster because the cost or retrieving the uint value is very low 
>>>> compared to the cost of relic::keys.
>>
>>
>>
>>>>
>>>> Hope this is clearer.
>>>>
>>>> Thanks,
>>>>
>>>> ________________________________________
>>>> From: K. John Wu [[email protected]]
>>>> Sent: January-17-12 9:28 PM
>>>> To: FastBit Users
>>>> Cc: Dominique Prunier
>>>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>>>> columns
>>>>
>>>> Hi, Dominique,
>>>>
>>>> To get the values satisfying a query, there is a choice between
>>>> whether to get the selected values from the bitmap index or from the
>>>> original data file.  Looks like in this case, decision was wrong.  I'd
>>>> be happy to take a more careful look into how this decision is made if
>>>> you can provide me with a bit more detail..
>>>>
>>>> John
>>>>
>>>>
>>>> On 1/17/12 5:06 PM, Dominique Prunier wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I was trying to troubleshoot some performance issues and i found some
>>>>> surprising results, maybe related to the fact that i’m not so familiar
>>>>> with FastBit.
>>>>>
>>>>>
>>>>>
>>>>> I have a partition containing 2,749,086 rows. I’m selecting a category
>>>>> column with a quite selective where clause (430,906 hits over the
>>>>> 2,749,086 rows, roughly 15%).
>>>>>
>>>>>
>>>>>
>>>>> After digging up a bit, i found that most of the time is spent at
>>>>> ibis::relic::keys, which was not really what i would have expected (so
>>>>> i added a timer to measure it):
>>>>>
>>>>>
>>>>>
>>>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
>>>>> elapsed seconds
>>>>>
>>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
>>>>> seconds, 8.72055 elapsed seconds
>>>>>
>>>>>
>>>>>
>>>>> I was wondering what could be done to make this function faster.
>>>>>
>>>>>
>>>>>
>>>>> My understanding is that currently, it goes through all the bitmaps
>>>>> for each distinct value (~37,000 for this column), bitwise-and it with
>>>>> the hit vector and collects values for each matching position.
>>>>>
>>>>> What would you thing about some sort of index, similar to an actual
>>>>> uint column where each string corresponding value would be stored at
>>>>> its position.
>>>>>
>>>>> This way, i think it would be possible to collect keys much faster, at
>>>>> speed similar to ibis::column::selectUInts.
>>>>>
>>>>>
>>>>>
>>>>> To evaluate the potential speedup, i build a UINT column from the
>>>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
>>>>> one is MUCH faster (>20x):
>>>>>
>>>>>
>>>>>
>>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
>>>>> seconds, 0.38984 elapsed seconds
>>>>>
>>>>>
>>>>>
>>>>> Do you think this is something big to implement ? Would it make sense
>>>>> for you to evaluate this ?
>>>>>
>>>>>
>>>>>
>>>>> I’m waiting for your comments on this.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>>
>>>>> */Dominique Prunier/**//*
>>>>>
>>>>>  APG Lead Developper
>>>>>
>>>>> Logo-W4N-100dpi
>>>>>
>>>>>  4388, rue Saint-Denis
>>>>>
>>>>>  Bureau 309
>>>>>
>>>>>  Montreal (Quebec)  H2J 2L1
>>>>>
>>>>>  Tel. +1 514-842-6767  x310
>>>>>
>>>>>  Fax +1 514-842-3989
>>>>>
>>>>>  [email protected] <mailto:[email protected]>
>>>>>
>>>>>  www.watch4net.com <http://www.watch4net.com/>
>>>>>
>>>>> /  /
>>>>>
>>>>> /This message is for the designated recipient only and may contain
>>>>> privileged, proprietary, or otherwise private information. If you have
>>>>> received it in error, please notify the sender immediately and delete
>>>>> the original. Any other use of this electronic mail by you is prohibited.
>>>>>
>>>>> //Ce message est pour le récipiendaire désigné seulement et peut
>>>>> contenir des informations privilégiées, propriétaires ou autrement
>>>>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur
>>>>> immédiatement et effacez l'original. Toute autre utilisation de ce
>>>>> courrier électronique par vous est prohibée.///
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> FastBit-users mailing list
>>>>> [email protected]
>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
> 
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to