Cool, it seems that this sort it at least as expansive as the one on values, so 
that would give quite a boost to queries !

Thanks,

-----Original Message-----
From: K. John Wu [mailto:[email protected]] 
Sent: Thursday, January 19, 2012 2:08 PM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns

Hi, Dominique,

You are right that in the spirit of SQL standard, the function sort
for ibis::bundle1 and ibis::bundles does not need to sort the row ids
in each segment/bundle.  It was there mostly because the first
application this code was used demanded this particular feature.  It
should be safe to leave the row ids in what ever order they are.  I
will see what happens if I put a conditional macro around the section
of code..

John


On 1/19/12 10:04 AM, Dominique Prunier wrote:
> Erf, finally impressions can be wrong sometimes. The counting sort is twice 
> as fast as the qsort with -O0 but qsort is 50% faster with -O2...
> Anyway, can you explain me why in bundle1::sort we have to sort the RIDs of 
> each segment ?
> 
> Thanks,
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Dominique Prunier
> Sent: Wednesday, January 18, 2012 6:24 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Hi,
> 
> I implemented a very simple and naive counting sort algorithm for category 
> columns and it ended up being twice as fast as the regular quick sort.
> There is probably some fine tunning to do (quick sort might be faster for 
> smaller arrays/distinct values) but it look promising.
> 
> Thanks,
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Dominique Prunier
> Sent: Wednesday, January 18, 2012 1:27 PM
> To: FastBit Users
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> This is working just fine now !! The createBundle now reports a time of 0.4s 
> instead of the previous 8.7s !
> In the 0.4s, most of the time is now consumed by the sort (0.35s), which i'm 
> going to work on a little bit for category columns.
> If i have something interesting, i'll obviously share it with you.
> 
> Thanks,
> 
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]] 
> Sent: Wednesday, January 18, 2012 11:52 AM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Hi, Dominique,
> 
> I have replaced the specific test that produced the warning message
> with a different one that is based on the file size rather the element
> size of the column.  Should work better this time around.
> 
> Please give me a sample query if you still have problems with the
> code.  Apparently, my guess of how you invoking the various functions
> is not exact correct.
> 
> Thanks.
> 
> John
> 
> 
> On 1/18/12 6:55 AM, Dominique Prunier wrote:
>> Hey John,
>>
>> The .int file gets correctly generated and is perfectly equal to the one i 
>> generated by hand.
>> However, the selection fails because the column is of type CATEGORY, 
>> returning a 0 size in elementSize() and warns in selectValuesT:
>>
>> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types
>>
>> Thakns,
>>
>> -----Original Message-----
>> From: K. John Wu [mailto:[email protected]] 
>> Sent: Wednesday, January 18, 2012 2:46 AM
>> To: FastBit Users
>> Cc: Dominique Prunier
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>> columns
>>
>> Hi, Dominique,
>>
>> In this particular case, FastBit actually has some code that was
>> commented out that generate an integer version of the categorical
>> values.  Using these values should speed up the processing of strings
>> in group by operations.  An update has been checked in as SVN 460.
>> Please give it a try when you get a chance.
>>
>> Thanks.
>>
>> John
>>
>>
>> On 1/17/12 7:12 PM, Dominique Prunier wrote:
>>> Hey John,
>>>
>>> I can certainly provide more detail. 
>>>
>>> I think my test case can be simplified by simply creating one partition 
>>> with a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct 
>>> values or something (probably sort order has an importance too).
>>>
>>> What i'm doing is selecting this column (with or without a filter) through 
>>> the query/result classes (which create a bundle internaly). The bundle will 
>>> sort the hits using the integer representation of the string, which is the 
>>> best decision according to me. However, geting the list of integer values 
>>> from a hit mask seems to be a very expansive operation (more specifically, 
>>> relic::keys). Since there is no trivial way to get the integer value of the 
>>> column from a given row id, the keys method goes through every distinct 
>>> values in the index and check which one matched the mask. This is what is 
>>> hurting the perf badly compared to a plain uint column where the position 
>>> of a value is known implicitely (sizeof(uint)*index of 1s).
>>>
>>> So i don't think there is a bad decision here rather than a missing data 
>>> structure that allows faster mask->category int value resolution. For 
>>> category columns, retrieving the string and then convert it back to int 
>>> could be better but i don't think it would beat a uint column.
>>>
>>> My secondary test tend to prove all this by replacing my category column by 
>>> a real uint column (using values from the dictionary). The query runs much 
>>> faster because the cost or retrieving the uint value is very low compared 
>>> to the cost of relic::keys.
> 
> 
> 
>>>
>>> Hope this is clearer.
>>>
>>> Thanks,
>>>
>>> ________________________________________
>>> From: K. John Wu [[email protected]]
>>> Sent: January-17-12 9:28 PM
>>> To: FastBit Users
>>> Cc: Dominique Prunier
>>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>>> columns
>>>
>>> Hi, Dominique,
>>>
>>> To get the values satisfying a query, there is a choice between
>>> whether to get the selected values from the bitmap index or from the
>>> original data file.  Looks like in this case, decision was wrong.  I'd
>>> be happy to take a more careful look into how this decision is made if
>>> you can provide me with a bit more detail..
>>>
>>> John
>>>
>>>
>>> On 1/17/12 5:06 PM, Dominique Prunier wrote:
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I was trying to troubleshoot some performance issues and i found some
>>>> surprising results, maybe related to the fact that i’m not so familiar
>>>> with FastBit.
>>>>
>>>>
>>>>
>>>> I have a partition containing 2,749,086 rows. I’m selecting a category
>>>> column with a quite selective where clause (430,906 hits over the
>>>> 2,749,086 rows, roughly 15%).
>>>>
>>>>
>>>>
>>>> After digging up a bit, i found that most of the time is spent at
>>>> ibis::relic::keys, which was not really what i would have expected (so
>>>> i added a timer to measure it):
>>>>
>>>>
>>>>
>>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
>>>> elapsed seconds
>>>>
>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
>>>> seconds, 8.72055 elapsed seconds
>>>>
>>>>
>>>>
>>>> I was wondering what could be done to make this function faster.
>>>>
>>>>
>>>>
>>>> My understanding is that currently, it goes through all the bitmaps
>>>> for each distinct value (~37,000 for this column), bitwise-and it with
>>>> the hit vector and collects values for each matching position.
>>>>
>>>> What would you thing about some sort of index, similar to an actual
>>>> uint column where each string corresponding value would be stored at
>>>> its position.
>>>>
>>>> This way, i think it would be possible to collect keys much faster, at
>>>> speed similar to ibis::column::selectUInts.
>>>>
>>>>
>>>>
>>>> To evaluate the potential speedup, i build a UINT column from the
>>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
>>>> one is MUCH faster (>20x):
>>>>
>>>>
>>>>
>>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
>>>> seconds, 0.38984 elapsed seconds
>>>>
>>>>
>>>>
>>>> Do you think this is something big to implement ? Would it make sense
>>>> for you to evaluate this ?
>>>>
>>>>
>>>>
>>>> I’m waiting for your comments on this.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>>
>>>> */Dominique Prunier/**//*
>>>>
>>>>  APG Lead Developper
>>>>
>>>> Logo-W4N-100dpi
>>>>
>>>>  4388, rue Saint-Denis
>>>>
>>>>  Bureau 309
>>>>
>>>>  Montreal (Quebec)  H2J 2L1
>>>>
>>>>  Tel. +1 514-842-6767  x310
>>>>
>>>>  Fax +1 514-842-3989
>>>>
>>>>  [email protected] <mailto:[email protected]>
>>>>
>>>>  www.watch4net.com <http://www.watch4net.com/>
>>>>
>>>> /  /
>>>>
>>>> /This message is for the designated recipient only and may contain
>>>> privileged, proprietary, or otherwise private information. If you have
>>>> received it in error, please notify the sender immediately and delete
>>>> the original. Any other use of this electronic mail by you is prohibited.
>>>>
>>>> //Ce message est pour le récipiendaire désigné seulement et peut
>>>> contenir des informations privilégiées, propriétaires ou autrement
>>>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur
>>>> immédiatement et effacez l'original. Toute autre utilisation de ce
>>>> courrier électronique par vous est prohibée.///
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to