This is working just fine now! createBundle now reports a time of 0.4s
instead of the previous 8.7s.
Of that 0.4s, most of the time is now consumed by the sort (0.35s), which I'm
going to work on a little for category columns.
If I find something interesting, I'll obviously share it with you.

Thanks,

-----Original Message-----
From: K. John Wu [mailto:[email protected]] 
Sent: Wednesday, January 18, 2012 11:52 AM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns

Hi, Dominique,

I have replaced the specific test that produced the warning message
with a different one that is based on the file size rather than the
element size of the column.  It should work better this time around.

Please give me a sample query if you still have problems with the
code.  Apparently, my guess at how you are invoking the various
functions was not exactly correct.

Thanks.

John


On 1/18/12 6:55 AM, Dominique Prunier wrote:
> Hey John,
> 
> The .int file gets generated correctly and is identical to the one I
> generated by hand.
> However, the selection fails because the column is of type CATEGORY:
> elementSize() returns 0, and selectValuesT warns:
>
> Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types
>
> Thanks,
> 
> -----Original Message-----
> From: K. John Wu [mailto:[email protected]] 
> Sent: Wednesday, January 18, 2012 2:46 AM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Hi, Dominique,
> 
> In this particular case, FastBit actually has some commented-out code
> that generates an integer version of the categorical values.  Using
> these values should speed up the processing of strings in group-by
> operations.  An update has been checked in as SVN 460.  Please give it
> a try when you get a chance.
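The "integer version of the categorical values" mentioned above can be pictured as a dictionary encoding: each distinct string gets a small integer code, and one code is stored per row. The following is only a minimal sketch of that idea in Python, not FastBit's code or its on-disk .int format; the function name `encode_categories`, the first-seen code order, and the choice to start codes at 1 (leaving 0 free for a null entry) are all assumptions made for illustration.

```python
from array import array


def encode_categories(values):
    """Map each string to an integer code and return (dictionary, codes).

    Hypothetical helper: codes are assigned in first-seen order starting
    at 1, so that 0 stays free for a null entry (an assumption).
    """
    dictionary = {}      # string -> integer code
    codes = array("I")   # one unsigned integer per row
    for value in values:
        # len(dictionary) is read before the new key is inserted,
        # so the first distinct string receives code 1.
        code = dictionary.setdefault(value, len(dictionary) + 1)
        codes.append(code)
    return dictionary, codes


rows = ["red", "blue", "red", "green", "blue", "red"]
dictionary, codes = encode_categories(rows)
print(dictionary)
print(list(codes))
```

With the codes stored per row, a group-by or sort can compare integers instead of strings and only translate back to strings for the final output.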
> 
> Thanks.
> 
> John
> 
> 
> On 1/17/12 7:12 PM, Dominique Prunier wrote:
>> Hey John,
>>
>> I can certainly provide more detail. 
>>
>> I think my test case can be simplified by creating one partition with
>> a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct values
>> or so (sort order probably matters too).
>>
>> What I'm doing is selecting this column (with or without a filter) through
>> the query/result classes (which create a bundle internally). The bundle will
>> sort the hits using the integer representation of the string, which is the
>> best decision in my opinion. However, getting the list of integer values
>> from a hit mask seems to be a very expensive operation (more specifically,
>> relic::keys). Since there is no trivial way to get the integer value of the
>> column from a given row id, the keys method goes through every distinct
>> value in the index and checks which ones match the mask. This is what is
>> hurting the performance badly compared to a plain uint column, where the
>> position of a value is known implicitly (sizeof(uint) * index of each 1
>> bit in the mask).
>>
>> So I don't think there is a bad decision here so much as a missing data
>> structure that allows faster mask -> category int value resolution. For
>> category columns, retrieving the string and then converting it back to an
>> int could be better, but I don't think it would beat a uint column.
>>
>> My second test tends to confirm all this: I replaced my category column
>> with a real uint column (using values from the dictionary). The query runs
>> much faster because the cost of retrieving the uint value is very low
>> compared to the cost of relic::keys.
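The difference between the two access paths described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FastBit code: bitmaps are modelled as plain Python integers (FastBit stores compressed bitmaps), and the names `set_bits`, `keys_via_bitmaps` and `keys_via_codes` are hypothetical. The first function visits every distinct value and ANDs its bitmap with the hit mask; the second reads the code directly at each hit position.

```python
def set_bits(x):
    """Yield the positions of the 1 bits of a non-negative int, lowest first."""
    pos = 0
    while x:
        if x & 1:
            yield pos
        x >>= 1
        pos += 1


def keys_via_bitmaps(bitmaps, mask):
    """Index path: one bitwise AND per distinct value, hits or not."""
    found = {}
    for code, bitmap in bitmaps.items():
        for row in set_bits(bitmap & mask):
            found[row] = code
    return [found[row] for row in sorted(found)]


def keys_via_codes(codes, mask):
    """Positional path: look up the stored code at each hit row."""
    return [codes[row] for row in set_bits(mask)]


codes = [1, 2, 1, 3, 2, 1]            # one integer code per row
bitmaps = {}                          # code -> bitmap of rows holding it
for row, code in enumerate(codes):
    bitmaps[code] = bitmaps.get(code, 0) | (1 << row)

mask = (1 << 0) | (1 << 3) | (1 << 4)  # hit rows 0, 3 and 4
print(keys_via_bitmaps(bitmaps, mask))
print(keys_via_codes(codes, mask))
```

Both functions return the same keys in row order. The work of the first grows with the number of distinct values, while the second depends only on the mask, which is consistent with the gap reported in this thread for a column with roughly 37,000 distinct values.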

>>
>> Hope this is clearer.
>>
>> Thanks,
>>
>> ________________________________________
>> From: K. John Wu [[email protected]]
>> Sent: January-17-12 9:28 PM
>> To: FastBit Users
>> Cc: Dominique Prunier
>> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
>> columns
>>
>> Hi, Dominique,
>>
>> To get the values satisfying a query, there is a choice between
>> getting the selected values from the bitmap index or from the
>> original data file.  It looks like the decision was wrong in this
>> case.  I'd be happy to take a more careful look into how this
>> decision is made if you can provide me with a bit more detail.
>>
>> John
>>
>>
>> On 1/17/12 5:06 PM, Dominique Prunier wrote:
>>> Hi,
>>>
>>> I was trying to troubleshoot some performance issues and I found some
>>> surprising results, maybe related to the fact that I’m not so familiar
>>> with FastBit.
>>>
>>> I have a partition containing 2,749,086 rows. I’m selecting a category
>>> column with a fairly selective where clause (430,906 hits out of the
>>> 2,749,086 rows, roughly 16%).
>>>
>>> After digging a bit, I found that most of the time is spent in
>>> ibis::relic::keys, which was not really what I would have expected (so
>>> I added a timer to measure it):
>>>
>>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
>>> elapsed seconds
>>>
>>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
>>> seconds, 8.72055 elapsed seconds
>>>
>>> I was wondering what could be done to make this function faster.
>>>
>>> My understanding is that currently, it goes through the bitmaps
>>> for all distinct values (~37,000 for this column), bitwise-ANDs each
>>> one with the hit vector, and collects the value for each matching
>>> position.
>>>
>>> What would you think about some sort of index, similar to an actual
>>> uint column, where the integer value corresponding to each string
>>> would be stored at its row position?
>>>
>>> This way, I think it would be possible to collect keys much faster, at
>>> a speed similar to ibis::column::selectUInts.
>>>
>>> To evaluate the potential speedup, I built a UINT column from the
>>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
>>> one is MUCH faster (>20x):
>>>
>>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
>>> seconds, 0.38984 elapsed seconds
>>>
>>> Do you think this is something big to implement? Would it make sense
>>> for you to evaluate this?
>>>
>>> I’m waiting for your comments on this.
>>>
>>> Thanks,
>>>
>>> Dominique Prunier
>>>
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
