Hey John,

The .int file gets correctly generated and is perfectly equal to the one i 
generated by hand.
However, the selection fails because the column is of type CATEGORY, returning 
a 0 size in elementSize() and warns in selectValuesT:

Warning -- column[<col name>](CATEGORY)::selectValuesT -- incompatible types

Thakns,

-----Original Message-----
From: K. John Wu [mailto:[email protected]] 
Sent: Wednesday, January 18, 2012 2:46 AM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns

Hi, Dominique,

In this particular case, FastBit actually has some code that was
commented out that generate an integer version of the categorical
values.  Using these values should speed up the processing of strings
in group by operations.  An update has been checked in as SVN 460.
Please give it a try when you get a chance.

Thanks.

John


On 1/17/12 7:12 PM, Dominique Prunier wrote:
> Hey John,
> 
> I can certainly provide more detail. 
> 
> I think my test case can be simplified by simply creating one partition with 
> a single CATEGORY column with a lot of repetition, e.g. 1-2% distinct values 
> or something (probably sort order has an importance too).
> 
> What i'm doing is selecting this column (with or without a filter) through 
> the query/result classes (which create a bundle internaly). The bundle will 
> sort the hits using the integer representation of the string, which is the 
> best decision according to me. However, geting the list of integer values 
> from a hit mask seems to be a very expansive operation (more specifically, 
> relic::keys). Since there is no trivial way to get the integer value of the 
> column from a given row id, the keys method goes through every distinct 
> values in the index and check which one matched the mask. This is what is 
> hurting the perf badly compared to a plain uint column where the position of 
> a value is known implicitely (sizeof(uint)*index of 1s).
> 
> So i don't think there is a bad decision here rather than a missing data 
> structure that allows faster mask->category int value resolution. For 
> category columns, retrieving the string and then convert it back to int could 
> be better but i don't think it would beat a uint column.
> 
> My secondary test tend to prove all this by replacing my category column by a 
> real uint column (using values from the dictionary). The query runs much 
> faster because the cost or retrieving the uint value is very low compared to 
> the cost of relic::keys.
> 
> Hope this is clearer.
> 
> Thanks,
> 
> ________________________________________
> From: K. John Wu [[email protected]]
> Sent: January-17-12 9:28 PM
> To: FastBit Users
> Cc: Dominique Prunier
> Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY 
> columns
> 
> Hi, Dominique,
> 
> To get the values satisfying a query, there is a choice between
> whether to get the selected values from the bitmap index or from the
> original data file.  Looks like in this case, decision was wrong.  I'd
> be happy to take a more careful look into how this decision is made if
> you can provide me with a bit more detail..
> 
> John
> 
> 
> On 1/17/12 5:06 PM, Dominique Prunier wrote:
>> Hi,
>>
>>
>>
>> I was trying to troubleshoot some performance issues and i found some
>> surprising results, maybe related to the fact that i’m not so familiar
>> with FastBit.
>>
>>
>>
>> I have a partition containing 2,749,086 rows. I’m selecting a category
>> column with a quite selective where clause (430,906 hits over the
>> 2,749,086 rows, roughly 15%).
>>
>>
>>
>> After digging up a bit, i found that most of the time is spent at
>> ibis::relic::keys, which was not really what i would have expected (so
>> i added a timer to measure it):
>>
>>
>>
>> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
>> elapsed seconds
>>
>> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
>> seconds, 8.72055 elapsed seconds
>>
>>
>>
>> I was wondering what could be done to make this function faster.
>>
>>
>>
>> My understanding is that currently, it goes through all the bitmaps
>> for each distinct value (~37,000 for this column), bitwise-and it with
>> the hit vector and collects values for each matching position.
>>
>> What would you thing about some sort of index, similar to an actual
>> uint column where each string corresponding value would be stored at
>> its position.
>>
>> This way, i think it would be possible to collect keys much faster, at
>> speed similar to ibis::column::selectUInts.
>>
>>
>>
>> To evaluate the potential speedup, i build a UINT column from the
>> CATEGORY column, and selecting the UINT column instead of the CATEGORY
>> one is MUCH faster (>20x):
>>
>>
>>
>> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
>> seconds, 0.38984 elapsed seconds
>>
>>
>>
>> Do you think this is something big to implement ? Would it make sense
>> for you to evaluate this ?
>>
>>
>>
>> I’m waiting for your comments on this.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> */Dominique Prunier/**//*
>>
>>  APG Lead Developper
>>
>> Logo-W4N-100dpi
>>
>>  4388, rue Saint-Denis
>>
>>  Bureau 309
>>
>>  Montreal (Quebec)  H2J 2L1
>>
>>  Tel. +1 514-842-6767  x310
>>
>>  Fax +1 514-842-3989
>>
>>  [email protected] <mailto:[email protected]>
>>
>>  www.watch4net.com <http://www.watch4net.com/>
>>
>> /  /
>>
>> /This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise private information. If you have
>> received it in error, please notify the sender immediately and delete
>> the original. Any other use of this electronic mail by you is prohibited.
>>
>> //Ce message est pour le récipiendaire désigné seulement et peut
>> contenir des informations privilégiées, propriétaires ou autrement
>> privées. Si vous l'avez reçu par erreur, S.V.P. avisez l'expéditeur
>> immédiatement et effacez l'original. Toute autre utilisation de ce
>> courrier électronique par vous est prohibée.///
>>
>>
>>
>>
>>
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to