Hey John,

I can certainly provide more detail. 

I think my test case can be reduced to a single partition with a single
CATEGORY column containing a lot of repetition, e.g. 1-2% distinct values
(sort order probably matters too).

What I'm doing is selecting this column (with or without a filter) through the
query/result classes (which create a bundle internally). The bundle sorts
the hits using the integer representation of the string, which I think is the
right decision. However, getting the list of integer values from a hit
mask seems to be a very expensive operation (more specifically, relic::keys).
Since there is no trivial way to get the integer value of the column from a
given row id, the keys method goes through every distinct value in the index
and checks which ones match the mask. This is what hurts performance so badly
compared to a plain uint column, where the position of each value is known
implicitly (sizeof(uint) * row index of each 1 in the mask).
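
To illustrate the cost I am describing, here is a rough sketch of the
per-distinct-value scan. It is not FastBit code: Bitmap and
keysByScanningBitmaps are names made up for this example, and it uses plain
uncompressed std::vector<bool> bitmaps, so it only shows the shape of the
work (distinct values times rows), not the real timings.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bitmap index (not a FastBit type):
// bitmaps[k][row] is true when the row holds the category with code k.
using Bitmap = std::vector<bool>;

// Visit every distinct value, intersect its bitmap with the hit mask and
// remember the code found at each matching row.
// Work grows with (number of distinct values) x (number of rows).
std::vector<std::uint32_t> keysByScanningBitmaps(
    const std::vector<Bitmap>& bitmaps, const Bitmap& hits) {
    std::vector<std::uint32_t> codeAtRow(hits.size(), 0);
    for (std::size_t k = 0; k < bitmaps.size(); ++k) {
        const Bitmap& bm = bitmaps[k];
        for (std::size_t row = 0; row < hits.size() && row < bm.size(); ++row) {
            if (hits[row] && bm[row]) {
                codeAtRow[row] = static_cast<std::uint32_t>(k);
            }
        }
    }
    // Emit the codes of the selected rows in row order.
    std::vector<std::uint32_t> keys;
    for (std::size_t row = 0; row < hits.size(); ++row) {
        if (hits[row]) keys.push_back(codeAtRow[row]);
    }
    return keys;
}
```

With ~37,000 distinct values, the outer loop alone runs ~37,000 times no
matter how few rows are selected.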

So I don't think a bad decision is being made here; rather, a data structure
that allows faster mask -> category int value resolution is missing. For
category columns, retrieving the string and then converting it back to an int
could be better, but I don't think it would beat a uint column.
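
The kind of structure I have in mind could look roughly like this. Again
this is only a sketch under my own assumptions, not an existing FastBit
API: keysFromRowCodes and the codes array (one integer code per row) are
hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-ordered array: codes[row] is the integer code of the
// category stored at that row (what a plain uint column gives for free).
// Resolving a hit mask is then a single pass over the rows, independent
// of the number of distinct values.
std::vector<std::uint32_t> keysFromRowCodes(
    const std::vector<std::uint32_t>& codes, const std::vector<bool>& hits) {
    std::vector<std::uint32_t> keys;
    for (std::size_t row = 0; row < hits.size() && row < codes.size(); ++row) {
        if (hits[row]) keys.push_back(codes[row]);
    }
    return keys;
}
```

The trade-off is the extra storage of one integer per row on top of the
bitmap index.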

My second test tends to confirm all this: I replaced my category column with a
real uint column (using values from the dictionary). The query runs much faster
because the cost of retrieving the uint value is very low compared to the cost
of relic::keys.
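
For reference, deriving such a uint column from the strings can be sketched
as below. This is an illustration only: EncodedColumn and encodeCategories
are hypothetical names, and codes are assigned in first-seen order, which
is an assumption and may not match how FastBit numbers its dictionary
entries.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: a dictionary plus one integer code per row.
struct EncodedColumn {
    std::vector<std::string> dictionary;  // code -> string
    std::vector<std::uint32_t> codes;     // row  -> code
};

// Assign each distinct string a code in first-seen order (an assumption
// made for this sketch) and emit the per-row codes.
EncodedColumn encodeCategories(const std::vector<std::string>& values) {
    EncodedColumn out;
    std::unordered_map<std::string, std::uint32_t> codeOf;
    out.codes.reserve(values.size());
    for (const std::string& v : values) {
        auto it = codeOf.find(v);
        if (it == codeOf.end()) {
            const auto code = static_cast<std::uint32_t>(out.dictionary.size());
            it = codeOf.emplace(v, code).first;
            out.dictionary.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}
```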

Hope this is clearer.

Thanks,

________________________________________
From: K. John Wu [[email protected]]
Sent: January-17-12 9:28 PM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] ibis::relic::keys performance and CATEGORY columns

Hi, Dominique,

To get the values satisfying a query, there is a choice between
getting the selected values from the bitmap index or from the
original data file.  It looks like in this case the decision was wrong.
I'd be happy to take a more careful look into how this decision is made if
you can provide me with a bit more detail.

John


On 1/17/12 5:06 PM, Dominique Prunier wrote:
> Hi,
>
>
>
> I was trying to troubleshoot some performance issues and I found some
> surprising results, maybe related to the fact that I’m not so familiar
> with FastBit.
>
>
>
> I have a partition containing 2,749,086 rows. I’m selecting a category
> column with a quite selective where clause (430,906 hits over the
> 2,749,086 rows, roughly 15%).
>
>
>
> After digging a bit, I found that most of the time is spent in
> ibis::relic::keys, which was not really what I would have expected (so
> I added a timer to measure it):
>
>
>
> relic::keys -- loop to generate ii took 8.250746 CPU seconds, 8.257573
> elapsed seconds
>
> doQuery:: evaluate(<query>) produced 430906 hits, took 8.71168 CPU
> seconds, 8.72055 elapsed seconds
>
>
>
> I was wondering what could be done to make this function faster.
>
>
>
> My understanding is that currently, it goes through the bitmaps
> for each distinct value (~37,000 for this column), bitwise-ANDs each with
> the hit vector and collects values for each matching position.
>
> What would you think about some sort of index, similar to an actual
> uint column, where the value corresponding to each string would be
> stored at its position?
>
> This way, I think it would be possible to collect keys much faster, at
> a speed similar to ibis::column::selectUInts.
>
>
>
> To evaluate the potential speedup, I built a UINT column from the
> CATEGORY column, and selecting the UINT column instead of the CATEGORY
> one is MUCH faster (>20x):
>
>
>
> doQuery:: evaluate(<query>) produced 430906 hits, took 0.388941 CPU
> seconds, 0.38984 elapsed seconds
>
>
>
> Do you think this would be a big thing to implement? Would it make sense
> for you to evaluate this?
>
>
>
> I’m waiting for your comments on this.
>
>
>
> Thanks,
>
>
>
> Dominique Prunier
>
>  APG Lead Developer
>
>  4388, rue Saint-Denis
>
>  Bureau 309
>
>  Montreal (Quebec)  H2J 2L1
>
>  Tel. +1 514-842-6767  x310
>
>  Fax +1 514-842-3989
>
>  [email protected] <mailto:[email protected]>
>
>  www.watch4net.com <http://www.watch4net.com/>
>
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
