Hey,

A couple of remarks:
 * About the reordering: it is the first solution I thought of, but I
remembered it is not available for CATEGORY columns. Maybe that would be
worth implementing in the near future (using the .int file, it could be
roughly the same as an integer column sort, except that the data file
would have to be recreated, right?)
 * I was wondering if we could end up with another where clause evaluation
algorithm that evaluates some of the terms using the data and others using
the indexes, according to some sort of "index badness score" (probably
based on the index size or something similar) and the current size of the
hit vector.
 * I still have to dig to find where most of the time is spent in each of
my tests. Have you ever used a profiler on FastBit that you would recommend
starting with (gcov, ...)?
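
To make the second remark a bit more concrete, here is a minimal sketch in C
of the kind of decision I have in mind. It is not FastBit code: the names
choose_strategy and scan_hits, the cost model (bitmap size in bytes versus
bytes touched per current hit) and the idea of a column stored as plain
integer codes are all assumptions made for illustration only.

```c
#include <stddef.h>

enum strategy { USE_INDEX, SCAN_HITS };

/* Hypothetical cost model: answering a term from the index costs the size
   of the bitmap to read; answering it from the data costs bytes_per_row
   for every row still in the hit vector. Pick the cheaper one. */
enum strategy choose_strategy(size_t bitmap_bytes, size_t current_hits,
                              size_t bytes_per_row)
{
    size_t scan_cost = current_hits * bytes_per_row;
    return (scan_cost < bitmap_bytes) ? SCAN_HITS : USE_INDEX;
}

/* Data-side evaluation of one equality term: keep only the row ids in
   hits[0..nhits) whose integer code equals wanted. The array is compacted
   in place and the new number of hits is returned. */
size_t scan_hits(const int *codes, size_t *hits, size_t nhits, int wanted)
{
    size_t kept = 0;
    for (size_t i = 0; i < nhits; i++) {
        if (codes[hits[i]] == wanted)
            hits[kept++] = hits[i];
    }
    return kept;
}
```

With a small hit vector and a large bitmap, the scan wins under this model;
with a large hit vector and a small bitmap, the index wins. The real score
would obviously need tuning against measurements.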

Thanks,

________________________________________
From: K. John Wu [[email protected]]
Sent: January-24-12 9:06 PM
To: FastBit Users
Cc: Dominique Prunier
Subject: Re: [FastBit-users] Performance optimization use-case

Hi, Dominique,

FastBit processes the terms in the where clause one at a time.  In
this context, the where clause "p_1='A' and p_2='B' and p_3='C'" is
said to have three terms, (1) "p_1='A'", (2) "p_2='B'", and (3)
"p_3='C'".  Each term in this case can be answered by reading one
bitmap from the bitmap index file (.idx).  The larger index file for
p_3 indicates that the bitmap to be read from p_3.idx is probably much
larger than the others.
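
As a rough illustration of the term-by-term evaluation described above, the
sketch below ANDs one bitmap per term into a hit vector.  It is not FastBit's
actual code: it assumes plain uncompressed bitmaps stored as 64-bit words,
and the function names and_terms and count_hits are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine nterms bitmaps of nwords 64-bit words each into hits.
   Bit i set in terms[t] means row i satisfies term t. */
void and_terms(const uint64_t *const *terms, size_t nterms,
               size_t nwords, uint64_t *hits)
{
    for (size_t w = 0; w < nwords; w++)
        hits[w] = ~UINT64_C(0);            /* start with every row as a hit */
    for (size_t t = 0; t < nterms; t++) {  /* process one term at a time */
        for (size_t w = 0; w < nwords; w++)
            hits[w] &= terms[t][w];
    }
}

/* Count the rows that satisfy every term. */
size_t count_hits(const uint64_t *hits, size_t nwords)
{
    size_t n = 0;
    for (size_t w = 0; w < nwords; w++) {
        for (uint64_t x = hits[w]; x != 0; x &= x - 1)
            n++;                           /* clears the lowest set bit */
    }
    return n;
}
```

In this simplified picture, every term costs one pass over a whole bitmap, so
a term whose bitmap is large dominates the total time no matter how few hits
remain.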

To reduce the index size for p_3, sort the data according to p_3.  You
should be able to give a list of columns to use as sorting keys.  In
your case, if p_3 and p_4 are the problematic ones, you should give
both of them as sorting keys.  Since they have relatively low column
cardinalities, sorting should work reasonably well.  However, sorting
may make the other columns' indexes larger.  So, give it a try and see
what happens.
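
To see why sorting helps, recall that the bitmaps are run-length compressed,
so a bitmap with fewer runs of identical bits takes less space.  The sketch
below is only a proxy for that effect, not a measurement of real index size:
it counts the runs in the bitmap "col[i] == value" before and after sorting
a single column of integer codes.  The names count_runs and sort_column are
hypothetical, and a real reordering moves whole rows, not one column.

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of runs of consecutive equal bits in the bitmap "col[i] == value".
   Fewer runs means a run-length encoder has less to store. */
size_t count_runs(const int *col, size_t n, int value)
{
    if (n == 0)
        return 0;
    size_t runs = 1;
    for (size_t i = 1; i < n; i++) {
        if ((col[i] == value) != (col[i - 1] == value))
            runs++;
    }
    return runs;
}

/* Three-way comparison for qsort that cannot overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort the column in place, as a stand-in for reordering the partition. */
void sort_column(int *col, size_t n)
{
    qsort(col, n, sizeof col[0], cmp_int);
}
```

For a low-cardinality column whose values are scattered across the
partition, the run count per bitmap is close to the number of occurrences of
the value; after sorting on that column it drops to at most three.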

John



On 1/24/12 4:17 PM, Dominique Prunier wrote:
> Hi,
>
> I have an interesting use case that I'd like to share with everybody
> here to get your thoughts.
>
> I have a large partition (261 columns, almost all of them being
> CATEGORY columns) with roughly 7.5M rows.
> Some of these columns have very low cardinality (~10 distinct values)
> and some others have rather high cardinality (~100 000 distinct values).
>
> On this partition, I'm running a very simple test program to test
> various WHERE clauses:
>
> for (int run = 0; run < RUN; run++) {
>     gettimeofday(&start, NULL);
>     for (int cnt = 0; cnt < CNT; cnt++) {
>         FastBitQueryHandle q = fastbit_build_query(NULL, PART, WHERE);
>         if (q != NULL) {
>             fastbit_destroy_query(q);
>         }
>     }
>     gettimeofday(&end, NULL);
>     fprintf(stderr, "capi: %li us\n", time(start, end) / CNT);
> }
>
> Here are the results for various WHERE clauses:
>
> WHERE clause                                             rows  time (capi)
> p_1='A'                                                  1371    19 us
> p_1='A' and p_2='B'                                       117    30 us
> p_1='A' and p_2='B' and p_3='C'                            11   417 us
> p_1='A' and p_2='B' and p_3='C' and p_4='D'                11   880 us
> p_1='A' and p_2='B' and p_3='C' and p_4='D' and p_4='D'    11  1350 us
>
> What I'm trying to understand is the big jump between the second and
> the third WHERE clause.
>
> The two properties p_3 and p_4 have very low cardinality (~10-20
> distinct values), but the values are spread across the partition.
> My guess is that they have a very bad index compression ratio (and it
> turns out that their .idx files are among the biggest, even with such a
> low distinct count).
>
> Given these results, I would be better off selecting p_3 and p_4 and
> filtering on them in the application than running the fourth query, but
> this is quite hard to guess in advance.
>
> How would you handle this? Is there any development in FastBit that
> would try to address this use case?
>
> Thanks for your comments!
>
> Dominique Prunier
> APG Lead Developer
> 4388, rue Saint-Denis
> Bureau 309
> Montreal (Quebec)  H2J 2L1
> Tel. +1 514-842-6767  x310
> Fax +1 514-842-3989
> [email protected] <mailto:[email protected]>
> www.watch4net.com <http://www.watch4net.com/>
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users