Hi,

I have a interesting use-case that i'd like to share with everybody here to 
have your thoughts.
I have a large partition (261 columns, almost all of them being CATEGORY 
columns) with roughly 7.5M rows.
Some of these columns have very low cardinality (~10 distinct values) and some 
other have rather high cardinality (~100 000 distinct values).

On this partition, i'm running a very simple test program to test various WHERE 
clauses:

for (int run = 0; run < RUN; run++) {
    gettimeofday(&start, NULL);
    for (int cnt = 0; cnt < CNT; cnt++) {
      FastBitQueryHandle q = fastbit_build_query(NULL, PART, WHERE);
      if (q != NULL) {
          fastbit_destroy_query(q);
      }
    }
    gettimeofday(&end, NULL);
    fprintf(stderr, "capi: %li us\n", time(start, end) / CNT);
}

Here are the results for various WHERE clause:

p_1='A'                                                     capi: 1371 rows / 
19 us
p_1='A' and p_2='B'                                         capi: 117 rows / 30 
us
p_1='A' and p_2='B' and p_3='C'                             capi: 11 rows / 417 
us
p_1='A' and p_2='B' and p_3='C' and p_4='D'                 capi: 11 rows / 880 
us
p_1='A' and p_2='B' and p_3='C' and p_4='D' and p_4='D'     capi: 11 rows / 
1350 us

What i'm trying to understand is the big jump between the second and the third 
WHERE.

The two properties p_3 and p_4 have very low cardinality (~10-20 distinct 
values) but spread across the partition.
My guess is that they have a very bad index compression ratio (and it turns out 
that the .idx files are among the biggest, even with such a low distinct count).
Given these results, i'd better select p_3 and p_4 and filter it in the 
application than run the fourth query but this is quite hard to guess.

How would you handle this ? Is there any development in FastBit that would try 
to address this use case ?

Thanks for your comments !

Dominique Prunier
 APG Lead Developper
[cid:[email protected]]
 4388, rue Saint-Denis
 Bureau 309
 Montreal (Quebec)  H2J 2L1
 Tel. +1 514-842-6767  x310
 Fax +1 514-842-3989
 [email protected]<mailto:[email protected]>
 www.watch4net.com<http://www.watch4net.com/>

This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise private information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of this electronic mail by you is prohibited.

Ce message est pour le récipiendaire désigné seulement et peut contenir des 
informations privilégiées, propriétaires ou autrement privées. Si vous l'avez 
reçu par erreur, S.V.P. avisez l'expéditeur immédiatement et effacez 
l'original. Toute autre utilisation de ce courrier électronique par vous est 
prohibée.

<<inline: image001.gif>>

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to