Hi, I have a interesting use-case that i'd like to share with everybody here to have your thoughts. I have a large partition (261 columns, almost all of them being CATEGORY columns) with roughly 7.5M rows. Some of these columns have very low cardinality (~10 distinct values) and some other have rather high cardinality (~100 000 distinct values).
On this partition, i'm running a very simple test program to test various WHERE
clauses:
for (int run = 0; run < RUN; run++) {
gettimeofday(&start, NULL);
for (int cnt = 0; cnt < CNT; cnt++) {
FastBitQueryHandle q = fastbit_build_query(NULL, PART, WHERE);
if (q != NULL) {
fastbit_destroy_query(q);
}
}
gettimeofday(&end, NULL);
fprintf(stderr, "capi: %li us\n", time(start, end) / CNT);
}
Here are the results for various WHERE clause:
p_1='A' capi: 1371 rows /
19 us
p_1='A' and p_2='B' capi: 117 rows / 30
us
p_1='A' and p_2='B' and p_3='C' capi: 11 rows / 417
us
p_1='A' and p_2='B' and p_3='C' and p_4='D' capi: 11 rows / 880
us
p_1='A' and p_2='B' and p_3='C' and p_4='D' and p_4='D' capi: 11 rows /
1350 us
What i'm trying to understand is the big jump between the second and the third
WHERE.
The two properties p_3 and p_4 have very low cardinality (~10-20 distinct
values) but spread across the partition.
My guess is that they have a very bad index compression ratio (and it turns out
that the .idx files are among the biggest, even with such a low distinct count).
Given these results, i'd better select p_3 and p_4 and filter it in the
application than run the fourth query but this is quite hard to guess.
How would you handle this ? Is there any development in FastBit that would try
to address this use case ?
Thanks for your comments !
Dominique Prunier
APG Lead Developper
[cid:[email protected]]
4388, rue Saint-Denis
Bureau 309
Montreal (Quebec) H2J 2L1
Tel. +1 514-842-6767 x310
Fax +1 514-842-3989
[email protected]<mailto:[email protected]>
www.watch4net.com<http://www.watch4net.com/>
This message is for the designated recipient only and may contain privileged,
proprietary, or otherwise private information. If you have received it in
error, please notify the sender immediately and delete the original. Any other
use of this electronic mail by you is prohibited.
Ce message est pour le récipiendaire désigné seulement et peut contenir des
informations privilégiées, propriétaires ou autrement privées. Si vous l'avez
reçu par erreur, S.V.P. avisez l'expéditeur immédiatement et effacez
l'original. Toute autre utilisation de ce courrier électronique par vous est
prohibée.
<<inline: image001.gif>>
_______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
