John, many thanks for your reply. I have sorted the results and the search time has dropped by about 50%. I will read your papers to better understand the fastbit internals and perhaps make better use of the library. In any case, I can confirm that thanks to your work I have been able to dramatically improve performance with respect to SQL databases. I encourage you to continue the development and further improve fastbit.

Regards
Luca

On Oct 6, 2009, at 1:25 AM, K. John Wu wrote:

> Dear Dr. Deri,
>
> Thanks for your interest in our software. I would say taking 30
> seconds to read a subset from 10 million records is a little too long.
> Here is a longer explanation. Hope it helps.
>
> John
>
> PS: Longer explanation.
>
> The bitmap indexes are very good for counting the records that satisfy
> user-specified conditions and locating the positions of these records.
> However, actually retrieving these records does take time -- the
> bitmap indexes cannot help with the reads directly. Compared with the
> alternative schemes for retrieving these records, FastBit is
> competitive in most cases. There is a published comparison on this
> (you can find this paper at
> <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4318091> and
> <http://lbl.gov/%7Ekwu/ps/LBNL-62756.html>). The specific comparison
> is shown in Figures 8 & 9 and discussed in Section 6.4.
>
> If you are primarily retrieving data according to the value of one or
> two variables, we recommend that you reorder the data according to
> those variables. Otherwise, you are likely retrieving data records
> from random locations in each data file, in which case all the pages
> of the data files are read into memory (which is the best one can do).
>
> Short of reordering, another way to reduce the time of data retrieval
> is to retrieve fewer columns. If you are going through command-line
> tools like ibis, the output records are ordered. If you can work with
> the records in the same order as they are in the original data files,
> then you can retrieve the values directly with one of
> ibis::part::selectTypes (where Type is one of the concrete types, such
> as Int, Short, or Float).
>
> The bottom line is this: assuming that your records are actually
> scattered throughout the data files, the reading time should
> dominate the time of retrieval and printing. For 10 million records
> where each record has two int columns, the total data file size
> should be about 80 MB. Assuming your disk system can support 10 MB/s
> reading speed, it would take about 8 seconds to complete the
> retrieval. 10 million records should take a negligible amount of time
> to sort, but may take a very significant amount of time to print to
> screen. If you are outputting to screen, I would suggest that you
> output to a file instead (e.g., with ibis -output, or redirect the
> screen output to a file).
>
>
>
> On 10/5/2009 2:40 PM, Luca Deri wrote:
>> Dear all
>> I have been using fastbit since the initial release in the field of
>> network monitoring. While I'm impressed by fastbit performance for
>> counting records that match some criteria, when I actually request
>> to read the matching records the performance is not great.
>>
>> For instance, I can count matching records in a matter of msec whereas
>> retrieving the actual data takes 30 sec (when using 10 million
>> records) or more (data was already indexed) using ibis or a similar
>> tool. I was wondering if I made some mistakes when building the
>> fastbit archives. For this reason I have built a simple program
>> (enclosed below) that I have used to create dummy data values to
>> query. Changing the input parameters, I see that the indexing speed
>> changes significantly, but the query speed stays the same.
>>
>> Question: is the obtained performance what you also expect, or did I
>> make some mistakes while building the fastbit archives?
>>
>> Thanks in advance, Luca
>>
>>
>> ----
>>
>> #include <capi.h>
>> #include <ctype.h>
>> #include <stdio.h>      /* printf */
>> #include <string.h>
>> #include <stdlib.h>
>> #include <sys/time.h>   /* gettimeofday, struct timeval */
>> #include <sys/types.h>  /* u_int32_t */
>>
>> /* ****************************************************** */
>>
>> /* Store end - begin in result; clamp to zero if end precedes begin. */
>> void timeval_diff(const struct timeval *begin, const struct timeval *end,
>>                   struct timeval *result) {
>>   result->tv_sec  = end->tv_sec - begin->tv_sec;
>>   result->tv_usec = end->tv_usec - begin->tv_usec;
>>
>>   if(result->tv_usec < 0) {
>>     /* Borrow one second for the microseconds field */
>>     result->tv_sec--;
>>     result->tv_usec += 1000000;
>>   }
>>
>>   if(result->tv_sec < 0)
>>     result->tv_sec = 0, result->tv_usec = 0;
>> }
>>
>> /* Append 'total' batches of 'num' records (columns a and b) to 'dir'. */
>> void append(char *dir, int num, int total) {
>>   int *a_vals, *b_vals, i;
>>   struct timeval begin, end, diff;
>>   u_int32_t v, tot;
>>
>>   a_vals = (int*)malloc(sizeof(int)*num);
>>   b_vals = (int*)malloc(sizeof(int)*num);
>>
>>   if(a_vals && b_vals) {
>>     for(i = 0; i < num; i++)
>>       a_vals[i] = i, b_vals[i] = i;
>>
>>     gettimeofday(&begin, NULL);
>>     for(i = 0; i < total; i++) {
>>       fastbit_add_values("a", "int", a_vals, num, 0);
>>       fastbit_add_values("b", "int", b_vals, num, 0);
>>
>>       fastbit_flush_buffer(dir);
>>     }
>>     gettimeofday(&end, NULL);
>>     timeval_diff(&begin, &end, &diff);
>>
>>     v = diff.tv_sec*1000 + diff.tv_usec/1000; /* elapsed msec */
>>     if(v == 0) v = 1;                         /* avoid division by zero */
>>     tot = (u_int32_t)i*num;
>>
>>     printf("written %u records on disk [%.2f sec][%.2f insert/sec]\n",
>>            tot, (float)v/1000, (float)tot*1000/v);
>>   }
>>
>>   free(a_vals); /* free(NULL) is a no-op */
>>   free(b_vals);
>> }
>>
>> int main(int argc, char *argv[]) {
>>   if(argc != 3) {
>>     printf("fastbit_test <records x shot> <num shots>\n");
>>     return(0);
>>   }
>>
>>   append("fb", atoi(argv[1]), atoi(argv[2]));
>>
>>   return(0);
>> }

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
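
The back-of-the-envelope estimate in John's reply (10 million records, two int columns, 10 MB/s) can be reproduced with a few lines of C. This is only a sketch of that arithmetic: the helper names raw_data_bytes and read_seconds are hypothetical and not part of FastBit, and it assumes 4-byte ints and 1 MB = 1,000,000 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a FastBit call): size of the raw column data,
 * computed as rows * columns * bytes per value. */
static uint64_t raw_data_bytes(uint64_t rows, unsigned columns,
                               size_t bytes_per_value) {
  return rows * (uint64_t)columns * (uint64_t)bytes_per_value;
}

/* Hypothetical helper: seconds needed to read 'bytes' at a sustained
 * throughput given in bytes per second. Returns -1.0 if the throughput
 * is not positive. */
static double read_seconds(uint64_t bytes, double bytes_per_second) {
  if(bytes_per_second <= 0.0)
    return -1.0;
  return (double)bytes / bytes_per_second;
}
```

With the numbers from the thread, raw_data_bytes(10000000, 2, 4) gives 80,000,000 bytes (about 80 MB), and read_seconds(80000000, 10e6) gives 8 seconds. That figure is a lower bound for the scattered-read case, since it ignores seek time, sorting, and printing.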
