Dear Dr. Deri,

Thanks for your interest in our software. I would say that taking 30 seconds to read a subset from 10 million records is a little too long. Here is a longer explanation. Hope it helps.

John

PS: Longer explanation.

The bitmap indexes are very good for counting the records that satisfy user-specified conditions and for locating the positions of these records. However, actually retrieving these records does take time, and the bitmap indexes cannot help with the reading directly. Compared with the alternative schemes for retrieving these records, FastBit is competitive in most cases. There is a published comparison on this (you can find the paper at <http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4318091> and <http://lbl.gov/%7Ekwu/ps/LBNL-62756.html>). The specific comparison is shown in Figures 8 & 9 and discussed in Section 6.4.

If you are primarily retrieving data according to the values of one or two variables, we recommend that you reorder the data according to those variables. Otherwise, you are likely retrieving data records from random locations in each data file, in which case all the pages of the data files are read into memory (which is the best one can do). Short of reordering, another way to reduce the time of data retrieval is to retrieve fewer columns.

If you are going through command-line tools like ibis, the output records are ordered. If you can work with the records in the same order as they are in the original data files, then you can retrieve the values directly with one of ibis::part::selectTypes (where Type is one of the concrete types, such as Int, Short, or Float).

The bottom line is this: assuming that your records are actually scattered throughout the data files, the reading time should dominate the time of retrieval and printing. For 10 million records where each record has two int columns, the total data file size should be about 80 MB. Assuming your disk system can support a reading speed of 10 MB/s, it would take about 8 seconds to complete the retrieval. Sorting 10 million records should take a negligible amount of time, but printing them to screen may take a very significant amount of time.

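The back-of-the-envelope estimate above can be written out as a small C sketch. It is only an illustration of the arithmetic, not part of FastBit: the function names are made up for this example, and it assumes 4-byte int values, uncompressed fixed-width column files, 1 MB = 10^6 bytes, and that every page of the data files has to be read.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Total bytes in the raw column files, assuming fixed-width values
 * stored uncompressed (an assumption of this sketch). */
uint64_t data_bytes(uint64_t num_records, uint32_t num_columns,
                    uint32_t bytes_per_value) {
    return num_records * num_columns * bytes_per_value;
}

/* Rough retrieval time in seconds if all pages of the data files
 * must be read at the given sustained throughput. */
double read_seconds(uint64_t total_bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return (double)total_bytes / bytes_per_second;
}

/* Print the estimate for one configuration (throughput in MB/s,
 * with 1 MB taken as 10^6 bytes). */
void print_estimate(uint64_t num_records, uint32_t num_columns,
                    uint32_t bytes_per_value, double mb_per_second) {
    uint64_t bytes = data_bytes(num_records, num_columns, bytes_per_value);
    double secs = read_seconds(bytes, mb_per_second * 1e6);
    printf("%llu records x %u columns: %.0f MB, about %.1f s at %.0f MB/s\n",
           (unsigned long long)num_records, (unsigned)num_columns,
           (double)bytes / 1e6, secs, mb_per_second);
}
```

Calling print_estimate(10000000, 2, 4, 10.0) corresponds to the 80 MB and 8 second figures above. Passing 1 for the number of columns shows why retrieving fewer columns helps: both the data volume and the estimated time are halved.
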
If you are outputting to screen, I would suggest that you output to a file instead (e.g., with ibis -output, or redirect the screen output to a file).

On 10/5/2009 2:40 PM, Luca Deri wrote:
> Dear all
> I have been using fastbit since the initial release in the field of
> network monitoring. While I'm impressed by fastbit performance for
> counting records that match some criteria, when I actually request to
> read the matching records the performance is not great.
>
> For instance I can count matching records in a matter of msec whereas
> retrieving the actual data takes 30 sec (when using 10 Million
> records) or more (data was already indexed) using ibis or similar
> tool. I was wondering if I make some mistakes when building the
> fastbit archives. For this reason I have built a simple program
> (enclosed below) that I have used to create dummy data value to query.
> Changing the input parameters, I see that the indexing speed changes
> significantly, but the query speed is still the same.
>
> Question: is the obtained performance what you also expect, or did I
> make some mistakes while building the fastbit archives?
>
> Thanks in advance, Luca
>
>
> ----
>
> #include <capi.h>
> #include <ctype.h>
> #include <stdio.h>      /* printf */
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>   /* gettimeofday, struct timeval */
> #include <sys/types.h>  /* u_int32_t */
>
> /* ****************************************************** */
>
> /* Store (end - begin) in result; zero if end is before begin. */
> void timeval_diff(struct timeval *begin, struct timeval *end,
>                   struct timeval *result) {
>   if(end->tv_sec >= begin->tv_sec) {
>     result->tv_sec = end->tv_sec-begin->tv_sec;
>
>     if((end->tv_usec - begin->tv_usec) < 0) {
>       /* Borrow one second for the microsecond field. */
>       result->tv_usec = 1000000 + end->tv_usec - begin->tv_usec;
>       result->tv_sec--;
>     } else
>       result->tv_usec = end->tv_usec-begin->tv_usec;
>   } else
>     result->tv_sec = 0, result->tv_usec = 0;
> }
>
> /* Write 'total' batches of 'num' records (columns a and b) into dir. */
> void append(char *dir, int num, int total) {
>   int *a_vals, *b_vals, i;
>   struct timeval begin, end, diff;
>   u_int32_t v, tot;
>
>   a_vals = (int*)malloc(sizeof(int)*num);
>   b_vals = (int*)malloc(sizeof(int)*num);
>
>   if(a_vals && b_vals) {
>     for (i = 0; i < num; i++)
>       a_vals[i] = i, b_vals[i] = i;
>
>     gettimeofday (& begin, NULL);
>     for(i=0; i<total; i++) {
>       fastbit_add_values("a", "int", a_vals, num, 0);
>       fastbit_add_values("b", "int", b_vals, num, 0);
>
>       fastbit_flush_buffer(dir);
>     }
>     gettimeofday (& end, NULL);
>     timeval_diff (& begin, & end, & diff);
>
>     v = diff.tv_sec*1000 + diff.tv_usec/1000;  /* elapsed msec */
>     tot = i*num;
>
>     printf("written %u records on disk [%.2f sec][%.2f insert/sec]\n",
>            tot, (float)v/1000, (float)tot*1000/v);
>   }
>
>   free(a_vals);
>   free(b_vals);
> }
>
> int main(int argc, char *argv[]) {
>   if(argc != 3) {
>     printf("fastbit_test <records x shot> <num shots>\n");
>     return(0);
>   }
>
>   append("fb", atoi(argv[1]), atoi(argv[2]));
>
>   return(0);
> }
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
