Dear Dr. Deri,

Thanks for your interest in our software.  I would say taking 30
seconds to read a subset from 10 million records is a little too long.
Here is a longer explanation.  Hope it helps.

John

PS: Longer explanation.

Bitmap indexes are very good at counting the records that satisfy
user-specified conditions and at locating the positions of those
records.  However, actually retrieving the records does take time: the
bitmap indexes cannot speed up the reads themselves.  Compared with
alternative schemes for retrieving these records, FastBit is
competitive in most cases.  There is a published comparison on this
(you can find the paper at
<http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4318091> and
<http://lbl.gov/%7Ekwu/ps/LBNL-62756.html>).  The specific comparison
is shown in Figures 8 & 9 and discussed in Section 6.4.

If you are primarily retrieving data according to the value of one or
two variables, we recommend that you reorder the data according to
those variables.  Otherwise, you are likely retrieving data records
from random locations in each data file, in which case all the pages
of the data files are read into memory (which is the best one can do).
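
To see why scattered hits end up touching nearly every page, here is a
rough back-of-the-envelope sketch in C.  It assumes each hit lands on a
page independently and uniformly at random, and that a clustered run of
hits starts on a page boundary.  The function names are made up for
this illustration and are not part of FastBit.

```c
#include <stddef.h>

/* Illustrative only: these helpers are not part of FastBit. */

/* Pages touched when the hits are contiguous (data reordered on the
   queried column), assuming the run starts on a page boundary. */
size_t pages_if_clustered(size_t hits, size_t values_per_page) {
    if (values_per_page == 0) return 0;
    return (hits + values_per_page - 1) / values_per_page;
}

/* Expected number of pages touched when each hit lands on a page chosen
   independently and uniformly at random (an assumption):
   pages * (1 - (1 - 1/pages)^hits). */
double pages_if_scattered(size_t hits, size_t pages) {
    if (pages == 0) return 0.0;
    double miss = 1.0;                       /* P(a given page is never hit) */
    double per_hit = 1.0 - 1.0 / (double)pages;
    for (size_t i = 0; i < hits; i++)
        miss *= per_hit;
    return (double)pages * (1.0 - miss);
}
```

For example, a 10-million-row int column stored in 4 KB pages (1024
values per page, about 9766 pages) with 100,000 hits touches 98 pages
in the clustered case, while the scattered estimate comes out at
essentially all 9766 pages.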

Short of reordering, another way to reduce the data retrieval time
is to retrieve fewer columns.  If you are going through command-line
tools like ibis, the output records are ordered.  If you can work with
the records in the same order as they are in the original data files,
then you can retrieve the values directly with one of the
ibis::part::selectTypes functions (where Type is one of the concrete
types, such as Int, Short, or Float).

The bottom line is this: assuming that your records are actually
scattered throughout the data files, the reading time should
dominate the time of retrieval and printing.  For 10 million records
where each record has two int columns, the total data file size
should be about 80 MB.  Assuming your disk system can support a 10 MB/s
reading speed, it would take about 8 seconds to complete the
retrieval.  10 million records should take a negligible amount of time
to sort, but may take a very significant amount of time to print to
the screen.  If you are outputting to the screen, I would suggest that
you output to a file instead (e.g., with ibis -output, or redirect the
screen output to a file).
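
The estimate above is simple arithmetic.  Here is a minimal sketch of
it in C, assuming 4-byte ints, 1 MB = 10^6 bytes, and that every page of
every retrieved column gets read.  The function names are hypothetical
and not part of FastBit.

```c
#include <stddef.h>

/* Illustrative only: these helpers are not part of FastBit. */

/* Total bytes read when every page of each retrieved column is read. */
double bytes_to_read(size_t records, size_t columns, size_t bytes_per_value) {
    return (double)records * (double)columns * (double)bytes_per_value;
}

/* Seconds needed at a sustained read rate given in MB/s
   (1 MB = 10^6 bytes).  Returns -1.0 for a non-positive rate. */
double read_seconds(size_t records, size_t columns,
                    size_t bytes_per_value, double mb_per_sec) {
    if (mb_per_sec <= 0.0) return -1.0;
    return bytes_to_read(records, columns, bytes_per_value)
           / (mb_per_sec * 1e6);
}
```

With 10 million records, two int columns, and 10 MB/s,
read_seconds(10000000, 2, 4, 10.0) gives 8 seconds.  Retrieving one
column instead of two halves that to 4 seconds, which is why selecting
fewer columns helps.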



On 10/5/2009 2:40 PM, Luca Deri wrote:
> Dear all
> I have been using fastbit since the initial release in the field of  
> network monitoring. While I'm impressed by fastbit performance for  
> counting records that match  some criteria, when I actually request to  
> read the matching records the performance is not great.
> 
> For instance I can count matching records in a matter of msec whereas  
> retrieving the actual data takes 30 sec (when using 10 Million  
> records) or more (data was already indexed) using ibis or similar  
> tool. I was wondering if I make some mistakes when building the  
> fastbit archives. For this reason I have built a simple program  
> (enclosed below) that I have used to create dummy data value to query.  
> Changing the input parameters, I see that the indexing speed changes  
> significantly, but the query speed is still the same.
> 
> Question: is the obtained performance what you also expect, or did I  
> make some mistakes while building the fastbit archives?
> 
> Thanks in advance, Luca
> 
> 
> ----
> 
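> #include <capi.h>
> #include <ctype.h>
> #include <stdio.h>      /* printf */
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>   /* struct timeval, gettimeofday */
> #include <sys/types.h>  /* u_int32_t */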
> 
> /* ****************************************************** */
> 
> void timeval_diff(struct timeval *begin, struct timeval *end,
>                   struct timeval *result) {
>    if(end->tv_sec >= begin->tv_sec) {
>      result->tv_sec = end->tv_sec-begin->tv_sec;
> 
>      if((end->tv_usec - begin->tv_usec) < 0) {
>        /* borrow one second; the result is always below 1000000 here */
>        result->tv_usec = 1000000 + end->tv_usec - begin->tv_usec;
>        result->tv_sec--;
>      } else
>        result->tv_usec = end->tv_usec-begin->tv_usec;
>    } else
>      result->tv_sec = 0, result->tv_usec = 0;
> }
> 
> void append(char *dir, int num, int total) {
>    int *a_vals, *b_vals, i;
>    struct timeval begin, end, diff;
>    u_int32_t v, tot;
> 
>    a_vals = (int*)malloc(sizeof(int)*num);
>    b_vals = (int*)malloc(sizeof(int)*num);
> 
>    if(a_vals && b_vals) {
>    for (i = 0; i < num; i++)
>      a_vals[i] = i, b_vals[i] = i;
> 
>    gettimeofday (& begin, NULL);
>    for(i=0; i<total; i++) {
>      fastbit_add_values("a", "int", a_vals, num, 0);
>      fastbit_add_values("b", "int", b_vals, num, 0);
> 
>      fastbit_flush_buffer(dir);
>    }
>    gettimeofday (& end, NULL);
>    timeval_diff (& begin, & end, & diff);
> 
>    v = diff.tv_sec*1000 + diff.tv_usec/1000;
>    tot = i*num;
> 
>    printf("written %d records on disk [%.2f sec][%.2f insert/sec]\n",
>        tot, (float)v/1000, (float)tot*1000/v);
>    }
> }
> 
> int main(int argc, char *argv[]) {
>    if(argc != 3) {
> 
>      printf("fastbit_test <records x shot> <num shots>\n");
>      return(0);
>    }
> 
>    append("fb", atoi(argv[1]), atoi(argv[2]));
> 
>    return(0);
> }
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
