Hi,

I am trying to figure out how to incrementally add data to a FastBit index
using ardea and ibis.  I'm getting some odd results, and I can't tell
whether I am doing something wrong, misunderstanding how things work, or
hitting bugs.

First, I want to make sure I understand the concept of the FastBit "data
directory" correctly.  As I understand it, one data directory can contain
many partitions.  I want to be able to add partitions dynamically.

I thought the following would work:
$ mkdir /tmp/test

Load 10M rows:
$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
"/tmp/test"
  Will attempt to parse 1 CSV file
/tmp/fbdata.txt
 with the following column names and types
c2:int,c3:int


/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read
CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
10000000 rows from /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
4.69221 sec(CPU), 4.39101 sec(elapsed)

Load 10M more rows:
$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
"/tmp/test"
  Will attempt to parse 1 CSV file
/tmp/fbdata.txt
 with the following column names and types
c2:int,c3:int


/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read
CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
10000000 rows from /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
6.5238 sec(CPU), 6.53868 sec(elapsed)

Now check the number of rows in the data directory:
$ ./ibis -d /tmp/test -q "select count(*)"

SELECT count(*) FROM T-_01 WHERE 1=1 produced a table with 1 row and 1
column
-- the result table (1 x 1) for "SELECT count(*) FROM T-_01 WHERE 1=1"
30000000

It should be 20000000 (two loads of 10000000 rows each), not 30000000.

If I pass a different partition name (-n) to ardea each time, it works:
$ rm -rf /tmp/test && mkdir /tmp/test
$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt -n p1
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
"/tmp/test"
  Will attempt to parse 1 CSV file
/tmp/fbdata.txt
 with the following column names and types
c2:int,c3:int


/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read
CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
10000000 rows from /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
4.35042 sec(CPU), 4.13331 sec(elapsed)
$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt -n p2
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
"/tmp/test"
  Will attempt to parse 1 CSV file
/tmp/fbdata.txt
 with the following column names and types
c2:int,c3:int


/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read
CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
10000000 rows from /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
4.32884 sec(CPU), 4.05803 sec(elapsed)

$ ./ibis -d /tmp/test -q "select count(*)"

SELECT count(*) FROM T-p1 WHERE 1=1 produced a table with 1 row and 1 column
-- the result table (1 x 1) for "SELECT count(*) FROM T-p1 WHERE 1=1"
20000000

I also get odd output when I try to merge two data directories that
contain partitions:
$ mkdir /tmp/fb1
$ ./ardea -p 5000000 -d /tmp/fb1 -m c2:int,c3:int -t /tmp/fbdata.txt
...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
4.49258 sec(CPU), 4.23947 sec(elapsed)


$ mkdir /tmp/fb2
$ ./ardea -p 5000000 -d /tmp/fb2 -m c2:int,c3:int -t /tmp/fbdata.txt
...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
4.28029 sec(CPU), 4.05654 sec(elapsed)

$ ./ibis -d /tmp/fb1 -a /tmp/fb2 /tmp/fb1

Sat Feb 21 11:57:05 2015
Warning -- part[fb1]::appendToBackup -- expected to add 10000000 elements
of "c2", but actually added 5000000
Sat Feb 21 11:57:06 2015
Warning -- part[fb1]::appendToBackup -- expected to add 10000000 elements
of "c3", but actually added 5000000
part[fb1]::append -- committed to use the updated dataset with 20000000
rows and 2 columns
doAppend(/tmp/fb2): added 10000000 rows from /tmp/fb2 to data partition fb1
located in /tmp/fb1

It is the warnings that concern me, though I think they are spurious: the
row count in the -part.txt file for each partition looks wrong, as this
test demonstrates:
$ ./ardea -p 5000000 -d /tmp/fb3 -m c2:int,c3:int -t /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
"/tmp/fb3"
  Will attempt to parse 1 CSV file
/tmp/fbdata.txt
 with the following column names and types
c2:int,c3:int


/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read
CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
10000000 rows from /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration:
7.26937 sec(CPU), 7.26995 sec(elapsed)

$ grep row /tmp/fb3/-part.txt
Number_of_rows = 10000000

$ grep row /tmp/fb3/_01/-part.txt
Number_of_rows = 10000000
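
A rough sketch for checking every partition at once, rather than grepping
each -part.txt by hand: it lists the row count recorded in each -part.txt
under a data directory and prints the sum.  It assumes each file contains a
line of the form "Number_of_rows = N", as in the grep output above.  The
function name part_rows is hypothetical and not part of FastBit.

```sh
# part_rows DIR: print the Number_of_rows value found in each -part.txt
# under DIR, followed by the sum of those values.
# Assumption: each -part.txt has a line like "Number_of_rows = N".
# part_rows is a hypothetical helper, not a FastBit tool.
part_rows() {
    find "$1" -type f -name '-part.txt' -exec awk '
        /^[ \t]*Number_of_rows[ \t]*=/ {
            n = $0
            sub(/^[^=]*=[ \t]*/, "", n)   # keep only the value after "="
            print FILENAME ": " n
        }' {} + |
    awk '{ print; total += $NF } END { printf "total: %d\n", total }'
}
```

For example, "part_rows /tmp/fb3" would cover both files grepped above, and
the total can then be compared with what ibis reports for "select count(*)".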