Hi, I am trying to figure out how to incrementally add data to a FastBit index using ardea and ibis. I'm having some weird results and I don't know if I am doing something wrong, misunderstanding how things work, or if there are some bugs that I am hitting.
First, I want to make sure I understand the concept of the FastBit "data directory" correctly. As I understand it, one data directory can contain many partitions. I want to be able to add partitions dynamically. I thought the following would work:

$ mkdir /tmp/test

Load 10M rows:

$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d "/tmp/test"
Will attempt to parse 1 CSV file /tmp/fbdata.txt with the following column names and types c2:int,c3:int
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read 10000000 rows from /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 4.69221 sec(CPU), 4.39101 sec(elapsed)

Load 10M more rows:

$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d "/tmp/test"
Will attempt to parse 1 CSV file /tmp/fbdata.txt with the following column names and types c2:int,c3:int
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read 10000000 rows from /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 6.5238 sec(CPU), 6.53868 sec(elapsed)

Now check the number of rows in the data directory:

$ ./ibis -d /tmp/test -q "select count(*)"

SELECT count(*) FROM T-_01 WHERE 1=1 produced a table with 1 row and 1 column
-- the result table (1 x 1) for "SELECT count(*) FROM T-_01 WHERE 1=1"
30000000

It should be 20000000, not 30000000.
If I pass a different partition name (-n) to ardea each time, it works:

$ rm -rf /tmp/test && mkdir /tmp/test

$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt -n p1

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d "/tmp/test"
Will attempt to parse 1 CSV file /tmp/fbdata.txt with the following column names and types c2:int,c3:int
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read 10000000 rows from /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 4.35042 sec(CPU), 4.13331 sec(elapsed)

$ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt -n p2

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d "/tmp/test"
Will attempt to parse 1 CSV file /tmp/fbdata.txt with the following column names and types c2:int,c3:int
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read 10000000 rows from /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 4.32884 sec(CPU), 4.05803 sec(elapsed)

$ ./ibis -d /tmp/test -q "select count(*)"

SELECT count(*) FROM T-p1 WHERE 1=1 produced a table with 1 row and 1 column
-- the result table (1 x 1) for "SELECT count(*) FROM T-p1 WHERE 1=1"
20000000

I also get weird output when I try to merge two data directories that contain partitions:

$ mkdir /tmp/fb1
$ ./ardea -p 5000000 -d /tmp/fb1 -m c2:int,c3:int -t /tmp/fbdata.txt
...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 4.49258 sec(CPU), 4.23947 sec(elapsed)

$ mkdir /tmp/fb2
$ ./ardea -p 5000000 -d /tmp/fb2 -m c2:int,c3:int -t /tmp/fbdata.txt
...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 4.28029 sec(CPU), 4.05654 sec(elapsed)

$ ./ibis -d /tmp/fb1 -a /tmp/fb2 /tmp/fb1

Sat Feb 21 11:57:05 2015
Warning -- part[fb1]::appendToBackup -- expected to add 10000000 elements of "c2", but actually added 5000000
Sat Feb 21 11:57:06 2015
Warning -- part[fb1]::appendToBackup -- expected to add 10000000 elements of "c3", but actually added 5000000
part[fb1]::append -- committed to use the updated dataset with 20000000 rows and 2 columns
doAppend(/tmp/fb2): added 10000000 rows from /tmp/fb2 to data partition fb1 located in /tmp/fb1

The warnings are what concern me, though I think they are spurious: the row count in the -part.txt file for each partition looks wrong, as this test demonstrates:

$ ./ardea -p 5000000 -d /tmp/fb3 -m c2:int,c3:int -t /tmp/fbdata.txt

/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d "/tmp/fb3"
Will attempt to parse 1 CSV file /tmp/fbdata.txt with the following column names and types c2:int,c3:int
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to read CSV file /tmp/fbdata.txt ...
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read 10000000 rows from /tmp/fbdata.txt
/home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -- duration: 7.26937 sec(CPU), 7.26995 sec(elapsed)

$ grep row /tmp/fb3/-part.txt
Number_of_rows = 10000000

$ grep row /tmp/fb3/_01/-part.txt
Number_of_rows = 10000000
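To make the same check on a whole data directory instead of grepping one file at a time, something like the following could be used. It is only a rough sketch: the function name sum_part_rows is made up for illustration, it assumes every -part.txt contains a line of the form "Number_of_rows = N" as in the grep output shown, and it relies on grep -H, which GNU and BSD grep provide. It only adds up what the metadata files claim; it says nothing about how ibis itself decides which directories to count.

```shell
#!/bin/sh
# sum_part_rows DIR  (hypothetical helper, not part of FastBit)
# Prints the Number_of_rows line of every -part.txt found under DIR,
# followed by the sum of those values.
sum_part_rows() {
    find "$1" -type f -name '-part.txt' -exec grep -H 'Number_of_rows' {} + |
        awk -F'=' '
            { print; total += $NF + 0 }            # value is the text after the last "="
            END { printf "total = %d\n", total }   # prints 0 when no file was found
        '
}
```

Running sum_part_rows /tmp/fb3 would list each metadata file next to its claimed row count. The two files shown for /tmp/fb3 alone already add up to 20000000, although only 10000000 rows were loaded.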
