Hi,

Thanks for the info.

Is there any way to drop a column from disk?

I also noticed this interesting issue with count, and I'm not sure if it is a bug or expected behavior:

SELECT count(this_column_does_not_exist) FROM T-1424547245 WHERE 1=1 produced a table with 1 row and 1 column
-- the result table (1 x 1) for "SELECT count(this_column_does_not_exist) FROM T-1424547245 WHERE 1=1"
80000000

One other thing: I encountered a gotcha with the ardea test program. If you pass a -t parameter and a -d parameter, but you forget a -m or -M parameter, then ardea will write test data into your index. :) I was stumped about where some extra rows came from until I realized what happened.

FYI: I'm working on a set of UDFs (user-defined functions) for MySQL for creating and manipulating FastBit data. MySQL doesn't have pluggable indexes, so it makes sense to use a UDF to access FastBit functionality. I'm almost finished implementing basic functions to create indexes, add data and query indexes. I still need to add a function to mark rows as deleted.

I may look at integrating FastBit as an actual storage engine at some time in the future, but the SE interface is much more complex than the UDF interface, and UDFs meet my current needs.

I'll send you the GitHub link when I'm done, if you like.

--Justin

On Sun, Feb 22, 2015 at 12:43 PM, K. John Wu <[email protected]> wrote:
> Thanks, Justin, for your interest in our work. You are basically on
> the right track. There is one thing that might be useful to clear up.
>
> Each FastBit partition is to be placed in one directory. The program
> 'ibis' could put many of these directories into a virtual "table". An
> easy way of name multiple directories is to simply name the parent
> directory of all of them - so it appears that you are only using one
> directory.
>
> When you create data partitions with ardea, I suggest that you name
> each directory separately.
> So you could have a series of them such as
>
> ardea -p 5000000 -d /tmp/test/a -m c2:int,c3:int -t /tmp/fbdata.txt
> ardea -p 5000000 -d /tmp/test/b -m c2:int,c3:int -t /tmp/fbdata.txt
>
> The option -p tells ardea to create subdirectories in /tmp/test/a when
> necessary.
>
> Hope this help.
>
> John
>
>
> On 2/21/15 11:02 AM, Justin Swanhart wrote:
> > Hi,
> >
> > I am trying to figure out how to incrementally add data to a FastBit
> > index using ardea and ibis. I'm having some weird results and I don't
> > know if I am doing something wrong, misunderstanding how things work,
> > or if there are some bugs that I am hitting.
> >
> > First, I want to make sure I understand the concept of the fastbit
> > "data directory" correctly. As I understand it, one data directory
> > can contain many partitions. I want to be able to add partitions
> > dynamically.
> >
> > I thought the following would work:
> > $ mkdir /tmp/test
> >
> > Load 10M rows:
> > $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> > "/tmp/test"
> > Will attempt to parse 1 CSV file
> > /tmp/fbdata.txt
> > with the following column names and types
> > c2:int,c3:int
> >
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> > read CSV file /tmp/fbdata.txt ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> > 10000000 rows from /tmp/fbdata.txt
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 4.69221 sec(CPU), 4.39101 sec(elapsed)
> >
> > Load 10M more rows:
> > $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> > "/tmp/test"
> > Will attempt to parse 1 CSV file
> > /tmp/fbdata.txt
> > with the following column names and types
> > c2:int,c3:int
> >
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> > read CSV file /tmp/fbdata.txt ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> > 10000000 rows from /tmp/fbdata.txt
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 6.5238 sec(CPU), 6.53868 sec(elapsed)
> >
> > Now check the number of rows in the data directory:
> > $ ./ibis -d /tmp/test -q "select count(*)"
> >
> > SELECT count(*) FROM T-_01 WHERE 1=1 produced a table with 1 row and 1
> > column
> > -- the result table (1 x 1) for "SELECT count(*) FROM T-_01 WHERE 1=1"
> > 30000000
> >
> > It should be 20000000 not 30000000.
> >
> > If I use a different partition name each time to .ardea, it works:
> > $ rm -rf /tmp/test && mkdir /tmp/test
> > $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> > -n p1
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> > "/tmp/test"
> > Will attempt to parse 1 CSV file
> > /tmp/fbdata.txt
> > with the following column names and types
> > c2:int,c3:int
> >
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> > read CSV file /tmp/fbdata.txt ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> > 10000000 rows from /tmp/fbdata.txt
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 4.35042 sec(CPU), 4.13331 sec(elapsed)
> > $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> > -n p2
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> > "/tmp/test"
> > Will attempt to parse 1 CSV file
> > /tmp/fbdata.txt
> > with the following column names and types
> > c2:int,c3:int
> >
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> > read CSV file /tmp/fbdata.txt ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> > 10000000 rows from /tmp/fbdata.txt
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 4.32884 sec(CPU), 4.05803 sec(elapsed)
> >
> > $ ./ibis -d /tmp/test -q "select count(*)"
> >
> > SELECT count(*) FROM T-p1 WHERE 1=1 produced a table with 1 row and 1
> > column
> > -- the result table (1 x 1) for "SELECT count(*) FROM T-p1 WHERE 1=1"
> > 20000000
> >
> > I also get weird output when I try to merge two datadirs with partitions:
> > $ mkdir /tmp/fb1
> > $ ./ardea -p 5000000 -d /tmp/fb1 -m c2:int,c3:int -t /tmp/fbdata.txt
> > ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 4.49258 sec(CPU), 4.23947 sec(elapsed)
> >
> >
> > $ mkdir /tmp/fb2
> > $ ./ardea -p 5000000 -d /tmp/fb2 -m c2:int,c3:int -t /tmp/fbdata.txt
> > ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 4.28029 sec(CPU), 4.05654 sec(elapsed)
> >
> > $ ./ibis -d /tmp/fb1 -a /tmp/fb2 /tmp/fb1
> >
> > Sat Feb 21 11:57:05 2015
> > Warning -- part[fb1]::appendToBackup -- expected to add 10000000
> > elements of "c2", but actually added 5000000
> > Sat Feb 21 11:57:06 2015
> > Warning -- part[fb1]::appendToBackup -- expected to add 10000000
> > elements of "c3", but actually added 5000000
> > part[fb1]::append -- committed to use the updated dataset with
> > 20000000 rows and 2 columns
> > doAppend(/tmp/fb2): added 10000000 rows from /tmp/fb2 to data
> > partition fb1 located in /tmp/fb1
> >
> > It is the warnings that concern me, though I think they are spurious
> > as the rowcount in the -part.txt file for each partition is wrong as
> > demonstrated by this test:
> > [justin@localhost examples]$ ./ardea -p 5000000 -d /tmp/fb3 -m
> > c2:int,c3:int -t /tmp/fbdata.txt
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> > "/tmp/fb3"
> > Will attempt to parse 1 CSV file
> > /tmp/fbdata.txt
> > with the following column names and types
> > c2:int,c3:int
> >
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> > read CSV file /tmp/fbdata.txt ...
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> > 10000000 rows from /tmp/fbdata.txt
> >
> > /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> > duration: 7.26937 sec(CPU), 7.26995 sec(elapsed)
> >
> > $ grep row /tmp/fb3/-part.txt
> > Number_of_rows = 10000000
> >
> > ]$ grep row /tmp/fb3/_01/-part.txt
> > Number_of_rows = 10000000
> > [justin@localhost examples]$
> >
> > _______________________________________________
> > FastBit-users mailing list
> > [email protected]
> > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> >
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>
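
For anyone else who trips over the ardea gotcha mentioned in this thread (-t and -d given, -m or -M forgotten), a small guard function is one way to catch it before anything is written. This is only a sketch: safe_ardea and the ARDEA_BIN variable are hypothetical names, not part of FastBit, and the check only recognizes flags passed as separate words (for example "-d /tmp/test", not "-d/tmp/test").

```shell
#!/bin/sh
# Hypothetical wrapper, not part of FastBit: refuse to call ardea when
# -t and -d are present but neither -m nor -M is.
ARDEA_BIN="${ARDEA_BIN:-./ardea}"   # assumed path to the ardea binary

safe_ardea() {
    has_t=0; has_d=0; has_m=0
    # Scan the arguments for the flags of interest.
    for arg in "$@"; do
        case "$arg" in
            -t)    has_t=1 ;;
            -d)    has_d=1 ;;
            -m|-M) has_m=1 ;;
        esac
    done
    if [ "$has_t" -eq 1 ] && [ "$has_d" -eq 1 ] && [ "$has_m" -eq 0 ]; then
        echo "safe_ardea: -t and -d given without -m or -M; not running" >&2
        return 2
    fi
    # Otherwise pass every argument through unchanged.
    "$ARDEA_BIN" "$@"
}
```

Usage would look like "safe_ardea -d /tmp/test/a -m c2:int,c3:int -t /tmp/fbdata.txt"; dropping the -m part makes the function return 2 without calling ardea at all.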
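
The row-count question in the quoted message (both /tmp/fb3/-part.txt and /tmp/fb3/_01/-part.txt reporting 10000000 rows) can be made easier to see by listing every -part.txt under a data directory next to its Number_of_rows value. The sketch below assumes the "Number_of_rows = N" line format shown in the grep output quoted earlier; part_rows is a hypothetical helper name. Whether ibis actually counts each of these files is not something this script can tell; it only shows what the metadata claims.

```shell
#!/bin/sh
# Hypothetical helper: print the Number_of_rows entry of each -part.txt
# found under the given directory, followed by the sum of all entries.
part_rows() {
    find "$1" -name '-part.txt' -exec awk -F' *= *' '
        /Number_of_rows/ { printf "%s\t%s\n", FILENAME, $2; total += $2 }
        END              { printf "total\t%d\n", total }
    ' {} +
}
```

Running "part_rows /tmp/fb3" prints one line per -part.txt and a final "total" line. A total larger than the number of rows actually loaded would suggest that a parent directory and its subdirectory both describe the same data.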
