Re: [FastBit-users] creating partitions and appending data

K. John Wu Sun, 22 Feb 2015 11:44:21 -0800

Thanks, Justin, for your interest in our work.  You are basically on
the right track.  There is one thing that might be useful to clear up.


Each FastBit partition is to be placed in its own directory.  The program
'ibis' can combine many of these directories into a virtual "table".  An
easy way to name multiple directories is simply to name their common
parent directory, which makes it appear that you are using only one
directory.

When you create data partitions with ardea, I suggest that you name
each directory separately.  So you could have a series of them such as

ardea -p 5000000 -d /tmp/test/a -m c2:int,c3:int -t /tmp/fbdata.txt
ardea -p 5000000 -d /tmp/test/b -m c2:int,c3:int -t /tmp/fbdata.txt

The option -p tells ardea to create subdirectories in /tmp/test/a when
necessary.
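
A minimal sketch of that workflow, for illustration only: the names PARENT, DRY_RUN, run and load_batch are hypothetical, and the ardea and ibis options are just the ones already used in this thread.  By default the script only prints the commands it would run, so it can be tried without FastBit installed.

```shell
#!/bin/sh
# Sketch: load each batch into its own directory, then query the parent.
# PARENT, DRY_RUN, run and load_batch are illustrative names, not part of FastBit.
PARENT=/tmp/test
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command when DRY_RUN=1 (shell quoting is not preserved
    # in the printed form); execute it otherwise.
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

load_batch() {
    # $1 = subdirectory name for this batch, $2 = CSV file to load
    run ardea -p 5000000 -d "$PARENT/$1" -m c2:int,c3:int -t "$2"
}

load_batch a /tmp/fbdata.txt
load_batch b /tmp/fbdata.txt

# Naming only the parent directory lets ibis treat the partitions
# under it as one virtual table.
run ibis -d "$PARENT" -q "select count(*)"
```

Set DRY_RUN=0 to execute the commands instead of printing them; this assumes ardea and ibis are on your PATH.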

Hope this helps.

John


On 2/21/15 11:02 AM, Justin Swanhart wrote:
> Hi,
> 
> I am trying to figure out how to incrementally add data to a FastBit
> index using ardea and ibis.  I'm having some weird results and I don't
> know if I am doing something wrong, misunderstanding how things work,
> or if there are some bugs that I am hitting.
> 
> First, I want to make sure I understand the concept of the fastbit
> "data directory" correctly.  As I understand it, one data directory
> can contain many partitions.  I want to be able to add partitions
> dynamically.
> 
> I thought the following would work:
> $ mkdir /tmp/test
> 
> Load 10M rows:
> $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> "/tmp/test"
>   Will attempt to parse 1 CSV file
> /tmp/fbdata.txt
>  with the following column names and types
> c2:int,c3:int
> 
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> read CSV file /tmp/fbdata.txt ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> 10000000 rows from /tmp/fbdata.txt
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 4.69221 sec(CPU), 4.39101 sec(elapsed)
> 
> Load 10M more rows:
> $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> "/tmp/test"
>   Will attempt to parse 1 CSV file
> /tmp/fbdata.txt
>  with the following column names and types
> c2:int,c3:int
> 
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> read CSV file /tmp/fbdata.txt ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> 10000000 rows from /tmp/fbdata.txt
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 6.5238 sec(CPU), 6.53868 sec(elapsed)
> 
> Now check the number of rows in the data directory:
> $ ./ibis -d /tmp/test -q "select count(*)"
> 
> SELECT count(*) FROM T-_01 WHERE 1=1 produced a table with 1 row and 1
> column
> -- the result table (1 x 1) for "SELECT count(*) FROM T-_01 WHERE 1=1"
> 30000000
> 
> It should be 20000000 not 30000000.  
> 
> If I pass a different partition name to ardea each time, it works:
> $ rm -rf /tmp/test && mkdir /tmp/test
> $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> -n p1
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> "/tmp/test"
>   Will attempt to parse 1 CSV file
> /tmp/fbdata.txt
>  with the following column names and types
> c2:int,c3:int
> 
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> read CSV file /tmp/fbdata.txt ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> 10000000 rows from /tmp/fbdata.txt
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 4.35042 sec(CPU), 4.13331 sec(elapsed)
> $ ./ardea -p 5000000 -d /tmp/test -m c2:int,c3:int -t /tmp/fbdata.txt
> -n p2
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> "/tmp/test"
>   Will attempt to parse 1 CSV file
> /tmp/fbdata.txt
>  with the following column names and types
> c2:int,c3:int
> 
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> read CSV file /tmp/fbdata.txt ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> 10000000 rows from /tmp/fbdata.txt
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 4.32884 sec(CPU), 4.05803 sec(elapsed)
> 
> $ ./ibis -d /tmp/test -q "select count(*)"
> 
> SELECT count(*) FROM T-p1 WHERE 1=1 produced a table with 1 row and 1
> column
> -- the result table (1 x 1) for "SELECT count(*) FROM T-p1 WHERE 1=1"
> 20000000
> 
> I also get weird output when I try to merge two datadirs with partitions:
> $ mkdir /tmp/fb1
> $ ./ardea -p 5000000 -d /tmp/fb1 -m c2:int,c3:int -t /tmp/fbdata.txt
> ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 4.49258 sec(CPU), 4.23947 sec(elapsed)
> 
> 
> $ mkdir /tmp/fb2
> $ ./ardea -p 5000000 -d /tmp/fb2 -m c2:int,c3:int -t /tmp/fbdata.txt
> ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 4.28029 sec(CPU), 4.05654 sec(elapsed)
> 
> $ ./ibis -d /tmp/fb1 -a /tmp/fb2 /tmp/fb1
> 
> Sat Feb 21 11:57:05 2015
> Warning -- part[fb1]::appendToBackup -- expected to add 10000000
> elements of "c2", but actually added 5000000
> Sat Feb 21 11:57:06 2015
> Warning -- part[fb1]::appendToBackup -- expected to add 10000000
> elements of "c3", but actually added 5000000
> part[fb1]::append -- committed to use the updated dataset with
> 20000000 rows and 2 columns
> doAppend(/tmp/fb2): added 10000000 rows from /tmp/fb2 to data
> partition fb1 located in /tmp/fb1
> 
> It is the warnings that concern me, though I think they are spurious,
> because the row count in the -part.txt file for each partition is wrong,
> as demonstrated by this test:
> [justin@localhost examples]$ ./ardea -p 5000000 -d /tmp/fb3 -m
> c2:int,c3:int -t /tmp/fbdata.txt
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea -v 0 -d
> "/tmp/fb3"
>   Will attempt to parse 1 CSV file
> /tmp/fbdata.txt
>  with the following column names and types
> c2:int,c3:int
> 
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea is to
> read CSV file /tmp/fbdata.txt ...
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea read
> 10000000 rows from /tmp/fbdata.txt
> 
> /home/justin/FastBit_UDF/fastbit-2.0.1/examples/.libs/lt-ardea --
> duration: 7.26937 sec(CPU), 7.26995 sec(elapsed)
> 
> $ grep row /tmp/fb3/-part.txt
> Number_of_rows = 10000000
> 
> $ grep row /tmp/fb3/_01/-part.txt
> Number_of_rows = 10000000
> [justin@localhost examples]$ 
> 
> 
> 
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
