Re: [FastBit-users] metaTag dictionary issue

Olson, Andrew Thu, 13 Sep 2012 05:22:32 -0700

Hi John,
I tried thula with -M to merge the dictionaries.  This speeds up the 
allele_string,count(*) query to 1.65s, which is as great!  The FBchr, count(*) 
query doesn't return any results unless you run with -v (but it still takes 20s 
that way).  Here are the outputs:


$ time thula -d . -s "FBchr,count(*)" -w "1=1" 
doQuery(1=1) evaluated on T-1 produced 0 hit out of 59161937 records
-- begin printing the result table --
Table CAiNa1 (filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions 
WHERE 1 == ...)) contains 0 row and no column
-- end printing --


real    0m0.074s
user    0m0.038s
sys     0m0.035s


$ time thula -d . -s "FBchr,count(*)" -w "1=1" -v

Constructed a part named 1
Constructed a part named 10
Constructed a part named 2
Constructed a part named 3
Constructed a part named 4
Constructed a part named 5
Constructed a part named 6
Constructed a part named 7
Constructed a part named 8
Constructed a part named 9
mensa::addPartition(.) increases the number partitions from 0 to 10, the number 
of rows from 0 to 59161937, and the number of columns from 0 to 9
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 1
countQuery::evaluate -- Select count(*) From 1 Where 1 == 1 --> 8960156
countQuery::evaluate -- duration: 4.6e-05 sec(CPU), 4.60148e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 10
countQuery::evaluate -- Select count(*) From 10 Where 1 == 1 --> 4332964
countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.50611e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 2
countQuery::evaluate -- Select count(*) From 2 Where 1 == 1 --> 6828398
countQuery::evaluate -- duration: 4.5e-05 sec(CPU), 4.60148e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 3
countQuery::evaluate -- Select count(*) From 3 Where 1 == 1 --> 6687301
countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.41074e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 4
countQuery::evaluate -- Select count(*) From 4 Where 1 == 1 --> 7064960
countQuery::evaluate -- duration: 5.3e-05 sec(CPU), 5.38826e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 5
countQuery::evaluate -- Select count(*) From 5 Where 1 == 1 --> 6081271
countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.50611e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 6
countQuery::evaluate -- Select count(*) From 6 Where 1 == 1 --> 4712491
countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.31538e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 7
countQuery::evaluate -- Select count(*) From 7 Where 1 == 1 --> 5030796
countQuery::evaluate -- duration: 4.5e-05 sec(CPU), 4.60148e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 8
countQuery::evaluate -- Select count(*) From 8 Where 1 == 1 --> 4944862
countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.41074e-05 sec(elapsed)
filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- 
processing data partition 9
countQuery::evaluate -- Select count(*) From 9 Where 1 == 1 --> 4518738
countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.3869e-05 sec(elapsed)
doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records
-- begin printing the result table --
Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY 
FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows
FBchr   CATEGORY
_1      UINT
"1", 8960156
"10", 4332964
"2", 6828398
"3", 6687301
"4", 7064960
"5", 6081271
"6", 4712491
"7", 5030796
"8", 4944862
"9", 4518738
-- end printing --

doQuery(1=1) -- duration: 20.7933 sec(CPU), 20.7929 sec(elapsed)
/Users/olson/src/snapdragon/src/fastbit/examples/.libs/thula -- duration: 
20.7933 sec(CPU), 20.7929 sec(elapsed)

real    0m20.842s
user    0m19.444s
sys     0m1.399s


On Sep 12, 2012, at 11:31 PM, K. John Wu wrote:

> Hi, Andrew,
> 
> The command-line tool thula has an option -M (or -m) to merge the
> dictionaries of categories from different data partitions.  If you use
> the -M option without any argument, it will attempt to merge the
> dictionaries of all categorical columns of the same names.  Do
> something like
> 
> ../path/to/thula -d top-level-directory -m
> 
> 
> With the same dictionary, it would be possible to use the integer
> values instead of strings for the group-by operations, which should
> make things go a little faster.
> 
> Hope this helps.
> 
> Feel free to let us know if there is any problems with the option -M.
> 
> John
> 
> 
> On 9/12/12 1:51 PM, Olson, Andrew wrote:
>> Hi John,
>> The example query works now, but it's surprisingly slow.  I've pasted the 
>> output of two queries below.  The first takes 21 seconds to "SELECT 
>> FBchr,count(*)".  The second takes 6 seconds to "SELECT 
>> allele_string,count(*)".  The main difference is that FBchr is derived from 
>> a metaTag.  I would expect that the second query should serve as an upper 
>> bound for the first query because allele_string is randomly distributed, and 
>> we know that FBchr can only take on one value in each partition.
>> Andrew
>> 
>> $ time thula -d . -s "FBchr,count(*)" -w "1=1"
>> doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records
>> -- begin printing the result table --
>> Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY 
>> FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows
>> FBchr        CATEGORY
>> _1   UINT
>> "1", 8960156
>> "10", 4332964
>> "2", 6828398
>> "3", 6687301
>> "4", 7064960
>> "5", 6081271
>> "6", 4712491
>> "7", 5030796
>> "8", 4944862
>> "9", 4518738
>> -- end printing --
>> 
>> 
>> real 0m20.884s
>> user 0m19.481s
>> sys  0m1.399s
>> 
>> 
>> $ time thula -d . -s "allele_string,count(*)" -w "1=1"
>> doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records
>> -- begin printing the result table --
>> Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv 
>> (GROUP BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns 
>> and 14 rows
>> allele_string        UINT (dictionary size: 14)
>> _1   UINT
>> "+/-", 1924735
>> "-/+", 1995164
>> "A/C", 1637833
>> "A/G", 5819392
>> "A/T", 2356856
>> "C/A", 3451587
>> "C/G", 1746481
>> "C/T", 12526766
>> "G/A", 13205622
>> "G/C", 1660713
>> -- end printing --
>> 
>> 
>> real 0m6.137s
>> user 0m1.566s
>> sys  0m0.404s
>> 
>> 
>> On Sep 11, 2012, at 11:55 PM, K. John Wu wrote:
>> 
>>> Hi, Andrew,
>>> 
>>> There was a couple of little problems that together cause the thing
>>> you've observed.  They should be fixed in SVN Revision 566.  Please
>>> give it a try and let us know how it works for you.
>>> 
>>> Thanks for reporting the issue.
>>> 
>>> John
>>> 
>>> 
>>> 
>>> On 9/10/12 7:16 PM, Olson, Andrew wrote:
>>>> Hi John,
>>>> I have been storing data in multiple partitions, using metaTags to 
>>>> identify the partitioning.  For example, this query fails because multiple 
>>>> partitions have matches:
>>>> $ thula -d . -s "FBchr,count(*)" -w "1=1"
>>>> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records
>>>> -- begin printing the result table --
>>>> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY 
>>>> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row
>>>> FBchr      UINT (dictionary size: 0)
>>>> _1 UINT
>>>> 1, 331230
>>>> -- end printing --
>>>> 
>>>> And this one works because the matches are all in one partition
>>>> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'"
>>>> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records
>>>> -- begin printing the result table --
>>>> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY 
>>>> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row
>>>> FBchr      UINT (dictionary size: 0)
>>>> _1 UINT
>>>> 1, 51976
>>>> -- end printing --
>>>> 
>>>> I'm not sure if this happens with normal CATEGORY columns when the 
>>>> dictionaries differ between partitions.
>>>> 
>>>> It seems like a bug in some output functions that are using one dictionary 
>>>> for a whole column, rather than partition specific dictionaries.  Would it 
>>>> be useful to have a command line tool that merges dictionaries and updates 
>>>> .int and .idx files across a set of partitions?  This could also remove 
>>>> unused entries from each merged dictionary that don't appear in any of the 
>>>> partitions.
>>>> Andrew
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> 
>> 

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] metaTag dictionary issue

Reply via email to