Hi John, I tried thula with -M to merge the dictionaries. This speeds up the allele_string,count(*) query to 1.65s, which is as great! The FBchr, count(*) query doesn't return any results unless you run with -v (but it still takes 20s that way). Here are the outputs:
$ time thula -d . -s "FBchr,count(*)" -w "1=1" doQuery(1=1) evaluated on T-1 produced 0 hit out of 59161937 records -- begin printing the result table -- Table CAiNa1 (filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...)) contains 0 row and no column -- end printing -- real 0m0.074s user 0m0.038s sys 0m0.035s $ time thula -d . -s "FBchr,count(*)" -w "1=1" -v Constructed a part named 1 Constructed a part named 10 Constructed a part named 2 Constructed a part named 3 Constructed a part named 4 Constructed a part named 5 Constructed a part named 6 Constructed a part named 7 Constructed a part named 8 Constructed a part named 9 mensa::addPartition(.) increases the number partitions from 0 to 10, the number of rows from 0 to 59161937, and the number of columns from 0 to 9 filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 1 countQuery::evaluate -- Select count(*) From 1 Where 1 == 1 --> 8960156 countQuery::evaluate -- duration: 4.6e-05 sec(CPU), 4.60148e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 10 countQuery::evaluate -- Select count(*) From 10 Where 1 == 1 --> 4332964 countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.50611e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 2 countQuery::evaluate -- Select count(*) From 2 Where 1 == 1 --> 6828398 countQuery::evaluate -- duration: 4.5e-05 sec(CPU), 4.60148e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 3 countQuery::evaluate -- Select count(*) From 3 Where 1 == 1 --> 6687301 countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.41074e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 4 countQuery::evaluate -- Select count(*) From 4 Where 1 == 1 --> 7064960 countQuery::evaluate -- duration: 5.3e-05 sec(CPU), 5.38826e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 5 countQuery::evaluate -- Select count(*) From 5 Where 1 == 1 --> 6081271 countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.50611e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 6 countQuery::evaluate -- Select count(*) From 6 Where 1 == 1 --> 4712491 countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.31538e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 7 countQuery::evaluate -- Select count(*) From 7 Where 1 == 1 --> 5030796 countQuery::evaluate -- duration: 4.5e-05 sec(CPU), 4.60148e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 8 countQuery::evaluate -- Select count(*) From 8 Where 1 == 1 --> 4944862 countQuery::evaluate -- duration: 4.4e-05 sec(CPU), 4.41074e-05 sec(elapsed) filter::sift2S(SELECT FBchr,count(*) FROM 10 data partitions WHERE 1 == ...) -- processing data partition 9 countQuery::evaluate -- Select count(*) From 9 Where 1 == 1 --> 4518738 countQuery::evaluate -- duration: 4.3e-05 sec(CPU), 4.3869e-05 sec(elapsed) doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records -- begin printing the result table -- Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows FBchr CATEGORY _1 UINT "1", 8960156 "10", 4332964 "2", 6828398 "3", 6687301 "4", 7064960 "5", 6081271 "6", 4712491 "7", 5030796 "8", 4944862 "9", 4518738 -- end printing -- doQuery(1=1) -- duration: 20.7933 sec(CPU), 20.7929 sec(elapsed) /Users/olson/src/snapdragon/src/fastbit/examples/.libs/thula -- duration: 20.7933 sec(CPU), 20.7929 sec(elapsed) real 0m20.842s user 0m19.444s sys 0m1.399s On Sep 12, 2012, at 11:31 PM, K. John Wu wrote: > Hi, Andrew, > > The command-line tool thula has an option -M (or -m) to merge the > dictionaries of categories from different data partitions. If you use > the -M option without any argument, it will attempt to merge the > dictionaries of all categorical columns of the same names. Do > something like > > ../path/to/thula -d top-level-directory -m > > > With the same dictionary, it would be possible to use the integer > values instead of strings for the group-by operations, which should > make things go a little faster. > > Hope this helps. > > Feel free to let us know if there is any problems with the option -M. > > John > > > On 9/12/12 1:51 PM, Olson, Andrew wrote: >> Hi John, >> The example query works now, but it's surprisingly slow. I've pasted the >> output of two queries below. The first takes 21 seconds to "SELECT >> FBchr,count(*)". The second takes 6 seconds to "SELECT >> allele_string,count(*)". The main difference is that FBchr is derived from >> a metaTag. I would expect that the second query should serve as an upper >> bound for the first query because allele_string is randomly distributed, and >> we know that FBchr can only take on one value in each partition. >> Andrew >> >> $ time thula -d . -s "FBchr,count(*)" -w "1=1" >> doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records >> -- begin printing the result table -- >> Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY >> FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows >> FBchr CATEGORY >> _1 UINT >> "1", 8960156 >> "10", 4332964 >> "2", 6828398 >> "3", 6687301 >> "4", 7064960 >> "5", 6081271 >> "6", 4712491 >> "7", 5030796 >> "8", 4944862 >> "9", 4518738 >> -- end printing -- >> >> >> real 0m20.884s >> user 0m19.481s >> sys 0m1.399s >> >> >> $ time thula -d . -s "allele_string,count(*)" -w "1=1" >> doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records >> -- begin printing the result table -- >> Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv >> (GROUP BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns >> and 14 rows >> allele_string UINT (dictionary size: 14) >> _1 UINT >> "+/-", 1924735 >> "-/+", 1995164 >> "A/C", 1637833 >> "A/G", 5819392 >> "A/T", 2356856 >> "C/A", 3451587 >> "C/G", 1746481 >> "C/T", 12526766 >> "G/A", 13205622 >> "G/C", 1660713 >> -- end printing -- >> >> >> real 0m6.137s >> user 0m1.566s >> sys 0m0.404s >> >> >> On Sep 11, 2012, at 11:55 PM, K. John Wu wrote: >> >>> Hi, Andrew, >>> >>> There was a couple of little problems that together cause the thing >>> you've observed. They should be fixed in SVN Revision 566. Please >>> give it a try and let us know how it works for you. >>> >>> Thanks for reporting the issue. >>> >>> John >>> >>> >>> >>> On 9/10/12 7:16 PM, Olson, Andrew wrote: >>>> Hi John, >>>> I have been storing data in multiple partitions, using metaTags to >>>> identify the partitioning. For example, this query fails because multiple >>>> partitions have matches: >>>> $ thula -d . -s "FBchr,count(*)" -w "1=1" >>>> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records >>>> -- begin printing the result table -- >>>> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY >>>> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row >>>> FBchr UINT (dictionary size: 0) >>>> _1 UINT >>>> 1, 331230 >>>> -- end printing -- >>>> >>>> And this one works because the matches are all in one partition >>>> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'" >>>> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records >>>> -- begin printing the result table -- >>>> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY >>>> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row >>>> FBchr UINT (dictionary size: 0) >>>> _1 UINT >>>> 1, 51976 >>>> -- end printing -- >>>> >>>> I'm not sure if this happens with normal CATEGORY columns when the >>>> dictionaries differ between partitions. >>>> >>>> It seems like a bug in some output functions that are using one dictionary >>>> for a whole column, rather than partition specific dictionaries. Would it >>>> be useful to have a command line tool that merges dictionaries and updates >>>> .int and .idx files across a set of partitions? This could also remove >>>> unused entries from each merged dictionary that don't appear in any of the >>>> partitions. >>>> Andrew >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>> >> _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
