Hi John, The example query works now, but it's surprisingly slow. I've pasted the output of two queries below. The first takes 21 seconds to "SELECT FBchr,count(*)". The second takes 6 seconds to "SELECT allele_string,count(*)". The main difference is that FBchr is derived from a metaTag. I would expect that the second query should serve as an upper bound for the first query because allele_string is randomly distributed, and we know that FBchr can only take on one value in each partition. Andrew
$ time thula -d . -s "FBchr,count(*)" -w "1=1" doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records -- begin printing the result table -- Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows FBchr CATEGORY _1 UINT "1", 8960156 "10", 4332964 "2", 6828398 "3", 6687301 "4", 7064960 "5", 6081271 "6", 4712491 "7", 5030796 "8", 4944862 "9", 4518738 -- end printing -- real 0m20.884s user 0m19.481s sys 0m1.399s $ time thula -d . -s "allele_string,count(*)" -w "1=1" doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records -- begin printing the result table -- Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv (GROUP BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns and 14 rows allele_string UINT (dictionary size: 14) _1 UINT "+/-", 1924735 "-/+", 1995164 "A/C", 1637833 "A/G", 5819392 "A/T", 2356856 "C/A", 3451587 "C/G", 1746481 "C/T", 12526766 "G/A", 13205622 "G/C", 1660713 -- end printing -- real 0m6.137s user 0m1.566s sys 0m0.404s On Sep 11, 2012, at 11:55 PM, K. John Wu wrote: > Hi, Andrew, > > There was a couple of little problems that together cause the thing > you've observed. They should be fixed in SVN Revision 566. Please > give it a try and let us know how it works for you. > > Thanks for reporting the issue. > > John > > > > On 9/10/12 7:16 PM, Olson, Andrew wrote: >> Hi John, >> I have been storing data in multiple partitions, using metaTags to identify >> the partitioning. For example, this query fails because multiple partitions >> have matches: >> $ thula -d . -s "FBchr,count(*)" -w "1=1" >> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records >> -- begin printing the result table -- >> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY >> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row >> FBchr UINT (dictionary size: 0) >> _1 UINT >> 1, 331230 >> -- end printing -- >> >> And this one works because the matches are all in one partition >> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'" >> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records >> -- begin printing the result table -- >> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY >> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row >> FBchr UINT (dictionary size: 0) >> _1 UINT >> 1, 51976 >> -- end printing -- >> >> I'm not sure if this happens with normal CATEGORY columns when the >> dictionaries differ between partitions. >> >> It seems like a bug in some output functions that are using one dictionary >> for a whole column, rather than partition specific dictionaries. Would it >> be useful to have a command line tool that merges dictionaries and updates >> .int and .idx files across a set of partitions? This could also remove >> unused entries from each merged dictionary that don't appear in any of the >> partitions. >> Andrew >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
