Hi, Andrew, The command-line tool thula has an option -M (or -m) to merge the dictionaries of categories from different data partitions. If you use the -M option without any argument, it will attempt to merge the dictionaries of all categorical columns of the same names. Do something like
../path/to/thula -d top-level-directory -m With the same dictionary, it would be possible to use the integer values instead of strings for the group-by operations, which should make things go a little faster. Hope this helps. Feel free to let us know if there is any problems with the option -M. John On 9/12/12 1:51 PM, Olson, Andrew wrote: > Hi John, > The example query works now, but it's surprisingly slow. I've pasted the > output of two queries below. The first takes 21 seconds to "SELECT > FBchr,count(*)". The second takes 6 seconds to "SELECT > allele_string,count(*)". The main difference is that FBchr is derived from a > metaTag. I would expect that the second query should serve as an upper bound > for the first query because allele_string is randomly distributed, and we > know that FBchr can only take on one value in each partition. > Andrew > > $ time thula -d . -s "FBchr,count(*)" -w "1=1" > doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records > -- begin printing the result table -- > Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY > FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows > FBchr CATEGORY > _1 UINT > "1", 8960156 > "10", 4332964 > "2", 6828398 > "3", 6687301 > "4", 7064960 > "5", 6081271 > "6", 4712491 > "7", 5030796 > "8", 4944862 > "9", 4518738 > -- end printing -- > > > real 0m20.884s > user 0m19.481s > sys 0m1.399s > > > $ time thula -d . -s "allele_string,count(*)" -w "1=1" > doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records > -- begin printing the result table -- > Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv > (GROUP BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns > and 14 rows > allele_string UINT (dictionary size: 14) > _1 UINT > "+/-", 1924735 > "-/+", 1995164 > "A/C", 1637833 > "A/G", 5819392 > "A/T", 2356856 > "C/A", 3451587 > "C/G", 1746481 > "C/T", 12526766 > "G/A", 13205622 > "G/C", 1660713 > -- end printing -- > > > real 0m6.137s > user 0m1.566s > sys 0m0.404s > > > On Sep 11, 2012, at 11:55 PM, K. John Wu wrote: > >> Hi, Andrew, >> >> There was a couple of little problems that together cause the thing >> you've observed. They should be fixed in SVN Revision 566. Please >> give it a try and let us know how it works for you. >> >> Thanks for reporting the issue. >> >> John >> >> >> >> On 9/10/12 7:16 PM, Olson, Andrew wrote: >>> Hi John, >>> I have been storing data in multiple partitions, using metaTags to identify >>> the partitioning. For example, this query fails because multiple >>> partitions have matches: >>> $ thula -d . -s "FBchr,count(*)" -w "1=1" >>> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records >>> -- begin printing the result table -- >>> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY >>> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row >>> FBchr UINT (dictionary size: 0) >>> _1 UINT >>> 1, 331230 >>> -- end printing -- >>> >>> And this one works because the matches are all in one partition >>> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'" >>> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records >>> -- begin printing the result table -- >>> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY >>> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row >>> FBchr UINT (dictionary size: 0) >>> _1 UINT >>> 1, 51976 >>> -- end printing -- >>> >>> I'm not sure if this happens with normal CATEGORY columns when the >>> dictionaries differ between partitions. >>> >>> It seems like a bug in some output functions that are using one dictionary >>> for a whole column, rather than partition specific dictionaries. Would it >>> be useful to have a command line tool that merges dictionaries and updates >>> .int and .idx files across a set of partitions? This could also remove >>> unused entries from each merged dictionary that don't appear in any of the >>> partitions. >>> Andrew >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> > _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
