Hi John,
The example query works now, but it's surprisingly slow.  I've pasted the 
output of two queries below.  The first takes 21 seconds to "SELECT 
FBchr,count(*)".  The second takes 6 seconds to "SELECT 
allele_string,count(*)".  The main difference is that FBchr is derived from a 
metaTag.  I would expect that the second query should serve as an upper bound 
for the first query because allele_string is randomly distributed, and we know 
that FBchr can only take on one value in each partition.
Andrew

$ time thula -d . -s "FBchr,count(*)" -w "1=1"
doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records
-- begin printing the result table --
Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY 
FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows
FBchr   CATEGORY
_1      UINT
"1", 8960156
"10", 4332964
"2", 6828398
"3", 6687301
"4", 7064960
"5", 6081271
"6", 4712491
"7", 5030796
"8", 4944862
"9", 4518738
-- end printing --


real    0m20.884s
user    0m19.481s
sys     0m1.399s


$ time thula -d . -s "allele_string,count(*)" -w "1=1"
doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records
-- begin printing the result table --
Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv (GROUP 
BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns and 14 rows
allele_string   UINT (dictionary size: 14)
_1      UINT
"+/-", 1924735
"-/+", 1995164
"A/C", 1637833
"A/G", 5819392
"A/T", 2356856
"C/A", 3451587
"C/G", 1746481
"C/T", 12526766
"G/A", 13205622
"G/C", 1660713
-- end printing --


real    0m6.137s
user    0m1.566s
sys     0m0.404s


On Sep 11, 2012, at 11:55 PM, K. John Wu wrote:

> Hi, Andrew,
> 
> There was a couple of little problems that together cause the thing
> you've observed.  They should be fixed in SVN Revision 566.  Please
> give it a try and let us know how it works for you.
> 
> Thanks for reporting the issue.
> 
> John
> 
> 
> 
> On 9/10/12 7:16 PM, Olson, Andrew wrote:
>> Hi John,
>> I have been storing data in multiple partitions, using metaTags to identify 
>> the partitioning.  For example, this query fails because multiple partitions 
>> have matches:
>> $ thula -d . -s "FBchr,count(*)" -w "1=1"
>> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records
>> -- begin printing the result table --
>> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY 
>> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row
>> FBchr        UINT (dictionary size: 0)
>> _1   UINT
>> 1, 331230
>> -- end printing --
>> 
>> And this one works because the matches are all in one partition
>> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'"
>> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records
>> -- begin printing the result table --
>> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY 
>> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row
>> FBchr        UINT (dictionary size: 0)
>> _1   UINT
>> 1, 51976
>> -- end printing --
>> 
>> I'm not sure if this happens with normal CATEGORY columns when the 
>> dictionaries differ between partitions.
>> 
>> It seems like a bug in some output functions that are using one dictionary 
>> for a whole column, rather than partition specific dictionaries.  Would it 
>> be useful to have a command line tool that merges dictionaries and updates 
>> .int and .idx files across a set of partitions?  This could also remove 
>> unused entries from each merged dictionary that don't appear in any of the 
>> partitions.
>> Andrew
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> 

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to