Re: [FastBit-users] metaTag dictionary issue

K. John Wu Wed, 12 Sep 2012 20:32:14 -0700

Hi, Andrew,

The command-line tool thula has an option -M (or -m) to merge the
dictionaries of categories from different data partitions.  If you use
the -M option without any argument, it will attempt to merge the
dictionaries of all categorical columns of the same names.  Do
something like


../path/to/thula -d top-level-directory -m


With the same dictionary, it would be possible to use the integer
values instead of strings for the group-by operations, which should
make things go a little faster.

Hope this helps.

Feel free to let us know if there is any problems with the option -M.

John


On 9/12/12 1:51 PM, Olson, Andrew wrote:
> Hi John,
> The example query works now, but it's surprisingly slow.  I've pasted the 
> output of two queries below.  The first takes 21 seconds to "SELECT 
> FBchr,count(*)".  The second takes 6 seconds to "SELECT 
> allele_string,count(*)".  The main difference is that FBchr is derived from a 
> metaTag.  I would expect that the second query should serve as an upper bound 
> for the first query because allele_string is randomly distributed, and we 
> know that FBchr can only take on one value in each partition.
> Andrew
> 
> $ time thula -d . -s "FBchr,count(*)" -w "1=1"
> doQuery(1=1) evaluated on T-1 produced 10 hits out of 59161937 records
> -- begin printing the result table --
> Table (in memory) _tiwE (GROUP BY FBchr,count(*) on table RTaA12 (GROUP BY 
> FBchr, COUNT(*) on table CAiNa1)) contsists of 2 columns and 10 rows
> FBchr CATEGORY
> _1    UINT
> "1", 8960156
> "10", 4332964
> "2", 6828398
> "3", 6687301
> "4", 7064960
> "5", 6081271
> "6", 4712491
> "7", 5030796
> "8", 4944862
> "9", 4518738
> -- end printing --
> 
> 
> real  0m20.884s
> user  0m19.481s
> sys   0m1.399s
> 
> 
> $ time thula -d . -s "allele_string,count(*)" -w "1=1"
> doQuery(1=1) evaluated on T-1 produced 14 hits out of 59161937 records
> -- begin printing the result table --
> Table (in memory) ducX1 (GROUP BY allele_string,count(*) on table h7dvv 
> (GROUP BY allele_string, COUNT(*) on table lUPgU2)) contsists of 2 columns 
> and 14 rows
> allele_string UINT (dictionary size: 14)
> _1    UINT
> "+/-", 1924735
> "-/+", 1995164
> "A/C", 1637833
> "A/G", 5819392
> "A/T", 2356856
> "C/A", 3451587
> "C/G", 1746481
> "C/T", 12526766
> "G/A", 13205622
> "G/C", 1660713
> -- end printing --
> 
> 
> real  0m6.137s
> user  0m1.566s
> sys   0m0.404s
> 
> 
> On Sep 11, 2012, at 11:55 PM, K. John Wu wrote:
> 
>> Hi, Andrew,
>>
>> There was a couple of little problems that together cause the thing
>> you've observed.  They should be fixed in SVN Revision 566.  Please
>> give it a try and let us know how it works for you.
>>
>> Thanks for reporting the issue.
>>
>> John
>>
>>
>>
>> On 9/10/12 7:16 PM, Olson, Andrew wrote:
>>> Hi John,
>>> I have been storing data in multiple partitions, using metaTags to identify 
>>> the partitioning.  For example, this query fails because multiple 
>>> partitions have matches:
>>> $ thula -d . -s "FBchr,count(*)" -w "1=1"
>>> doQuery(1=1) evaluated on T-1 produced 1 hit out of 331230 records
>>> -- begin printing the result table --
>>> Table (in memory) _8PVC (GROUP BY FBchr,count(*) on table SF5D42 (GROUP BY 
>>> FBchr, COUNT(*) on table OAiQa1)) contsists of 2 columns and 1 row
>>> FBchr       UINT (dictionary size: 0)
>>> _1  UINT
>>> 1, 331230
>>> -- end printing --
>>>
>>> And this one works because the matches are all in one partition
>>> $ thula -d . -s "FBchr,count(*)" -w "FBchr='1'"
>>> doQuery(FBchr='1') evaluated on T-1 produced 1 hit out of 331230 records
>>> -- begin printing the result table --
>>> Table (in memory) UIpJq2 (GROUP BY FBchr,count(*) on table o0Lu8 (GROUP BY 
>>> FBchr, COUNT(*) on table _qULt2)) contsists of 2 columns and 1 row
>>> FBchr       UINT (dictionary size: 0)
>>> _1  UINT
>>> 1, 51976
>>> -- end printing --
>>>
>>> I'm not sure if this happens with normal CATEGORY columns when the 
>>> dictionaries differ between partitions.
>>>
>>> It seems like a bug in some output functions that are using one dictionary 
>>> for a whole column, rather than partition specific dictionaries.  Would it 
>>> be useful to have a command line tool that merges dictionaries and updates 
>>> .int and .idx files across a set of partitions?  This could also remove 
>>> unused entries from each merged dictionary that don't appear in any of the 
>>> partitions.
>>> Andrew
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>
> 
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] metaTag dictionary issue

Reply via email to