Hi John,
Thanks for the explanation.  I would like to be able to find partial matches as 
well as perfect matches.
For example:
SELECT a,b,c,t WHERE t like "%ABC123%" AND 1000<a<2000 ...

I realize that this is not supported currently, so I was thinking of doing this:
1. add another column (t_pos) to the table holding the start positions of t.
2. create a suffix array based on these positions
3. use the suffix array to find matching positions (fill an array *pos)
4. query the table/partition using ibis::qDiscreteRange dr("t_pos", *pos);
5. apply other conditions to results.

Do you think this sounds like a reasonable approach?  If so, do you think it is 
a sensible way to add support for this kind of query on 'text' or 'key' columns?

Andrew

On Mar 31, 2010, at 1:57 PM, K. John Wu wrote:

> Hi, Andrew,
> 
> Here is a quick explanation of what might be going on
> 
> - when the string-valued column is treated as 'text', the current code 
> simply performs string comparisons on each row to resolve the 
> condition involved.  Guess in this case it took 3 seconds.
> 
> - when the string-valued column is treated as 'key', a dictionary is 
> built along with an index file (hence the .dic and .idx files). 
> However, it has a pretty dumb ways of handling the dictionary, which 
> is currently very slow.  The 20 seconds time you've observed should be 
> mostly to reconstruct this dictionary, which only happens once for a 
> data partition.  We definitely need to figure out a better 
> implementation here.  While we are figuring out how to implement the 
> dictionary better, one thing you can do it to avoid running a single 
> query at a time.  Alternatively, you might consider converting the 
> strings into integers yourself.
> 
> - The .tdlist file is for a different kind of searches.  The 
> short-hand tdlist is for term-document list following the terminology 
> of web search.  This is meant for finding keywords in long strings 
> such as web pages (where a whole web page as a single string).
> 
> John
> 
> 
> On 3/30/2010 10:47 PM, Andrew Olson wrote:
>> I have a 68 million row table with 5 integer columns and 1 column
>> of text labels.  What is the best way to index the text column?
>> Cardinality is high (9.7 million distinct values).  The default
>> treatment creates a .sp file only and takes over 3 seconds to run
>> (integer columns are<  .01 seconds).  Switching to datatype=
>> keyword creates .dic and .idx files but takes over 20 seconds.
>> Should I be using a .tdlist file?
>> 
>> my queries are of the form "select a,b,c where t='asdf123'"
>> 
>> 
>> Andrew
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to