Hi John,
Thanks for the explanation. I would like to be able to find partial matches as
well as perfect matches.
For example:
SELECT a,b,c,t WHERE t like "%ABC123%" AND 1000<a<2000 ...
I realize that this is not supported currently, so I was thinking of doing this:
1. add another column (t_pos) to the table holding the start positions of t.
2. create a suffix array based on these positions
3. use the suffix array to find matching positions (fill an array *pos)
4. query the table/partition using ibis::qDiscreteRange dr("t_pos", *pos);
5. apply other conditions to results.
Do you think this sounds like a reasonable approach? If so, do you think it is
a sensible way to add support for this kind of query on 'text' or 'key' columns?
Andrew
On Mar 31, 2010, at 1:57 PM, K. John Wu wrote:
> Hi, Andrew,
>
> Here is a quick explanation of what might be going on
>
> - when the string-valued column is treated as 'text', the current code
> simply performs string comparisons on each row to resolve the
> condition involved. Guess in this case it took 3 seconds.
>
> - when the string-valued column is treated as 'key', a dictionary is
> built along with an index file (hence the .dic and .idx files).
> However, it has a pretty dumb ways of handling the dictionary, which
> is currently very slow. The 20 seconds time you've observed should be
> mostly to reconstruct this dictionary, which only happens once for a
> data partition. We definitely need to figure out a better
> implementation here. While we are figuring out how to implement the
> dictionary better, one thing you can do it to avoid running a single
> query at a time. Alternatively, you might consider converting the
> strings into integers yourself.
>
> - The .tdlist file is for a different kind of searches. The
> short-hand tdlist is for term-document list following the terminology
> of web search. This is meant for finding keywords in long strings
> such as web pages (where a whole web page as a single string).
>
> John
>
>
> On 3/30/2010 10:47 PM, Andrew Olson wrote:
>> I have a 68 million row table with 5 integer columns and 1 column
>> of text labels. What is the best way to index the text column?
>> Cardinality is high (9.7 million distinct values). The default
>> treatment creates a .sp file only and takes over 3 seconds to run
>> (integer columns are< .01 seconds). Switching to datatype=
>> keyword creates .dic and .idx files but takes over 20 seconds.
>> Should I be using a .tdlist file?
>>
>> my queries are of the form "select a,b,c where t='asdf123'"
>>
>>
>> Andrew
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users