Re: [FastBit-users] slow text searching?

K. John Wu Thu, 01 Apr 2010 15:41:49 -0700

Hi, Andrew,

Forgot to mention that the like operator is supported by FastBit right 
now.  The implementation is not fast, but you can give it a try and 
let us know how it works for you.


John


On 3/31/2010 11:32 AM, Andrew Olson wrote:
> Hi John,
> Thanks for the explanation.  I would like to be able to find partial matches 
> as well as perfect matches.
> For example:
> SELECT a,b,c,t WHERE t like "%ABC123%" AND 1000<a<2000 ...
>
> I realize that this is not supported currently, so I was thinking of doing 
> this:
> 1. add another column (t_pos) to the table holding the start positions of t.
> 2. create a suffix array based on these positions
> 3. use the suffix array to find matching positions (fill an array *pos)
> 4. query the table/partition using ibis::qDiscreteRange dr("t_pos", *pos);
> 5. apply other conditions to results.
>
> Do you think this sounds like a reasonable approach?  If so, do you think it 
> is a sensible way to add support for this kind of query on 'text' or 'key' 
> columns?
>
> Andrew
>
> On Mar 31, 2010, at 1:57 PM, K. John Wu wrote:
>
>> Hi, Andrew,
>>
>> Here is a quick explanation of what might be going on
>>
>> - when the string-valued column is treated as 'text', the current code
>> simply performs string comparisons on each row to resolve the
>> condition involved.  Guess in this case it took 3 seconds.
>>
>> - when the string-valued column is treated as 'key', a dictionary is
>> built along with an index file (hence the .dic and .idx files).
>> However, it has a pretty dumb ways of handling the dictionary, which
>> is currently very slow.  The 20 seconds time you've observed should be
>> mostly to reconstruct this dictionary, which only happens once for a
>> data partition.  We definitely need to figure out a better
>> implementation here.  While we are figuring out how to implement the
>> dictionary better, one thing you can do it to avoid running a single
>> query at a time.  Alternatively, you might consider converting the
>> strings into integers yourself.
>>
>> - The .tdlist file is for a different kind of searches.  The
>> short-hand tdlist is for term-document list following the terminology
>> of web search.  This is meant for finding keywords in long strings
>> such as web pages (where a whole web page as a single string).
>>
>> John
>>
>>
>> On 3/30/2010 10:47 PM, Andrew Olson wrote:
>>> I have a 68 million row table with 5 integer columns and 1 column
>>> of text labels.  What is the best way to index the text column?
>>> Cardinality is high (9.7 million distinct values).  The default
>>> treatment creates a .sp file only and takes over 3 seconds to run
>>> (integer columns are<   .01 seconds).  Switching to datatype=
>>> keyword creates .dic and .idx files but takes over 20 seconds.
>>> Should I be using a .tdlist file?
>>>
>>> my queries are of the form "select a,b,c where t='asdf123'"
>>>
>>>
>>> Andrew
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] slow text searching?

Reply via email to