Hi, Andrew,

Sounds like an interesting idea.  It probably will take some work to 
get it working right.  For example, the discrete range condition 
probably does not do what you expect.  Essentially, you need something 
that gives you the following

foreach p in positions returned from the suffix array, find the row 
number i such that string-i-start <= p <= string-i-end

A discrete range condition only performs equality comparisons against the 
listed values, so a match position that falls inside a string, rather than 
exactly at its recorded start, would never be found...
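
To make that concrete, here is a rough C++ sketch of that lookup.  It 
assumes the strings are concatenated back to back and that you keep the 
per-row start offsets in a sorted array; the names (string_start, 
rowForPosition) are illustrative only, not part of FastBit.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row i occupies positions [string_start[i], string_start[i+1]) in the
// concatenated text; string_start must be sorted in increasing order.
std::size_t rowForPosition(const std::vector<uint64_t> &string_start,
                           uint64_t p)
{
    // Find the first start offset strictly greater than p; the containing
    // row begins one entry earlier.  If the strings are not stored back to
    // back, also check p against the row's end offset.
    std::vector<uint64_t>::const_iterator it =
        std::upper_bound(string_start.begin(), string_start.end(), p);
    return static_cast<std::size_t>(it - string_start.begin()) - 1;
}

// Map every position returned by the suffix array to a row number.
std::vector<std::size_t> rowsForPositions(
    const std::vector<uint64_t> &string_start,
    const std::vector<uint64_t> &positions)
{
    std::vector<std::size_t> rows;
    rows.reserve(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i)
        rows.push_back(rowForPosition(string_start, positions[i]));
    return rows;
}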

Overall, an intriguing idea; it should be worth some effort to investigate 
further.

John


On 3/31/2010 11:32 AM, Andrew Olson wrote:
> Hi John,
> Thanks for the explanation.  I would like to be able to find partial matches 
> as well as perfect matches.
> For example:
> SELECT a,b,c,t WHERE t like "%ABC123%" AND 1000<a<2000 ...
>
> I realize that this is not supported currently, so I was thinking of doing 
> this:
> 1. add another column (t_pos) to the table holding the start positions of t.
> 2. create a suffix array based on these positions
> 3. use the suffix array to find matching positions (fill an array *pos)
> 4. query the table/partition using ibis::qDiscreteRange dr("t_pos", *pos);
> 5. apply other conditions to results.
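>
> Concretely, I picture steps 3 and 4 looking roughly like the sketch below.
> findPositions() is a made-up placeholder for the suffix-array search, and I
> am assuming a qDiscreteRange constructor that takes a column name plus a
> vector of values, as in step 4 above.
>
> #include "ibis.h"       // FastBit's top-level header
> #include <cstdint>
> #include <string>
> #include <vector>
>
> // Hypothetical step-3 helper: start positions of every occurrence of
> // 'pattern' in the concatenated text (not part of FastBit).
> std::vector<double> findPositions(const std::vector<uint64_t> &suffix_array,
>                                   const std::string &text,
>                                   const std::string &pattern);
>
> void buildCondition(const std::vector<uint64_t> &suffix_array,
>                     const std::string &text)
> {
>     // Step 3: matching start positions from the suffix array.
>     std::vector<double> pos = findPositions(suffix_array, text, "ABC123");
>     // Step 4: an IN-list style condition on the auxiliary t_pos column.
>     ibis::qDiscreteRange dr("t_pos", pos);
>     // Step 5: AND this with the numeric conditions (e.g. 1000<a<2000)
>     // before evaluating the query.
> }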
>
> Do you think this sounds like a reasonable approach?  If so, do you think it 
> is a sensible way to add support for this kind of query on 'text' or 'key' 
> columns?
>
> Andrew
>
> On Mar 31, 2010, at 1:57 PM, K. John Wu wrote:
>
>> Hi, Andrew,
>>
>> Here is a quick explanation of what might be going on
>>
>> - when the string-valued column is treated as 'text', the current code
>> simply performs string comparisons on each row to resolve the
>> condition involved.  I am guessing that is what took the 3 seconds in
>> your case.
>>
>> - when the string-valued column is treated as 'key', a dictionary is
>> built along with an index file (hence the .dic and .idx files).
>> However, the current code handles the dictionary in a rather naive
>> way, which is very slow.  The 20 seconds you observed is mostly spent
>> reconstructing this dictionary, which only happens once per data
>> partition.  We definitely need to figure out a better implementation
>> here.  Until we do, one thing you can do is avoid running only a
>> single query at a time, so the cost of rebuilding the dictionary is
>> amortized over many queries.  Alternatively, you might consider
>> converting the strings into integers yourself.
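>>
>> For example, a minimal sketch of such a conversion (plain C++, nothing
>> FastBit-specific; encodeLabels is just an illustrative name) might be:
>>
>> #include <cstdint>
>> #include <map>
>> #include <string>
>> #include <vector>
>>
>> // Assign each distinct string a small integer code; the codes can then
>> // be stored as an ordinary integer column and queried in place of the
>> // strings themselves.
>> std::vector<uint32_t> encodeLabels(const std::vector<std::string> &labels)
>> {
>>     std::map<std::string, uint32_t> codes;   // string -> integer code
>>     std::vector<uint32_t> encoded;
>>     encoded.reserve(labels.size());
>>     for (std::size_t i = 0; i < labels.size(); ++i) {
>>         std::map<std::string, uint32_t>::iterator it = codes.find(labels[i]);
>>         if (it == codes.end())
>>             it = codes.insert(std::make_pair(
>>                      labels[i], static_cast<uint32_t>(codes.size()))).first;
>>         encoded.push_back(it->second);
>>     }
>>     return encoded;
>> }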
>>
>> - The .tdlist file is for a different kind of search.  The name
>> tdlist is short for term-document list, following the terminology of
>> web search.  It is meant for finding keywords in long strings such as
>> web pages (where a whole web page is treated as a single string).
>>
>> John
>>
>>
>> On 3/30/2010 10:47 PM, Andrew Olson wrote:
>>> I have a 68 million row table with 5 integer columns and 1 column
>>> of text labels.  What is the best way to index the text column?
>>> Cardinality is high (9.7 million distinct values).  The default
>>> treatment creates a .sp file only and takes over 3 seconds to run
>>> (integer columns are < .01 seconds).  Switching to datatype=keyword
>>> creates .dic and .idx files but takes over 20 seconds.
>>> Should I be using a .tdlist file?
>>>
>>> my queries are of the form "select a,b,c where t='asdf123'"
>>>
>>>
>>> Andrew
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
