Hi, Andrew,

Here is a quick explanation of what might be going on

- when the string-valued column is treated as 'text', the current code 
simply performs string comparisons on each row to resolve the 
condition involved.  Guess in this case it took 3 seconds.

- when the string-valued column is treated as 'key', a dictionary is 
built along with an index file (hence the .dic and .idx files). 
However, it has a pretty dumb ways of handling the dictionary, which 
is currently very slow.  The 20 seconds time you've observed should be 
mostly to reconstruct this dictionary, which only happens once for a 
data partition.  We definitely need to figure out a better 
implementation here.  While we are figuring out how to implement the 
dictionary better, one thing you can do it to avoid running a single 
query at a time.  Alternatively, you might consider converting the 
strings into integers yourself.

- The .tdlist file is for a different kind of searches.  The 
short-hand tdlist is for term-document list following the terminology 
of web search.  This is meant for finding keywords in long strings 
such as web pages (where a whole web page as a single string).

John


On 3/30/2010 10:47 PM, Andrew Olson wrote:
> I have a 68 million row table with 5 integer columns and 1 column
> of text labels.  What is the best way to index the text column?
> Cardinality is high (9.7 million distinct values).  The default
> treatment creates a .sp file only and takes over 3 seconds to run
> (integer columns are<  .01 seconds).  Switching to datatype=
> keyword creates .dic and .idx files but takes over 20 seconds.
> Should I be using a .tdlist file?
>
> my queries are of the form "select a,b,c where t='asdf123'"
>
>
> Andrew
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Reply via email to