Hi, Andrew, Here is a quick explanation of what might be going on
- when the string-valued column is treated as 'text', the current code simply performs string comparisons on each row to resolve the condition involved. Guess in this case it took 3 seconds. - when the string-valued column is treated as 'key', a dictionary is built along with an index file (hence the .dic and .idx files). However, it has a pretty dumb ways of handling the dictionary, which is currently very slow. The 20 seconds time you've observed should be mostly to reconstruct this dictionary, which only happens once for a data partition. We definitely need to figure out a better implementation here. While we are figuring out how to implement the dictionary better, one thing you can do it to avoid running a single query at a time. Alternatively, you might consider converting the strings into integers yourself. - The .tdlist file is for a different kind of searches. The short-hand tdlist is for term-document list following the terminology of web search. This is meant for finding keywords in long strings such as web pages (where a whole web page as a single string). John On 3/30/2010 10:47 PM, Andrew Olson wrote: > I have a 68 million row table with 5 integer columns and 1 column > of text labels. What is the best way to index the text column? > Cardinality is high (9.7 million distinct values). The default > treatment creates a .sp file only and takes over 3 seconds to run > (integer columns are< .01 seconds). Switching to datatype= > keyword creates .dic and .idx files but takes over 20 seconds. > Should I be using a .tdlist file? > > my queries are of the form "select a,b,c where t='asdf123'" > > > Andrew _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
