Thanks - so if my sort field is a single term then I should be ok with using an analyzer (to lowercase it for example).
-Jeff -----Original Message----- From: J.J. Larrea [mailto:j...@panix.com] Sent: Monday, November 16, 2009 11:19 AM To: java-user@lucene.apache.org Subject: Re: Sort fields shouldn't be tokenized It's not universally true that a tokenized field cannot be used as a sort field, but it is true that you will not get the desired sort order except in special cases: Lucene's indexes of course contain inverted tables which map Term -> DocumentID, DocumentID, ... But for sorting, once a set of Document IDs have been selected, the respective Term values are used as an ordering key. In order to do that, the first time a field is referenced for sorting a FieldCache table is allocated and pre-filled with Document -> Term mappings. For indexed text which is tokenized into multiple Terms, only the first one is retained. This is done for efficiency concerns (lookup speed and memory utilization). So for say a title field you had indexed strings such as: The Turkey and its Predators Turkey Cooking made Easy Turkeys and their Discontent Assuming the typical analysis steps of case folding, stopword removal, depunctuation, depluralization, etc. the indexed Terms would be something on the order of: turkey / predator turkey / cooking / made / easy turkey / their / discontent but sorting would only use the initial token 'turkey' for the title field, and all such documents starting with turkey would be randomly (Document ID) ordered in the hitlist - subject of course to any subsequent sorting stages. Which is likely NOT what you would want for title sorting. Rather, you would certainly want to retain case folding, and probably retain stopword removal and depunctuation and maybe depluralization (perhaps with the rules somewhat altered from the field variant used for searching), but turn off any tokenization, and an operations like synonym substitution/enhancement that could alter the sort order in user-unexpected ways. Does the proviso make more sense now? - J.J. Larrea On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote: > > I am looking at adding some sorting functionality to my application > and > read that Sort fields should not be tokenized - can anyone explain > why? > I have code that is tokenizing the sort fields and it seems to be > working. Is it just because some tokenizing can change the value > (like > remove stop words and such) which can produce an invalid sort order? > Thanks. > > -Jeff > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org