It's not universally true that a tokenized field cannot be used as a sort field, but it is true that you will not get the desired sort order except in special cases:

Lucene's indexes of course contain inverted tables which map Term -> DocumentID, DocumentID, ... But for sorting, once a set of Document IDs has been selected, the respective Term values are used as an ordering key. In order to do that, the first time a field is referenced for sorting, a FieldCache table is allocated and pre-filled with Document -> Term mappings. For indexed text which is tokenized into multiple Terms, only the first one is retained. This is done for efficiency reasons (lookup speed and memory utilization).

So suppose that for a title field you had indexed strings such as:

The Turkey and its Predators
Turkey Cooking made Easy
Turkeys and their Discontent

Assuming the typical analysis steps of case folding, stopword removal, depunctuation, depluralization, etc., the indexed Terms would be something along the lines of:

turkey / predator
turkey / cooking / made / easy
turkey / their / discontent

but sorting would use only the initial token 'turkey' for the title field, and all such documents starting with turkey would appear in the hitlist in effectively arbitrary (Document ID) order, subject of course to any subsequent sorting stages. That is likely NOT what you would want for title sorting.
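
To make the effect concrete, here is a small plain-Java simulation of the behaviour described above. It is only a sketch and not Lucene's FieldCache code: the class FirstTermSortDemo and its methods are made up for illustration, and the per-document table is modelled as a simple array holding the first term of each document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the described behaviour; NOT Lucene's FieldCache code.
class FirstTermSortDemo {

    // Hypothetical stand-in for the per-document sort table: one key per
    // document ID, taken from the first of that document's analyzed terms.
    static String[] buildFirstTermCache(String[][] termsByDoc) {
        String[] cache = new String[termsByDoc.length];
        for (int docId = 0; docId < termsByDoc.length; docId++) {
            cache[docId] = termsByDoc[docId].length > 0 ? termsByDoc[docId][0] : "";
        }
        return cache;
    }

    // Returns document IDs ordered by cached key. List.sort is stable, so
    // documents with equal keys stay in document ID order.
    static List<Integer> sortDocs(String[] cache) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < cache.length; docId++) {
            docIds.add(docId);
        }
        docIds.sort(Comparator.comparing((Integer docId) -> cache[docId]));
        return docIds;
    }

    public static void main(String[] args) {
        String[][] termsByDoc = {
            {"turkey", "predator"},                 // doc 0
            {"turkey", "cooking", "made", "easy"},  // doc 1
            {"turkey", "their", "discontent"},      // doc 2
        };
        String[] cache = buildFirstTermCache(termsByDoc);
        System.out.println(sortDocs(cache)); // prints [0, 1, 2]
    }
}
```

All three documents share the key 'turkey', so the sort cannot tell them apart and they come back in document ID order, even though alphabetically "Turkey Cooking made Easy" should come first.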

Rather, you would certainly want to retain case folding, and probably retain stopword removal and depunctuation and maybe depluralization (perhaps with the rules somewhat altered from the field variant used for searching), but turn off any tokenization, and any operations like synonym substitution/enhancement that could alter the sort order in user-unexpected ways.
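
As a rough sketch of what such a sort key could look like, the plain-Java example below folds case, strips punctuation and drops stop words, but keeps the result as one single untokenized string per title. Everything in it is an assumption for illustration: the class TitleSortKeyDemo, the tiny stop word list, and the choice to skip depluralization are mine, and none of it is a Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sort-key normalizer; the names and stop list are hypothetical.
class TitleSortKeyDemo {

    // Small example stop list, not Lucene's default set.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "its", "of", "the", "their"));

    // Builds ONE key per title: case folded, depunctuated, stop words dropped.
    // Depluralization is deliberately left out of this sketch.
    static String sortKey(String title) {
        String folded = title.toLowerCase(Locale.ROOT);
        String depunctuated = folded.replaceAll("[^\\p{L}\\p{Nd}]+", " ").trim();
        StringBuilder key = new StringBuilder();
        for (String word : depunctuated.split(" ")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(word);
        }
        return key.toString();
    }

    // Returns a sorted copy of the titles, ordered by their whole sort key.
    static List<String> sortTitles(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(TitleSortKeyDemo::sortKey));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "The Turkey and its Predators",
                "Turkey Cooking made Easy",
                "Turkeys and their Discontent");
        for (String title : sortTitles(titles)) {
            System.out.println(sortKey(title) + "  <-  " + title);
        }
    }
}
```

Because the whole normalized string is the key, the three titles now sort as "Turkey Cooking made Easy", "The Turkey and its Predators", "Turkeys and their Discontent". A key like this would go into its own field, separate from the tokenized field used for searching.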

Does the proviso make more sense now?

- J.J. Larrea

On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote:


I am looking at adding some sorting functionality to my application and read that Sort fields should not be tokenized - can anyone explain why?
I have code that is tokenizing the sort fields and it seems to be
working. Is it just because some tokenizing can change the value (like
remove stop words and such) which can produce an invalid sort order?
Thanks.

-Jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


