It's not universally true that a tokenized field cannot be used as a
sort field, but it is true that you will not get the desired sort
order except in special cases:
Lucene's indexes of course contain inverted tables which map Term ->
DocumentID, DocumentID, ...
But for sorting, once a set of Document IDs have been selected, the
respective Term values are used as an ordering key.
In order to do that, the first time a field is referenced for sorting
a FieldCache table is allocated and pre-filled with Document -> Term
mappings.
For indexed text which is tokenized into multiple Terms, only the
first one is retained. This is done for efficiency concerns (lookup
speed and memory utilization).
So for say a title field you had indexed strings such as:
The Turkey and its Predators
Turkey Cooking made Easy
Turkeys and their Discontent
Assuming the typical analysis steps of case folding, stopword removal,
depunctuation, depluralization, etc. the indexed Terms would be
something on the order of:
turkey / predator
turkey / cooking / made / easy
turkey / their / discontent
but sorting would only use the initial token 'turkey' for the title
field, and all such documents starting with turkey would be randomly
(Document ID) ordered in the hitlist — subject of course to any
subsequent sorting stages. Which is likely NOT what you would want
for title sorting.
Rather, you would certainly want to retain case folding, and probably
retain stopword removal and depunctuation and maybe depluralization
(perhaps with the rules somewhat altered from the field variant used
for searching), but turn off any tokenization, and an operations like
synonym substitution/enhancement that could alter the sort order in
user-unexpected ways.
Does the proviso make more sense now?
- J.J. Larrea
On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote:
I am looking at adding some sorting functionality to my application
and
read that Sort fields should not be tokenized - can anyone explain
why?
I have code that is tokenizing the sort fields and it seems to be
working. Is it just because some tokenizing can change the value
(like
remove stop words and such) which can produce an invalid sort order?
Thanks.
-Jeff
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org