Re: Sort fields shouldn't be tokenized

J.J. Larrea Mon, 16 Nov 2009 08:19:40 -0800

It's not universally true that a tokenized field cannot be used as a sort field, but it is true that you will not get the desired sort order except in special cases:

Lucene's indexes of course contain inverted tables which map Term -> DocumentID, DocumentID, ... But for sorting, once a set of Document IDs has been selected, the respective Term values are used as an ordering key. In order to do that, the first time a field is referenced for sorting, a FieldCache table is allocated and pre-filled with Document -> Term mappings. For indexed text which is tokenized into multiple Terms, only the first one is retained. This is done for efficiency reasons (lookup speed and memory utilization).


So say that, for a title field, you had indexed strings such as:

The Turkey and its Predators
Turkey Cooking made Easy
Turkeys and their Discontent

Assuming the typical analysis steps of case folding, stopword removal, depunctuation, depluralization, etc., the indexed Terms would be something on the order of:


turkey / predator
turkey / cooking / made / easy
turkey / their / discontent

but sorting would only use the initial token 'turkey' for the title field, and all such documents starting with turkey would be ordered essentially at random (by Document ID) in the hit list, subject of course to any subsequent sorting stages. Which is likely NOT what you would want for title sorting.
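To make the effect concrete, here is a minimal sketch in plain Java. It is a toy model of the behavior as described in this message, not Lucene code: the class TokenizedSortSketch and its methods are made up for illustration, and the array standing in for the FieldCache simply keeps one term per document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: names are hypothetical and nothing here is a Lucene API.
class TokenizedSortSketch {
    // Analyzed terms per document, indexed by Document ID (from the example above).
    static final String[][] TERMS = {
        {"turkey", "predator"},
        {"turkey", "cooking", "made", "easy"},
        {"turkey", "their", "discontent"},
    };

    // Stand-in for the FieldCache table: one slot per document, one term per slot.
    static String[] buildSortCache(String[][] termsByDoc) {
        String[] cache = new String[termsByDoc.length];
        for (int doc = 0; doc < termsByDoc.length; doc++) {
            // Assumption taken from the message: only the first term is retained.
            cache[doc] = termsByDoc[doc][0];
        }
        return cache;
    }

    // Order the selected Document IDs by cached term; ties fall back to Document ID.
    static List<Integer> sortHits(List<Integer> hits, String[] cache) {
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> {
            int byTerm = cache[a].compareTo(cache[b]);
            return byTerm != 0 ? byTerm : Integer.compare(a, b);
        });
        return sorted;
    }
}
```

In this sketch every document gets the same key, 'turkey', so sorting the hits 2, 0, 1 just yields Document ID order 0, 1, 2, whereas an alphabetical title sort would put "Turkey Cooking made Easy" first.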

Rather, you would certainly want to retain case folding, and probably retain stopword removal and depunctuation and maybe depluralization (perhaps with the rules somewhat altered from the field variant used for searching), but turn off any tokenization, and any operations like synonym substitution/enhancement that could alter the sort order in user-unexpected ways.
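As a rough sketch of that idea, the plain-Java helper below builds one untokenized sort key per title. The class name, the stopword list, and the punctuation rule are assumptions chosen for illustration, and depluralization is left out; a real setup would reuse whatever normalization rules the application already has.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative helper only: names and rules are hypothetical, not a Lucene analyzer.
class SortKeySketch {
    // Hypothetical stopword list; substitute the one used by your own analysis chain.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "its"));

    static String sortKey(String title) {
        // Case folding.
        String folded = title.toLowerCase(Locale.ROOT);
        // Depunctuation: drop everything except letters, digits and whitespace.
        String cleaned = folded.replaceAll("[^\\p{L}\\p{N}\\s]", "");
        // Stopword removal, keeping the remaining words in their original order.
        List<String> kept = new ArrayList<>();
        for (String word : cleaned.trim().split("\\s+")) {
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                kept.add(word);
            }
        }
        // Rejoin into a single value so the whole string acts as one sort term.
        return String.join(" ", kept);
    }
}
```

With these assumed rules, the three example titles become "turkey predators", "turkey cooking made easy" and "turkeys their discontent", which sort in the order a user would expect. In Lucene of this vintage such a key would typically go into a separate field (say, a hypothetical title_sort) indexed with Field.Index.NOT_ANALYZED and be referenced through a SortField, while the tokenized title field stays in place for searching.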


Does the proviso make more sense now?

- J.J. Larrea

On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote:

I am looking at adding some sorting functionality to my application and read that Sort fields should not be tokenized - can anyone explain why?
I have code that is tokenizing the sort fields and it seems to be
working. Is it just because some tokenizing can change the value (like
remove stop words and such) which can produce an invalid sort order?
Thanks.

-Jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



