[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Mark Miller (JIRA) Thu, 04 Sep 2008 15:36:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628496#action_12628496
 ]


Mark Miller commented on LUCENE-1372:
-------------------------------------

Ah, but right now, the documentation will tell you its not supported, and 
possibly even that you can't do it (though you can because the check that would 
stop you is basically a guess). So it looks funny when presenting data because 
you are violating the rules<g> You are not just proposing making it work more 
unintuitive, but by necessity, you are also proposing Lucene support sorting on 
a field with multiple tokens in the first place, because the stance right now 
is that it does not - hence the weak guard around it in the code.


> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effecitvely be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra") which is nonintuitive. To change this to sort on the first 
> time in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not local-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Reply via email to