[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Paul Cowan (JIRA) Tue, 03 Mar 2009 22:10:27 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Cowan updated LUCENE-1372:
-------------------------------

    Attachment: LUCENE-1372-MultiValueSorters.patch

I think we're after somewhat different things here Uwe, but still pulling 
generally in the same direction.

For your case, I personally am in favour of:
1) replacing (as I did in my original patch) the loops in FieldCacheImpl that 
look like this:
{code}while (termDocs.next()) {
   retArray[termDocs.doc()] = termval;
}{code}
with ones that look like this:
{code}while (termDocs.next()) {
   if (retArray[termDocs.doc() == null) {
     retArray[termDocs.doc()] = termval;
   }
}{code}
(or == 0 for the int case, == 0.0 for the float case, whatever). This, I think, 
meets your sorting goal (order by lexicographically first term using simple 
binary ordering of the term text). You then just need either:
a) a code path that uses FieldCacheImpl.getStrings() rather than than 
.getStringIndex() (the former doesn't care about the more-terms-than-documents 
case), but this is obviously not optimally performant
b) a change to .getStringIndex which doesn't assume that there are fewer terms 
than documents. Not sure if this is harder or not.... don't know if there is an 
easy way to find the number of terms for a field in advance to size the array?
I think 'multi-value fields order by the first term in binary string order' is 
a valid behaviour, doesn't 'dirty' the codebase, is easy to document + explain, 
and suits cases like Uwe's where the fact that it's stored as a String is kind 
of irrelevant (for TrieRange, you'd be just as happy with a byte[] as a String, 
right?) So that, I think, would suit you fine.

2) For OUR case, we might have docs indicated above:
doc 1: {"apple"}
doc 2: {"banana"}
doc 3: {"apple", "banana"}
doc 4: {"apple", "zebra"}
and we'd like them sorted lexicographically in what most english speakers would 
call the 'expected' order (1, 3, 4, 2) this won't really help (the case above 
was really just a half-hearted compromise). You might ask why we don't just 
index a single term for each ("apple", "banana", "apple/banana", "apple/zebra" 
and sort by that, but as well as being flaky if the separator character is used 
in an actual term, this breaks for multi-language sorting; 'banana' might sort 
before 'apple' in another locale. Imagine if these were people's surnames, we 
need to follow expected order. If you have 1000 values in random combinations 
of 10 this also makes the index terms eat up serious memory)

For this case, I have attached a patch which may or may not be a useful basis 
for doing this behaviour 'correctly'. It's implemented as a static factory 
class for producing SortComparatorSources which have the correct behaviour. 
There's little javadoc for now, but a test case which should explain relatively 
easily how it works. This is very low-impact on Lucene; if people want it in 
Lucene-core, great; if it can go in contrib, great; if not, we can keep it 
separate, though for it to work there is a minor change required to 
[Extended]FieldCacheImpl to expose the default parsers. We could remove that, 
but it would be good to up those default parsers to default access.

Please let me know what you think of this patch; it's not overly performance 
(rather than a String[] or a float[] for the terms, it uses a 
ArrayList<String>[] or ArrayList<Float>[], which is more overhead (especially 
in the latter case; 4 bytes per doc for a primitive float explodes a bit for an 
ArrayList of Float objects) but I suspect this will be acceptable for certain 
cases and can be appropriately documented.

> Proposal: introduce more sensible sorting when a doc has multiple values for 
> a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1372-MultiValueSorters.patch, 
> lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting 
> on a field for which multiple values exist for one document. For example, 
> imagine a field "fruit" which is added to a document multiple times, with the 
> values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in 
> FieldCacheImpl.stringsIndexCache.createValue() (and similarly for the other 
> methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we look over the terms in their natural order and, on each 
> one, overwrite retArray[doc] with the value for each document with that term. 
> Effectively, this overwriting means that a string sort in this circumstance 
> will sort by the LAST term lexicographically, so the docs above will 
> effecitvely be sorted as if they had the single values ("apple", "banana", 
> "banana", "zebra") which is nonintuitive. To change this to sort on the first 
> time in the TermEnum seems relatively trivial and low-overhead; while it's 
> not perfect (it's not local-aware, for example) the behaviour seems much more 
> sensible to me. Interested to see what people think.
> Patch to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Updated: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term

Reply via email to