[jira] Issue Comment Edited: (LUCENE-2181) benchmark for collation

Steven Rowe (JIRA) Sun, 10 Jan 2010 09:58:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798519#action_12798519
 ]


Steven Rowe edited comment on LUCENE-2181 at 1/10/10 5:56 PM:
--------------------------------------------------------------

bq. What about this per-field thing, what if in the data files, title and date 
were simply blank?

Hmm, although the date field value is meaningless, I like the TF-in-title-field 
thing.

{quote}
Or should we worry? I agree it's stupid, but does it skew the results?
One way to look at it is that it's also fairly realistic (even though it's 
meaningless, you see numbers and dates everywhere).
{quote}

I was thinking that it would, and that it's not really a meaningful test of 
collation - who's going to bother running collation over integers and dates? - 
but since the comparison here is between two implementations of collation, I 
think you're right that there is no skew in doing this comparison:
{panel}
icu(kiwi) + icu(apple) + icu(orange) : jdk(kiwi) + jdk(apple) + jdk(orange)
{panel}
instead of this one:
{panel}
keyword(kiwi) + keyword(apple) + icu(orange) : keyword(kiwi) + keyword(apple) + 
jdk(orange)
{panel}
(where the icu(X) transform = keyword(X) + icu-collation(X), and similarly for 
the jdk(X) transform)
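
A minimal sketch of the jdk(X) transform described above, assuming the JDK's 
java.text.Collator and an English locale. The class and method names are 
hypothetical and are not taken from the LUCENE-2181 patch. An icu(X) variant 
would presumably swap in ICU4J's com.ibm.icu.text.Collator, which is not shown 
here because it needs an external jar.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not code from the benchmark patch.
final class JdkCollationSketch {

    // keyword(X): treat the whole field value as a single token.
    static String keyword(String fieldValue) {
        return fieldValue;
    }

    // jdk(X) = keyword(X) + jdk-collation(X): the token's collation key bytes.
    static byte[] jdk(Collator collator, String fieldValue) {
        return collator.getCollationKey(keyword(fieldValue)).toByteArray();
    }

    // Apply the same transform to every field, as in the first comparison,
    // and return the total number of key bytes produced.
    static long totalKeyBytes(Collator collator, List<String> fieldValues) {
        long total = 0;
        for (String value : fieldValues) {
            total += jdk(collator, value).length;
        }
        return total;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> fieldValues = Arrays.asList("kiwi", "apple", "orange");

        long start = System.nanoTime();
        long bytes = totalKeyBytes(collator, fieldValues);
        long elapsedNanos = System.nanoTime() - start;

        System.out.println(bytes + " key bytes in " + elapsedNanos + " ns");
    }
}
```

Because both sides of the comparison would run their transform over the same 
three values, the extra collation work on the non-body fields is paid equally 
by each implementation.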

bq. The downside to doing a per-analyzer wrapper is that it introduces some 
complexity; in all honesty, this is not really specific to this collation task, 
right? (i.e. the existing analysis/tokenization benchmarks have this same 
problem)

Yup, you're right.  A general facility to do this will end up looking (modulo 
syntax) like Solr's per-field analysis specification.
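
To make the shape of such a facility concrete, here is a rough sketch of a 
per-field lookup with a default. Everything in it is an assumption made for 
illustration: the class name, the use of plain string transforms in place of 
real analyzers, and the "body" field name. It is not Solr's schema syntax and 
not Lucene's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a per-field analysis facility; none of these
// names come from Lucene, Solr, or the LUCENE-2181 patch.
final class PerFieldAnalysisSketch {

    private final UnaryOperator<String> defaultAnalysis;
    private final Map<String, UnaryOperator<String>> byField = new LinkedHashMap<>();

    PerFieldAnalysisSketch(UnaryOperator<String> defaultAnalysis) {
        this.defaultAnalysis = defaultAnalysis;
    }

    // Register a field-specific analysis; returns this for chaining.
    PerFieldAnalysisSketch register(String field, UnaryOperator<String> analysis) {
        byField.put(field, analysis);
        return this;
    }

    // Use the field's own analysis if one is registered, else the default.
    String analyze(String field, String value) {
        return byField.getOrDefault(field, defaultAnalysis).apply(value);
    }

    public static void main(String[] args) {
        UnaryOperator<String> keyword = value -> value;
        // Placeholder for a real collation step.
        UnaryOperator<String> collated = value -> "collated(" + value + ")";

        PerFieldAnalysisSketch analysis =
                new PerFieldAnalysisSketch(keyword).register("body", collated);

        System.out.println(analysis.analyze("title", "kiwi"));
        System.out.println(analysis.analyze("date", "apple"));
        System.out.println(analysis.analyze("body", "orange"));
    }
}
```

This corresponds to the second comparison above, where only one field gets the 
collation step and the others fall back to keyword.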

> benchmark for collation
> -----------------------
>
>                 Key: LUCENE-2181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2181
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: LUCENE-2181.patch, LUCENE-2181.patch, 
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both 
> jdk and icu) under LUCENE-2084, along with some instructions to run it... 
> I think it would be nice if we could turn this into a committable patch and 
> add it to benchmark.
