[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2181:
--------------------------------

    Attachment: LUCENE-2181.patch

OK, somehow it completely escaped me that you are using the ReadTokens task :)

This is a problem, because ReadTokens does not respect the DocMaker configuration. In my opinion, it should not tokenize fields unless they are configured to be tokenized, so I added the following to this patch to fix it:

{noformat}
       for(final Fieldable field : fields) {
+        if (!field.isTokenized()) continue;
+
{noformat}

Now we get the results we expect:

||Language||java.text||ICU4J||KeywordAnalyzer||ICU4J Improvement||
|English|3.43s|2.21s|1.15s|115%|
|French|3.78s|2.37s|1.17s|117%|
|German|3.84s|2.42s|1.18s|115%|
|Ukrainian|5.81s|3.67s|1.24s|88%|

If you comment out doc.tokenized=false, you get the other results I just posted instead, because the other fields are then analyzed too.

> benchmark for collation
> -----------------------
>
>                 Key: LUCENE-2181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2181
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>         Attachments: LUCENE-2181.patch, LUCENE-2181.patch, LUCENE-2181.patch,
> LUCENE-2181.patch, LUCENE-2181.patch,
> top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
>
>
> Steven Rowe attached a contrib/benchmark-based benchmark for collation (both
> jdk and icu) under LUCENE-2084, along with some instructions to run it...
> I think it would be nice if we could turn this into a committable patch and
> add it to benchmark.
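
As a rough illustration of the ReadTokens change described in the comment above, here is a minimal, self-contained sketch of the "skip fields that are not tokenized" check. It does not use the Lucene API: SimpleField and DemoField are hypothetical stand-ins for Fieldable, the field names are made up, and fieldsToAnalyze is an illustrative helper rather than anything in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizedFieldFilter {

    /** Hypothetical stand-in for Lucene's Fieldable; only what this sketch needs. */
    interface SimpleField {
        String name();
        boolean isTokenized();
    }

    /** Hypothetical immutable field used for the demo. */
    static final class DemoField implements SimpleField {
        private final String name;
        private final boolean tokenized;

        DemoField(String name, boolean tokenized) {
            this.name = name;
            this.tokenized = tokenized;
        }

        @Override public String name() { return name; }
        @Override public boolean isTokenized() { return tokenized; }
    }

    /** Returns the names of the fields that would be passed to the analyzer. */
    static List<String> fieldsToAnalyze(List<? extends SimpleField> fields) {
        List<String> analyzed = new ArrayList<>();
        for (final SimpleField field : fields) {
            // Same guard as in the patch: untokenized fields are skipped.
            if (!field.isTokenized()) continue;
            analyzed.add(field.name());
        }
        return analyzed;
    }

    public static void main(String[] args) {
        List<SimpleField> fields = Arrays.asList(
            new DemoField("docid", false),   // made-up field names
            new DemoField("body", true),
            new DemoField("title", false));
        System.out.println(fieldsToAnalyze(fields)); // prints [body]
    }
}
```

With the guard in place, only fields configured as tokenized contribute to the measured analysis time, which is why the numbers change when doc.tokenized=false is commented out.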