[ https://issues.apache.org/jira/browse/LUCENE-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-2084:
--------------------------------

    Attachment: TopTFWikipediaWords.tar.bz2

TopTFWikipediaWords.tar.bz2 contains a Maven2 project that parses unpacked Wikipedia dump files and creates a Lucene index from the tokens produced by the contrib WikipediaTokenizer. It then iterates over the indexed tokens' term docs, accumulating term frequencies, and stores the results in a bounded priority queue. Finally, it writes the output in contrib benchmark LineDoc format, with the title field containing the collection term frequency, the date field containing the date the file was generated, and the body field containing the term text.

This code knows how to handle English, German, French, and Ukrainian, but could be extended for other languages.

I used this project to generate the line-docs for the 4 languages' 100k most frequent terms, in the collation benchmark archive attachment on this issue.

> remove Byte/CharBuffer wrapping for collation key generation
> ------------------------------------------------------------
>
>                 Key: LUCENE-2084
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2084
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: collation.benchmark.tar.bz2, LUCENE-2084.patch,
> LUCENE-2084.patch, TopTFWikipediaWords.tar.bz2
>
>
> We can remove the overhead of ByteBuffer and CharBuffer wrapping in
> CollationKeyFilter and ICUCollationKeyFilter.
> This patch moves the logic in IndexableBinaryStringTools into char[],int,int
> and byte[],int,int based methods, with the previous Byte/CharBuffer methods
> delegating to these.
> Previously, the Byte/CharBuffer methods required a backing array anyway.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
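
For readers who want a feel for the "bounded priority queue" step in the comment on this issue, here is a minimal sketch in plain Java. It is not the code in TopTFWikipediaWords.tar.bz2 and does not use Lucene: the class TopTermsSketch, its methods, and the in-memory frequency map are all hypothetical stand-ins for the totals accumulated from the index's term docs. The tab-separated title/date/body line is an assumption about the LineDoc layout, and the date is passed in as an opaque string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keep only the maxSize most frequent terms.
class TopTermsSketch {

    /** A term paired with its collection-wide term frequency. */
    static final class TermCount {
        final String term;
        final long freq;

        TermCount(String term, long freq) {
            this.term = term;
            this.freq = freq;
        }
    }

    /**
     * Returns at most maxSize entries with the highest frequencies,
     * ordered from most to least frequent (ties broken by term text).
     */
    static List<TermCount> topTerms(Map<String, Long> freqs, int maxSize) {
        List<TermCount> result = new ArrayList<>();
        if (maxSize <= 0) {
            return result;
        }
        // Min-heap on frequency: the head is the weakest entry kept so far.
        PriorityQueue<TermCount> heap =
            new PriorityQueue<>(maxSize, (a, b) -> Long.compare(a.freq, b.freq));
        for (Map.Entry<String, Long> e : freqs.entrySet()) {
            long freq = e.getValue();
            if (heap.size() < maxSize) {
                heap.add(new TermCount(e.getKey(), freq));
            } else if (freq > heap.peek().freq) {
                heap.poll(); // evict the current weakest entry
                heap.add(new TermCount(e.getKey(), freq));
            }
        }
        result.addAll(heap);
        Comparator<TermCount> byFreqDescThenTerm = (a, b) -> {
            int c = Long.compare(b.freq, a.freq);
            return c != 0 ? c : a.term.compareTo(b.term);
        };
        result.sort(byFreqDescThenTerm);
        return result;
    }

    /** Formats one entry as title TAB date TAB body (assumed LineDoc layout). */
    static String toLineDoc(TermCount tc, String date) {
        return tc.freq + "\t" + date + "\t" + tc.term;
    }
}
```

Calling topTerms with maxSize set to 100000 and then toLineDoc on each entry would produce one line per term. Memory for the queue stays bounded by maxSize no matter how many distinct terms the index holds, which is presumably the reason for bounding it.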