[ https://issues.apache.org/jira/browse/LUCENE-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805909#action_12805909 ]
Simon Willnauer edited comment on LUCENE-2183 at 1/28/10 1:16 PM:
------------------------------------------------------------------

I ran the following benchmark .alg file against the latest patch (specialized old and new methods), the patch with the proxy methods, and the old 3.0 code.
The results show that the specialized code is about 8% faster than the proxy-class-based code, so I would rather keep the specialized code, since this class is performance sensitive.

.alg file
{code}
analyzer=org.apache.lucene.analysis.WhitespaceAnalyzer
content.source=org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
content.source.forever=false
{ "Rounds"
    { "ReadTokens" ReadTokens > : *
    NewRound
    ResetSystemErase
} : 10
RepAll
{code}

10 rounds with the latest patch
{code}
[java] ------------> Report All (11 out of 12)
[java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
[java] Rounds_10               0       1           0   0.00       14.83   5,049,432   66,453,504
[java] ReadTokens_Exhaust      0       1           0   0.00        2.07  34,558,000   55,705,600
[java] ReadTokens_Exhaust      1       1           0   0.00        1.40  41,865,312   60,555,264
[java] ReadTokens_Exhaust      2       1           0   0.00        1.22  34,393,904   63,176,704
[java] ReadTokens_Exhaust      3       1           0   0.00        1.24  15,440,624   64,487,424
[java] ReadTokens_Exhaust      4       1           0   0.00        1.22   7,540,512   65,601,536
[java] ReadTokens_Exhaust      5       1           0   0.00        1.21  50,174,760   67,239,936
[java] ReadTokens_Exhaust      6       1           0   0.00        1.19  22,202,768   67,174,400
[java] ReadTokens_Exhaust      7       1           0   0.00        1.19  20,591,672   68,812,800
[java] ReadTokens_Exhaust      8       1           0   0.00        1.18  63,749,984   69,009,408
[java] ReadTokens_Exhaust      9       1           0   0.00        1.19  22,331,600   68,943,872
{code}

10 rounds with the proxy class
{code}
[java] ------------> Report All (11 out of 12)
[java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
[java] Rounds_10               0       1           0   0.00       16.33   5,021,144   67,436,544
[java] ReadTokens_Exhaust      0       1           0   0.00        2.34  44,649,496   59,244,544
[java] ReadTokens_Exhaust      1       1           0   0.00        1.53  36,681,952   61,472,768
[java] ReadTokens_Exhaust      2       1           0   0.00        1.37  13,863,688   64,094,208
[java] ReadTokens_Exhaust      3       1           0   0.00        1.34  50,247,864   65,470,464
[java] ReadTokens_Exhaust      4       1           0   0.00        1.36  14,922,888   66,322,432
[java] ReadTokens_Exhaust      5       1           0   0.00        1.36   5,718,296   67,371,008
[java] ReadTokens_Exhaust      6       1           0   0.00        1.32  54,583,776   67,502,080
[java] ReadTokens_Exhaust      7       1           0   0.00        1.33  35,739,800   68,943,872
[java] ReadTokens_Exhaust      8       1           0   0.00        1.32  24,985,688   69,861,376
[java] ReadTokens_Exhaust      9       1           0   0.00        1.29  64,138,112   69,730,304
{code}

10 rounds with current trunk
{code}
[java] ------------> Report All (11 out of 12)
[java] Operation           round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
[java] Rounds_10               0       1           0   0.00       15.19   5,040,928   66,256,896
[java] ReadTokens_Exhaust      0       1           0   0.00        2.15  39,548,440   55,443,456
[java] ReadTokens_Exhaust      1       1           0   0.00        1.43  28,088,544   60,096,512
[java] ReadTokens_Exhaust      2       1           0   0.00        1.27  16,004,088   61,800,448
[java] ReadTokens_Exhaust      3       1           0   0.00        1.25  51,034,016   63,045,632
[java] ReadTokens_Exhaust      4       1           0   0.00        1.24  23,371,056   63,504,384
[java] ReadTokens_Exhaust      5       1           0   0.00        1.24  12,964,368   65,208,320
[java] ReadTokens_Exhaust      6       1           0   0.00        1.25   6,598,128   65,601,536
[java] ReadTokens_Exhaust      7       1           0   0.00        1.23  50,932,464   67,239,936
[java] ReadTokens_Exhaust      8       1           0   0.00        1.24  20,433,136   67,305,472
[java] ReadTokens_Exhaust      9       1           0   0.00        1.23  63,638,552   68,812,800
{code}

> Supplementary Character Handling in CharTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-2183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2183
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch, LUCENE-2183.patch
>
>
> CharTokenizer is an abstract base class for all Tokenizers operating on a
> character level. Yet, those tokenizers still use char primitives instead of
> int codepoints. CharTokenizer should operate on codepoints and preserve
> backward compatibility.
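
To make the char-versus-codepoint distinction in the issue description concrete, here is a minimal sketch in plain Java. It is not the CharTokenizer code or any of the attached patches; the class name `CodePointTokenizerSketch` and both method names are hypothetical and exist only for illustration. It assumes a tokenizer that keeps runs of letters, and compares a loop over `char` values with a loop over `int` codepoints on input that contains a supplementary character (U+10400, encoded as a surrogate pair).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's CharTokenizer.
public class CodePointTokenizerSketch {

    /** Splits text into runs of letters, testing one UTF-16 char at a time. */
    public static List<String> tokenizeByChar(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A lone surrogate is never a letter, so supplementary letters are lost here.
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Splits text into runs of letters, testing one int codepoint at a time. */
    public static List<String> tokenizeByCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            // Advance by 1 for BMP characters and by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\uD801\uDC00" is U+10400 (DESERET CAPITAL LETTER LONG I), a supplementary letter.
        String text = "ab \uD801\uDC00 cd";
        System.out.println(tokenizeByChar(text).size());      // char-based token count
        System.out.println(tokenizeByCodePoint(text).size()); // codepoint-based token count
    }
}
```

With this input the char-based loop yields two tokens and silently drops the supplementary letter, while the codepoint-based loop yields three.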