[jira] [Issue Comment Edited] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
[ https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324 ] Christian Moen edited comment on LUCENE-3935 at 3/29/12 3:51 PM: ---

Thanks. Robert has done a great job making the binary version of {{matrix.def}} tiny with fancy encoding of the data. Very impressive!

I've attached a patch and verified that segmentation (surface forms only) matches exactly that of the two-dimensional array, based on approx. 100,000 Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on {{ConnectionCosts.get()}} after the change. The method is called very, very frequently during indexing, and its total CPU contribution is ~7-8% _after the change_, so the net improvement here is not more than a couple of percent. I was expecting more than a 13% increase in this method's performance, hoping that all the connection costs would be in very local cache, but this number looks correct to me.

Would be great to get your feedback on whether this is in line with expectations, Dawid and Robert. Do we still want to apply this?

Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
---
Key: LUCENE-3935 URL: https://issues.apache.org/jira/browse/LUCENE-3935 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Attachments: LUCENE-3935.patch

I've been profiling Kuromoji, and not very surprisingly, the method {{ConnectionCosts.get(int forwardId, int backwardId)}}, which looks up costs for the Viterbi search, is called many, many times and contributes more processing time than I had expected. The method is currently backed by a {{short[][]}}. The data structure here is a two-dimensional array with both dimensions fixed at 1316 elements. (The data is {{matrix.def}} in MeCab-IPADIC.) We can rewrite this to use a single one-dimensional array instead; we will save at least one bounds check and a pointer reference, and we should also get much better cache utilization since this structure is likely to stay in very local CPU cache. I think this will be a nice optimization. Working on it...

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
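The 2-D to 1-D rewrite discussed above can be sketched as follows. This is an illustration of the technique, not the actual Lucene patch; the class name, the construction from an existing {{short[][]}}, and the size constants are made up for the example:

```java
// Sketch of flattening a fixed-size 2-D cost matrix into one array.
// Illustrative only -- not the actual LUCENE-3935 patch.
final class FlatConnectionCosts {
    // matrix.def in MeCab-IPADIC is 1316 x 1316
    private static final int FORWARD_SIZE = 1316;
    private static final int BACKWARD_SIZE = 1316;

    // One flat, row-major array: a single bounds check per lookup,
    // no pointer chase through a row array, and contiguous rows in memory.
    private final short[] costs;

    FlatConnectionCosts(short[][] matrix) {
        costs = new short[FORWARD_SIZE * BACKWARD_SIZE];
        for (int f = 0; f < FORWARD_SIZE; f++) {
            for (int b = 0; b < BACKWARD_SIZE; b++) {
                costs[f * BACKWARD_SIZE + b] = matrix[f][b];
            }
        }
    }

    int get(int forwardId, int backwardId) {
        // replaces costs[forwardId][backwardId] on the short[][] version
        return costs[forwardId * BACKWARD_SIZE + backwardId];
    }
}
```

Since the Viterbi inner loop tends to vary {{backwardId}} fastest, keeping each forward row contiguous is what gives the cache-locality benefit described above.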
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 4:21 PM: ---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done. I've extracted the text fairly accurately from Japanese Wikipedia and removed all the gory markup, so the content is clean. There are 1,443,764 documents in total, a mix of short and very long documents. These have been converted to files in Solr XML format, with 1,000 documents per file.

I'm running Solr simply using {noformat} java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar {noformat} so I'm not using any fancy GC options. I'm posting using {noformat} curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml {noformat} and committing after all the files have been posted with {noformat} curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>' {noformat} Posting the entire Wikipedia in one file would perhaps be a lot faster.

Posting took {noformat} real 18m39.206s user 0m12.682s sys 0m11.065s {noformat} The GC log looks fine, with a maximum GC time of 0.0187319 seconds. There wasn't even a full GC, probably due to the large heap size.

I'm attaching these files:
|| Filename || Description ||
| jawiki-index-gc.log | GC log |
| jawiki-index-gcviewer.png | Screenshot from GCViewer |
| jawiki-index-visualvm.png | Screenshot from VisualVM |

Note that GCViewer had problems parsing the log file, so the data in the screenshot might be off.

Perform Kuromoji/Japanese stability test before 3.6 freeze
--
Key: SOLR-3282 URL: https://issues.apache.org/jira/browse/SOLR-3282 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Assignee: Christian Moen Attachments: jawiki-index-gc.log, jawiki-index-gcviewer.png, jawiki-index-visualvm.png

Kuromoji might be used by many, including in mission-critical systems. I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows:
# Index all of the Japanese Wikipedia documents (approx. 1.4M documents) in a never-ending loop
# Simultaneously run many tens of thousands of typical Japanese queries against the index at 3-5 queries per second with highlighting turned on
While Solr is indexing and searching, I'd like to verify that:
* Indexing and queries are working as expected
* Memory and heap usage look stable over time
* Garbage collection is overall low over time -- no Full-GC issues
I'll post findings and results to this JIRA.
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239597#comment-13239597 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 4:26 PM: --- h5. Test 2: Searching without highlighting (no indexing) After the Wikipedia index was build, I've ran 250,000 fairly common Japanese queries against the index without highlighting and by using simple means. For this test, I was running Java using {noformat} java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar {noformat} so - small/normal heap size to keep memory pressure a bit high and no fancy GC options -- and all of Wikipedia searchable (!) The queries are on the form {noformat} /solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84 {noformat} which is {noformat} /solr/select/?q=無料占い {noformat} in plain unquoted form. Running the 250,000 queries took 1838.5 seconds and the test was roughly able to keep 80% of its queries within 0.5 second latency and serve a sustained load of 142 QPS. The GC logs have some Full GC entries in them: || GC Activity || Time || | Full GC 57558K-36262K(126912K) | 0.2926001 secs | | Full GC 120759K-37151K(126912K) | 0.2948184 secs | | Full GC 118817K-38305K(126912K) | 0.3726583 secs | | Full GC 116992K-40203K(126912K) | 0.3688027 secs | | Full GC 119572K-39070K(126912K) | 0.2896587 secs | | Full GC 121476K-39257K(126912K) | 0.3034882 secs | | Full GC 119659K-39451K(126912K) | 0.3078915 secs | | Full GC 116948K-39770K(126912K) | 0.2407321 secs | | Full GC 118382K-40442K(126912K) | 0.5224920 secs | The regular GC entries took a maximum of 0.0731031 seconds, but most half or or less. || Filename || Description || | 250k-queries-no-highlight-gc.log | Screenshot from GCViewer | | 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM | GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so I'm not attaching a screenshot for this. was (Author: cm): h5. 
Test 2: Searching without highlighting (no indexing) After the Wikipedia index was build, I've ran 250,000 fairly common Japanese queries against the index without highlighting and by using simple means. For this test, I was running Java using {noformat} java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar {noformat} so - small/normal heap size and no fancy GC options (and all of Wikipedia searchable) The queries are on the form {noformat} /solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84 {noformat} which is {noformat} /solr/select/?q=無料占い {noformat} in plain unquoted form. Running the 250,000 queries took 1838.5 seconds and the test was roughly able to keep 80% of its queries within 0.5 second latency and serve a sustained load of 142 QPS. The GC logs have some Full GC entries in them: || GC Activity || Time || | Full GC 57558K-36262K(126912K) | 0.2926001 secs | | Full GC 120759K-37151K(126912K) | 0.2948184 secs | | Full GC 118817K-38305K(126912K) | 0.3726583 secs | | Full GC 116992K-40203K(126912K) | 0.3688027 secs | | Full GC 119572K-39070K(126912K) | 0.2896587 secs | | Full GC 121476K-39257K(126912K) | 0.3034882 secs | | Full GC 119659K-39451K(126912K) | 0.3078915 secs | | Full GC 116948K-39770K(126912K) | 0.2407321 secs | | Full GC 118382K-40442K(126912K) | 0.5224920 secs | The regular GC entries took a maximum of 0.0731031 seconds, but most half or or less. || Filename || Description || | 250k-queries-no-highlight-gc.log | Screenshot from GCViewer | | 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM | GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so I'm not attaching a screenshot for this. 
Perform Kuromoji/Japanese stability test before 3.6 freeze -- Key: SOLR-3282 URL: https://issues.apache.org/jira/browse/SOLR-3282 Project: Solr Issue Type: Task Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Assignee: Christian Moen Attachments: 250k-queries-no-highlight-gc.log, 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, jawiki-index-gcviewer.png, jawiki-index-visualvm.png Kuromoji might be used by many, and also in mission-critical systems. I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows: # Index all Japanese Wikipedia documents (approx. 1.4M documents) in a never-ending loop # Simultaneously run many tens of thousands of typical Japanese queries against the index at 3-5 queries per second with highlighting turned on While Solr is indexing and searching, I'd like to verify that: * Indexing and queries are working as expected * Memory and heap usage looks stable over time * Garbage collection is overall low over time -- no Full-GC issues I'll post findings and results to this JIRA.
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 4:37 PM: --- h5. Test setup My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB memory, a Core i7 CPU (4 cores), a 500GB SSD and too many things running. (The purpose of the test is to test stability, not to provide accurate performance numbers, although I also hope to do that.) My java is as follows: {noformat} [cm@ayu:~] java -version java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527) Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode) {noformat} I've added fields body and title to {{schema.xml}}, and they're using the default Japanese configuration in {{text_ja}}. The default search field is body.
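For context, the field additions described above would look roughly like this in {{schema.xml}}; the attribute choices are my assumptions, not a copy of the actual file:

```xml
<!-- Fields added for the stability test; text_ja is the out-of-the-box
     Japanese field type shipped with Solr 3.6 -->
<field name="title" type="text_ja" indexed="true" stored="true"/>
<field name="body"  type="text_ja" indexed="true" stored="true"/>

<!-- "The default search field is body" -->
<defaultSearchField>body</defaultSearchField>
```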
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
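The 3-5 queries per second the issue plan calls for can be paced with a simple sleep loop. A stub sketch only: the request is replaced by a no-op so the loop runs without a server, and the query names are placeholders, not the actual query set:

```shell
# Pace queries at roughly 4/second by sleeping 0.25s between requests.
issue_query() {
  # A real call would be something like:
  #   curl -s -o /dev/null "http://localhost:8983/solr/select/?q=$1"
  :   # no-op stub so the pacing loop is runnable anywhere
}

count=0
for q in q1 q2 q3 q4; do   # placeholders for percent-encoded Japanese queries
  issue_query "$q"
  count=$((count + 1))
  sleep 0.25
done
echo "issued $count queries"
```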
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 4:49 PM: --- h5. Test 1: Indexing Japanese Wikipedia In this test I'm only indexing documents -- no searching is being done. I've extracted text pretty accurately from Japanese Wikipedia and removed all the gory markup, so the content is clean. There are 1,443,764 documents in total, a mix of short and very long documents. These have been converted to files in Solr XML format with 1,000 documents per file. I'm running Solr simply using {noformat} java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar {noformat} so I'm not using any fancy GC options. I'm posting using {noformat} curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml {noformat} and committing after all the files have been posted with {noformat} curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>' {noformat} Posting the entire Wikipedia as one file would perhaps be a lot faster. Posting took {noformat} real 18m39.206s user 0m12.682s sys 0m11.065s {noformat} The GC log looks fine with a maximum GC time of 0.0187319 seconds. There wasn't even a full GC, probably due to the large heap size. However, if Kuromoji was generating garbage, I'd expect to see it here, since the input in XML format is 1.7GB and the Viterbi would generate data many, many times that size during tokenization. I'm attaching these files: || Filename || Description || |jawiki-index-gc.log| GC log | |jawiki-index-gcviewer.png| Screenshot from GCViewer | |jawiki-index-visualvm.png| Screenshot from VisualVM | Note that GCViewer had problems parsing the log file, so the data in the screenshot might be off.
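The pause figures quoted in these comments can be pulled out of a {{-verbose:gc}} log with grep and awk. A rough sketch: the log lines below are a made-up excerpt in the classic HotSpot 1.6-era format, and the field positions would need adjusting for other JVMs or GC flags:

```shell
# Fake excerpt of -verbose:gc output in the classic HotSpot format:
#   [Full GC 57558K->36262K(126912K), 0.2926001 secs]
cat > gc.log <<'EOF'
[GC 61311K->57558K(126912K), 0.0131031 secs]
[Full GC 57558K->36262K(126912K), 0.2926001 secs]
[GC 118817K->120759K(126912K), 0.0731031 secs]
[Full GC 120759K->37151K(126912K), 0.2948184 secs]
EOF

# Count Full GC events and report the worst pause; the pause time is the
# next-to-last whitespace-separated field on each line.
grep 'Full GC' gc.log |
  awk '{ if ($(NF-1) + 0 > max) max = $(NF-1) }
       END { printf "full_gc_count=%d max_pause=%s\n", NR, max }'
```

Run against a real log instead of the fake excerpt, this gives the event counts and longest pauses reported in these tests.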
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM: --- h5. Test 3 - Searching with highlighting (no indexing) This test is similar to _Test 2_ with highlighting turned on, but only ~62,000 queries were run. No indexing was done. Solr was run as follows {noformat} java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar {noformat} and - again - note the small heap size and regular GC options. The queries are of the form {noformat} /solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body {noformat} which is {noformat} /solr/select/?q=無料占い&hl=on&hl.fl=body {noformat} in unquoted form. Highlighting is turned on and we are highlighting on the body field. The test completed in 1648.1 seconds, 63,200 queries were run, and the sustained query rate was 47 QPS. Turning on highlighting has a fairly significant performance penalty if we compare QPS to the non-highlighting case, where we could sustain 142 QPS. There is also increased memory pressure with highlighting turned on. There were 652 Full GC events in total in the period, and the longest Full GC times are given below. || Longest Full GC times (seconds) || |0.9769069| |0.8564934| |0.7585956| |0.7084318| |0.6928327| |0.6781336| |0.6358398| |0.6099899| |0.5628532| |0.5540237| |0.5443075| |0.5429399| |0.5423989| |...| The extra memory pressure can also be seen in the VisualVM screenshot. I believe the root cause of this is the highlighting. || Attachment || Description || | 62k-queries-highlight-gc.log| GC log | | 62k-queries-highlight-visualvm.png| Screenshot from VisualVM |
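The highlighted-query URL from Test 3 can be assembled and checked without a running Solr. A small sketch: the host and port are the defaults from the commands in this thread, and the curl call is left commented because it needs the server up:

```shell
# 無料占い in the percent-encoded form used in this test
q='%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84'

# Highlighting parameters are joined with '&': hl=on turns highlighting on,
# hl.fl=body picks the field to highlight.
url="http://localhost:8983/solr/select/?q=${q}&hl=on&hl.fl=body"
echo "$url"

# curl -s "$url"   # would run the actual query against a live Solr
```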
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM:

h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese queries against the index without highlighting, using simple means. For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - a small/normal heap size to keep memory pressure a bit high, and no fancy GC options -- and all of Wikipedia searchable. Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds, and the test was roughly able to keep 80% of its queries within 0.5 second latency and serve a sustained load of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time ||
| Full GC 57558K-36262K(126912K) | 0.2926001 secs |
| Full GC 120759K-37151K(126912K) | 0.2948184 secs |
| Full GC 118817K-38305K(126912K) | 0.3726583 secs |
| Full GC 116992K-40203K(126912K) | 0.3688027 secs |
| Full GC 119572K-39070K(126912K) | 0.2896587 secs |
| Full GC 121476K-39257K(126912K) | 0.3034882 secs |
| Full GC 119659K-39451K(126912K) | 0.3078915 secs |
| Full GC 116948K-39770K(126912K) | 0.2407321 secs |
| Full GC 118382K-40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing 250k-queries-no-highlight-gc.log, so I'm not attaching a screenshot for this.
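The encoded and plain forms of the sample query above can be checked mechanically. A minimal standalone sketch (not part of Solr; the class name is illustrative) using the JDK's {{URLEncoder}}:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncoding {
    /** Percent-encodes a query parameter as UTF-8, as in the test URLs above. */
    public static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // 無料占い ("free fortune-telling"), the sample query from the test
        System.out.println("/solr/select/?q=" + encode("無料占い"));
        // → /solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
    }
}
```

Note that {{URLEncoder}} is form-encoding (it encodes spaces as {{+}}), which is fine for queries like this one that contain no spaces.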
Perform Kuromoji/Japanese stability test before 3.6 freeze
--
Key: SOLR-3282
URL: https://issues.apache.org/jira/browse/SOLR-3282
Project: Solr
Issue Type: Task
Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
Attachments: 250k-queries-no-highlight-gc.log, 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, jawiki-index-gcviewer.png, jawiki-index-visualvm.png

Kuromoji might be used by many, also in mission-critical systems, so I'd like to run a stability test before we freeze 3.6. My thinking is to test the out-of-the-box configuration using fieldtype {{text_ja}} as follows:
# Index all of the Japanese Wikipedia documents (approx. 1.4M documents) in a never-ending loop
# Simultaneously run many tens of thousands of typical Japanese queries against the index at 3-5 queries per second with highlighting turned on

While Solr is indexing and searching, I'd like to verify that:
* Indexing and queries are working as expected
* Memory and heap usage looks stable over time
* Garbage collection is overall low over time -- no Full GC issues

I'll post findings and results to this JIRA.
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 5:25 PM:

h3. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 queries were run. No indexing was done. Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - notice the small heap size and regular GC options. The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form. We have turned on highlighting, and we are highlighting on the body field.

The test completed in 1648.1 seconds, 63,200 queries were run, and the sustainable query rate was 47 QPS. Turning on highlighting has a fairly significant performance penalty if we compare QPS to the non-highlighting case, where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on. There were 652 Full GC events in total in the period, and the longest Full GC times are given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot. I believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log | GC log |
| 62k-queries-highlight-visualvm.png | Screenshot from VisualVM |

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
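The Full GC pause times quoted in these tests can be pulled out of the {{-verbose:gc}} output mechanically. A minimal sketch, assuming the classic HotSpot format ({{[Full GC 57558K->36262K(126912K), 0.2926001 secs]}} -- the arrows appear as dashes in the tables above due to mail formatting, and the exact format varies by JVM and version):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogScan {
    // Matches -verbose:gc lines like "[Full GC 57558K->36262K(126912K), 0.2926001 secs]"
    private static final Pattern FULL_GC =
        Pattern.compile("\\[Full GC (\\d+)K->(\\d+)K\\((\\d+)K\\), ([0-9.]+) secs\\]");

    /** Returns the Full GC pause in seconds, or -1 if the line is not a Full GC entry. */
    public static double fullGcPause(String line) {
        Matcher m = FULL_GC.matcher(line);
        return m.find() ? Double.parseDouble(m.group(4)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(fullGcPause("[Full GC 57558K->36262K(126912K), 0.2926001 secs]"));
        // regular (non-Full) GC entries are ignored:
        System.out.println(fullGcPause("[GC 118817K->38305K(126912K), 0.0731031 secs]"));
    }
}
```

Scanning a whole log with this and sorting the results descending gives the "longest Full GC" tables shown in these comments.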
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714 ] Christian Moen edited comment on SOLR-3282 at 3/27/12 5:46 PM:

h3. Test 4 - Combined search and indexing test

In this test, we are indexing all of Wikipedia while searching. The search rate is a constant 10 QPS. The queries in this test are identical to those run above, and they are also unique. Solr is started using

{noformat}
java -verbose:gc -Xmx256m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen in _Test 3_. The indexing posts the XML described in _Test 1_ - each file contains 1,000 documents, and - unlike _Test 1_ - we now do a commit after each post. No optimize is being done.

The test has now been running for 15 minutes, and I'll let it run for hours. I'll post details later. :)
[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze
[ https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714 ] Christian Moen edited comment on SOLR-3282 at 3/28/12 2:48 AM:

h3. Test 4 - Combined search and indexing test

In this test, we are indexing all of Wikipedia while searching. The search rate is a constant 10 QPS with highlighting. The queries in this test are identical to those run above, and they are also unique. Solr is started using

{noformat}
java -verbose:gc -Xmx256m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen in _Test 3_. The indexing posts the XML described in _Test 1_ - each file contains 1,000 documents, and - unlike _Test 1_ - we now do a commit after each post. No optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it, and 312,900 queries were run. Japanese Wikipedia was indexed 23 times. Full GC occurred 84 times, and the maximum heap size provided to the VM was allocated. The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|

The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable. It might be good to provide a little more maximum heap space than 256MB to index all of Japanese Wikipedia and serve 10 QPS, to have a little more headroom, but 256MB seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |
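The constant 10 QPS driver described above can be paced with simple deadline arithmetic rather than sleeping a fixed interval between requests (which drifts as request latency varies). A minimal sketch under that assumption; the class name is illustrative and the actual query issuing is omitted since it needs a running Solr:

```java
import java.util.concurrent.TimeUnit;

public class ConstantRatePacer {
    /** Nanosecond deadline for the i-th request at a fixed rate of qps requests/second. */
    public static long deadlineNanos(long startNanos, long i, long qps) {
        return startNanos + i * (TimeUnit.SECONDS.toNanos(1) / qps);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        for (long i = 0; i < 5; i++) {          // 5 iterations for the demo; the real test ran for hours
            long wait = deadlineNanos(start, i, 10) - System.nanoTime();
            if (wait > 0) TimeUnit.NANOSECONDS.sleep(wait);
            // the HTTP query against /solr/select/ would be issued here
            System.out.println("request " + i);
        }
    }
}
```

Anchoring each deadline to the original start time keeps the long-run rate at exactly 10 QPS even when individual requests are slow.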
[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253 ] Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:44 AM:

Hello, Kazu. Long time no see -- I hope things are well!

This is a very good feature request. I think this is possible by changing how we emit unknown words, i.e. by not emitting them as greedily, giving the lattice more segmentation options. For example, if we find an unknown word トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position. When we reach the position that starts with バッグ, we'll find a known word, and when the Viterbi runs, it's likely to choose トート and バッグ as the best path. Let me have a play by looking into the lattice details and see if something like this is feasible.
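The prefix emission described above amounts to adding every prefix of the greedily-matched unknown word as a lattice candidate. A minimal sketch of just that enumeration (standalone; not the actual Kuromoji lattice code, and the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class UnknownWordPrefixes {
    /** Emits every prefix of an unknown word, giving the lattice more segmentation options. */
    public static List<String> prefixes(String unknown) {
        List<String> out = new ArrayList<String>();
        for (int end = 1; end <= unknown.length(); end++) {
            out.add(unknown.substring(0, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // トートバッグ ("tote bag") → ト, トー, トート, トートバ, トートバッ, トートバッグ
        System.out.println(prefixes("トートバッグ"));
    }
}
```

With トート available as a candidate ending at the position where the known word バッグ begins, the Viterbi search can pick the トート + バッグ path if its cost wins.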
Add decompose compound Japanese Katakana token capability to Kuromoji
-
Key: LUCENE-3921
URL: https://issues.apache.org/jira/browse/LUCENE-3921
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 4.0
Environment: CentOS 5, IPA Dictionary, run with search mode
Reporter: Kazuaki Hiraga
Labels: features

The Japanese morphological analyzer Kuromoji doesn't have the capability to decompose every Japanese katakana compound token into sub-tokens. It seems that some katakana tokens can be decomposed, but this cannot be applied to every katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ (shoulder bag) don't decompose into トート バッグ and ショルダー バッグ, although the IPA dictionary has バッグ in its entry. I would like to apply the decompose feature to every katakana token whose sub-tokens are in the dictionary, or add the capability to force-apply the decompose feature to every katakana token.
[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253 ] Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:57 AM:

Hello, Kazu. Long time no see -- I hope things are well!

This is a very good feature request. I think this might be possible by changing how we emit unknown words, i.e. by not emitting them as greedily, giving the lattice more segmentation options. For example, if we find an unknown word トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position. When we reach the position that starts with バッグ, we'll find a known word. When the Viterbi runs, it's likely to choose トート and バッグ as its best path. Let me have a play by looking into the lattice details and see if something like this is feasible. We are sort of hacking the model here, so we also need to consider side effects.
[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji
[ https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239195#comment-13239195 ] Christian Moen edited comment on LUCENE-3921 at 3/27/12 5:32 AM:

I've been experimenting with the idea outlined above, and I thought I should share some very early results. The improvement here basically gives the compound-splitting heuristic an improved ability to split unknown words that are part of compounds.

Experiments I've run using our compound-splitting test cases suggest that the effect is indeed positive. The improved heuristic is able to handle some of the test cases we couldn't handle earlier, but all of this requires further experimentation and validation. I've been able to segment トートバッグ (tote bag, with トート being unknown) and also ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it also segmented エンジニアリング (engineering) into エンジニア (engineer) and リング (ring). It might be possible to tune this, or to develop a more advanced heuristic that remedies it, but I haven't had a chance to look further into this. Also, any change here would require extensive testing and validation. See the evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting katakana words that start with ン、ッ、ー, since we don't want tokens that start with these characters, and consider adding this as an option to the tokenizer if it works well.

Having said this, there are real limits to what we can achieve by hacking the statistical model (and it also affects our karma, you know...). The approach above also has a performance and memory impact. We'd need to introduce a fairly short limit on how long unknown words can be, and this could perhaps apply only to unknown katakana words. The length restriction would be large enough not to have any practical impact on segmentation, though.

An alternative approach to all of this is to build some lexical assets. I think we'd get pretty far for katakana if we apply some of the corpus-based compound-splitting algorithms European NLP researchers have developed. Some of these algorithms are pretty simple and quite effective.

Thoughts?
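The idea of skipping candidates that start with ン, ッ, or ー boils down to a small predicate applied when emitting unknown-word candidates. A minimal sketch (standalone; the class and method names are illustrative, not Kuromoji's actual API):

```java
public class KatakanaCandidateFilter {
    /**
     * Returns false for candidates a katakana token should never start with:
     * ン (syllabic n), ッ (small tsu), ー (prolonged sound mark).
     */
    public static boolean isViableStart(String candidate) {
        if (candidate.isEmpty()) return false;
        char c = candidate.charAt(0);
        return c != 'ン' && c != 'ッ' && c != 'ー';
    }

    public static void main(String[] args) {
        System.out.println(isViableStart("バッグ"));  // viable: starts with バ
        System.out.println(isViableStart("ッグ"));   // not viable: starts with small tsu
    }
}
```

Dropping these candidates shrinks the lattice a little and avoids ever proposing tokens that are phonotactically impossible as word starts in Japanese.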
[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants
[ https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237475#comment-13237475 ] Christian Moen edited comment on LUCENE-3901 at 3/24/12 8:10 AM:

Committed revision 1304727 on {{branch_3x}}. Fixed a small javadoc issue in 1304728.

Add katakana stem filter to better deal with certain katakana spelling variants
---
Key: LUCENE-3901
URL: https://issues.apache.org/jira/browse/LUCENE-3901
Project: Lucene - Java
Issue Type: New Feature
Components: modules/analysis
Reporter: Christian Moen
Assignee: Christian Moen
Fix For: 3.6, 4.0
Attachments: LUCENE-3901.patch, LUCENE-3901.patch, LUCENE-3901.patch

Many Japanese katakana words end in a long sound that is sometimes optional. For example, パーティー and パーティ are both perfectly valid for "party". Similarly, we have センター and センタ as variants of "center", as well as サーバー and サーバ for "server". I'm proposing that we add a katakana stemmer that removes this long sound if the terms are longer than a configurable length. It's also possible to add the variant as a synonym, but I think stemming is preferred from a ranking point of view.
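The stemming rule proposed above is small: drop a trailing prolonged sound mark (ー, U+30FC) when the term meets a configurable minimum length. A minimal sketch of the idea -- not the committed filter; the class name and the threshold of 4 are illustrative:

```java
public class KatakanaStem {
    static final char PROLONGED_SOUND_MARK = '\u30FC';  // ー

    /** Strips a trailing long-sound mark from terms of at least minLength characters. */
    public static String stem(String term, int minLength) {
        if (term.length() >= minLength
                && term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(stem("パーティー", 4)); // パーティ ("party")
        System.out.println(stem("サーバー", 4));   // サーバ ("server")
        System.out.println(stem("ミー", 4));      // unchanged: below the length threshold
    }
}
```

The length guard keeps short words, where the long sound is usually not optional, from being mangled.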
[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants
[ https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237475#comment-13237475 ] Christian Moen edited comment on LUCENE-3901 at 3/24/12 10:48 AM: -- Committed revision 1304727 on {{branch_3x}}. Fixed a small javadoc issue in revisions 1304728 and 1304741. was (Author: cm): Committed revision 1304727 on {{branch_3x}}. Fixed a small javadoc issue in 1304728.
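The long-sound stemming rule proposed in LUCENE-3901 is simple enough to sketch. The class name, method shape, and the minimum length of 4 below are illustrative assumptions, not the committed filter:

```java
// Sketch of the proposed katakana stemming rule: remove a trailing
// prolonged sound mark (ー, U+30FC) from terms that are at least
// minLength characters long, so パーティー and パーティ index identically.
// Names and the length threshold are illustrative assumptions.
public class KatakanaStemSketch {
    private static final char PROLONGED_SOUND_MARK = '\u30FC'; // ー

    public static String stem(String term, int minLength) {
        int len = term.length();
        if (len >= minLength && term.charAt(len - 1) == PROLONGED_SOUND_MARK) {
            return term.substring(0, len - 1);
        }
        return term;
    }
}
```

With a threshold of 4, サーバー becomes サーバ and パーティー becomes パーティ, while short terms like キー are left alone so the long sound is not stripped where it is distinctive.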
[jira] [Issue Comment Edited] (LUCENE-3819) Clean up what we show in right side bar of website.
[ https://issues.apache.org/jira/browse/LUCENE-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13213740#comment-13213740 ] Christian Moen edited comment on LUCENE-3819 at 2/22/12 4:21 PM: - +1, Mark. +1, Yonik. I also think it might be useful to have download shortcuts to Lucene Core (Java) and Solr available in the sidebar from http://lucene.apache.org/. Perhaps Download could be considered becoming a standard sidebar item for the subprojects? was (Author: cm): +1, Mark. +1, Yonik. I also think it might be useful to have download shortcuts to Lucene Core (Java) and Solr available in the sidebar from http://lucene.apache.org/. Perhaps Download could be considered to be a standard sidebar item. (I quite like the red download button, though! :)) Clean up what we show in right side bar of website. --- Key: LUCENE-3819 URL: https://issues.apache.org/jira/browse/LUCENE-3819 Project: Lucene - Java Issue Type: Improvement Reporter: Mark Miller Assignee: Mark Miller Priority: Minor I'd love to remove a couple things - it's pretty crowded on the right side bar. I find the latest JIRA and email displays are hard to read, tend to format badly, and don't offer much value. I'd like to remove them and just leave svn commits and twitter mentions (which are much easier to read and format better). Will help with some info overload on each page.
[jira] [Issue Comment Edited] (SOLR-3115) Improve default Japanese stopwords.txt description
[ https://issues.apache.org/jira/browse/SOLR-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204459#comment-13204459 ] Christian Moen edited comment on SOLR-3115 at 2/9/12 11:34 AM: --- A patch for {{trunk}} is attached with an improved description for Lucene ({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}). (The latter was synched using {{sync-analyzers}} -- useful!) was (Author: cm): A patch is attached with an improved description for Lucene ({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}). (The latter was synched using {{sync-analyzers}} -- useful!) Improve default Japanese stopwords.txt description -- Key: SOLR-3115 URL: https://issues.apache.org/jira/browse/SOLR-3115 Project: Solr Issue Type: Improvement Components: Rules Affects Versions: 3.6, 4.0 Reporter: Christian Moen Priority: Minor Attachments: SOLR-3115.patch As discussed in SOLR-3056, the description in the default Japanese stopwords.txt should be improved to describe case- and width-handling.
[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr
[ https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464 ] Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM: - I've updated the patch to now use a {{StopFilter}} that ignores case. was (Author: cm): Updated patch that now uses a {{StopFilter}} that ignores case. Align default Japanese configurations for Lucene and Solr - Key: LUCENE-3751 URL: https://issues.apache.org/jira/browse/LUCENE-3751 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Attachments: LUCENE-3751.patch, LUCENE-3751.patch, LUCENE-3751.patch The {{KuromojiAnalyzer}} in Lucene should have the same default configuration as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.
[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr
[ https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464 ] Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM: - I've updated the patch to now use a {{StopFilter}} that ignores case. I think this is good to go. was (Author: cm): I've updated the patch to now use a {{StopFilter}} that ignores case.
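The effect of a {{StopFilter}} that ignores case can be illustrated without the Lucene API (the names below are hypothetical): store the stop set lower-cased and lower-case each term before the lookup, so romaji stopwords match however users case them.

```java
import java.util.Locale;
import java.util.Set;

// Illustration (not the Lucene API) of case-insensitive stopping:
// the stop set is stored lower-cased and the incoming term is
// lower-cased before lookup.
public class CaseInsensitiveStopSketch {
    public static boolean isStopword(String term, Set<String> lowerCasedStops) {
        return lowerCasedStops.contains(term.toLowerCase(Locale.ROOT));
    }
}
```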
[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203616#comment-13203616 ] Christian Moen edited comment on SOLR-3056 at 2/8/12 2:05 PM: -- Thanks a lot, Robert. bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations. I created LUCENE-3751 with a patch earlier to make sure the default Lucene and Solr configurations are aligned. Sorry for not pointing this out clearly by linking the JIRAs. was (Author: cm): Thanks a lot, Robert. bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations. I created LUCENE-3751 with a patch earlier to make sure the default Lucene and Solr configurations are aligned. Sorry for not pointing this out clearly. Introduce Japanese field type in schema.xml --- Key: SOLR-3056 URL: https://issues.apache.org/jira/browse/SOLR-3056 Project: Solr Issue Type: New Feature Components: Schema and Analysis Affects Versions: 3.6, 4.0 Reporter: Christian Moen Attachments: SOLR-3056.patch, SOLR-3056_move.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch Kuromoji (LUCENE-3305) is now both on trunk and branch_3x (thanks again Robert, Uwe and Simon). It would be very good to get a default field type defined for Japanese in {{schema.xml}} so we can get good Japanese out-of-the-box support in Solr. I've been playing with the below configuration today, which I think is a reasonable starting point for Japanese. There's a lot to be said about various considerations necessary when searching Japanese, but perhaps a wiki page is more suitable to cover the wider topic? In order to make the below {{text_ja}} field type work, Kuromoji itself and its analyzers need to be seen by the Solr classloader. However, these are currently in contrib and I'm wondering if we should consider moving them to core to make them directly available.
If there are concerns with additional memory usage, etc. for non-Japanese users, we can make sure resources are loaded lazily and only when needed in factory-land. Any thoughts?

{code:xml}
<!-- Text field type suitable for Japanese text using morphological analysis

     NOTE: Please copy the files
       contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
       dist/apache-solr-analysis-extras-x.y.z.jar
     to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
     (x.y.z refers to a version number)

     If you would like to optimize for precision, set the default operator to AND
     with <solrQueryParser defaultOperator="AND"/> below (this file).
     Use OR if you would like to optimize for recall (default).
-->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- Kuromoji Japanese morphological analyzer/tokenizer

         Use search-mode to get a noun-decompounding effect useful for search.

         Example: 関西国際空港 (Kansai International Airport) becomes
         関西 (Kansai) 国際 (international) 空港 (airport), so we get a match
         for 空港 (airport) as we would expect from a good search engine.

         Valid values for mode are:
           normal:   default segmentation
           search:   segmentation useful for search (extra compound splitting)
           extended: search mode with unigramming of unknown words (experimental)

         NOTE: Search mode improves segmentation for search at the expense
         of part-of-speech accuracy.
    -->
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Optionally remove tokens with certain parts-of-speech
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/>
    -->
    <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Lower-cases romaji characters -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
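Once the {{text_ja}} field type is defined, using it is just a matter of declaring a field with it in {{schema.xml}}; the field name below is only an example:

{code:xml}
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
{code}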
[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203662#comment-13203662 ] Christian Moen edited comment on SOLR-3056 at 2/8/12 3:45 PM: -- Thanks, Robert. I was thinking to leave the {{StopFilter}} case-sensitive as I thought not having it normalized would give us flexibility, but it's also prone to error and surprises. I think it's reasonable to make the default ignore case to support adding English or other romaji terms to the stopset with ease. However, if we follow down this path, we might also want to do width-normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue and just document this clearly in the stopset file. I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, though, so that users won't need to reorder the chain in case they want case-sensitive stopping. I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} field type to ignore case in their {{StopFilter}} tomorrow. was (Author: cm): Thanks, Robert. I was thinking to leave the {{StopFilter}} case-sensitive as I thought not having it normalized would give us flexibility, but it's also prone to error and surprises. I think it's reasonable to make the default ignore case to support adding English or other romaji terms to the stopset with ease. However, if we follow down this path, we might also want to do width-normalization for the Japanese stopset to make sure there's no confusion with that, either. I suggest that we resolve that as a separate issue. I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, though, so that users won't need to reorder the chain in case they want case-sensitive stopping. I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} field type to ignore case in their {{StopFilter}} tomorrow.
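The width-normalization under discussion (full-width romaji to half-width, half-width kana to full-width) is essentially what Unicode NFKC does; {{CJKWidthFilterFactory}} is documented above as applying an NFKC subset. A quick way to see the effect from plain Java (the class and method names are illustrative):

```java
import java.text.Normalizer;

// Demonstrates the width normalization discussed above via Unicode NFKC:
// full-width romaji becomes half-width, and half-width katakana becomes
// full-width (with dakuten recombined).
public class WidthNormSketch {
    public static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

For example, Ｌｕｃｅｎｅ normalizes to Lucene, and half-width ｻｰﾊﾞｰ normalizes to full-width サーバー.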
[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml
[ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197952#comment-13197952 ] Christian Moen edited comment on SOLR-3056 at 2/1/12 5:06 PM: -- Robert, let's enable stop-words and stop-tags by default. The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these. Also, if we're using search mode, part-of-speech F will decrease so we might want to rely more on stopwords rather than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier -- and the part-of-speech tags we'd typically use as stop tags aren't involved with token-splits done by search mode, I don't expect this to be an issue, but it's something to keep in mind. I'll run some tests to verify this and follow up by suggesting configuration. I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configurations. was (Author: cm): Robert, Let's enable stop-words and stop-tags by default. The stopwords list in the Lucene analyzer looks too small unless it's always used in combination with a stoptags filter. I'll look into both of these. Also, if we're using search mode, part-of-speech F will decrease so we might want to rely more on stopwords rather than stoptags if it goes down by a whole lot. However, since tokens agree in 99.7% of the cases based on the tests I did earlier and the part-of-speech tags we'd typically use as stop tags aren't involved with tokens split by search mode, I don't expect this to be a real issue, but it's something to keep in mind. I'll do some testing to verify this and I'll follow up with further improvements to configuration. I'll also open up a separate JIRA for stopwords and stoptags, and aligning the Solr and Lucene default configuration. 