[jira] [Issue Comment Edited] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

2012-03-29 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324
 ] 

Christian Moen edited comment on LUCENE-3935 at 3/29/12 3:51 PM:
-

Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny 
with fancy encoding of data.  Very impressive!

I've attached a patch and verified that segmentation (surface forms only) 
matches exactly that of the two-dimensional array, based on approx. 100,000 
Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on 
{{ConnectionCosts.get()}} after the change.  The method is called very, very 
frequently on indexing, and its total CPU contribution is ~7-8% _after the 
change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the 
change, hoping that all the connection costs would be in very local cache, but 
this number looks correct to me.  It would be great to get your feedback on 
whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?


  was (Author: cm):
Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny 
with fancy encoding of data.  Very impressive!

I've attached a patch and verified that segmentation (surface forms only) 
matches exactly that of the two-dimensional array, based on approx. 100,000 
Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on 
{{ConnectionCosts.get()}} after the change.  The method is called very, very 
frequently on indexing, and its total CPU contribution is ~7-8% _after the 
change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the 
change, but this number looks correct to me.  It would be great to get your 
feedback on whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

  
 Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
 ---

 Key: LUCENE-3935
 URL: https://issues.apache.org/jira/browse/LUCENE-3935
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3935.patch


 I've been profiling Kuromoji, and not very surprisingly, the method 
 {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in 
 the Viterbi lattice is called many, many times and contributes more processing 
 time than I had expected.
 This method is currently backed by a {{short[][]}}.  The data structure stored 
 here is a two-dimensional array with both dimensions fixed at 1,316 elements.  
 (The data is {{matrix.def}} in MeCab-IPADIC.)
 We can rewrite this to use a single one-dimensional array instead, and we 
 will at least save one bounds check, a pointer reference, and we should also 
 get much better cache utilization since this structure is likely to be in 
 very local CPU cache.
 I think this will be a nice optimization.  Working on it... 
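The rewrite described above can be sketched roughly as follows. This is an illustrative version only, not the actual Lucene patch; the class name, constructor, and layout choice are assumptions:

```java
// Illustrative sketch of flattening a short[][] cost matrix into a single
// short[]: one multiply-add index computation and one array dereference per
// lookup instead of two, plus better cache locality.
public class FlatConnectionCosts {
    private final short[] costs;   // laid out as [backwardId * forwardSize + forwardId]
    private final int forwardSize;

    public FlatConnectionCosts(short[][] matrix) {
        this.forwardSize = matrix.length;
        int backwardSize = matrix[0].length;
        this.costs = new short[forwardSize * backwardSize];
        for (int forwardId = 0; forwardId < forwardSize; forwardId++) {
            for (int backwardId = 0; backwardId < backwardSize; backwardId++) {
                costs[backwardId * forwardSize + forwardId] = matrix[forwardId][backwardId];
            }
        }
    }

    // Replaces matrix[forwardId][backwardId] with a single flat-array lookup.
    public short get(int forwardId, int backwardId) {
        return costs[backwardId * forwardSize + forwardId];
    }
}
```

Either row-major or column-major layout works; which one wins depends on the access pattern of the Viterbi inner loop.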

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:00 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 3:59 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h3. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 3:59 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB of memory, a Core 
i7 CPU (4 cores), a 500GB SSD, and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.


  was (Author: cm):
*Setup*

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB of memory, a Core 
i7 CPU (4 cores), a 500GB SSD, and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:02 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: jawiki-index-gc.log, jawiki-index-gcviewer.png, 
 jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:21 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine, with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, probably due to the large heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.

  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: jawiki-index-gc.log, jawiki-index-gcviewer.png, 
 jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:25 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.
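As a quick sanity check of the encoding shown above, the percent-encoded query decodes to the Japanese text via the standard {{java.net.URLDecoder}} API (the helper class below is illustrative):

```java
// Decode a percent-encoded query string such as the one above.
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeQuery {
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e);  // UTF-8 is always supported
        }
    }
    // decode("%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84") returns 無料占い
}
```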

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.
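A pause figure like the maximum above can be pulled out of {{-verbose:gc}} output with a small helper along these lines; this is an illustrative sketch, not part of the test setup described here:

```java
// Extract the pause time (in seconds) from a classic HotSpot -verbose:gc
// log line, e.g. "[Full GC 57558K->36262K(126912K), 0.2926001 secs]".
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPause {
    private static final Pattern PAUSE = Pattern.compile("(\\d+\\.\\d+) secs");

    // Returns the pause in seconds, or -1.0 if the line has no pause entry.
    public static double pauseSeconds(String logLine) {
        Matcher m = PAUSE.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }
}
```

Scanning every line and keeping the running maximum gives the figure quoted above.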

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.

  was (Author: cm):
h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:26 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable (!)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.

  was (Author: cm):
h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

The queries are on the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K-36262K(126912K) | 0.2926001 secs |
| Full GC 120759K-37151K(126912K) | 0.2948184 secs |
| Full GC 118817K-38305K(126912K) | 0.3726583 secs |
| Full GC 116992K-40203K(126912K) | 0.3688027 secs |
| Full GC 119572K-39070K(126912K) | 0.2896587 secs |
| Full GC 121476K-39257K(126912K) | 0.3034882 secs |
| Full GC 119659K-39451K(126912K) | 0.3078915 secs |
| Full GC 116948K-39770K(126912K) | 0.2407321 secs |
| Full GC 118382K-40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most half or or 
less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | Screenshot from GCViewer |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:37 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB memory, a Core 
i7 CPU (4 cores), a 500GB SSD and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.  The default search field is 
body.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:38 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so -- a small, default heap size to keep memory pressure a bit high, no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds; the test was roughly able 
to keep 80% of its queries within 0.5-second latency while serving a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing 250k-queries-no-highlight-gc.log, so 
I'm not attaching a screenshot for it.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:49 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup, so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 
documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file would perhaps have been a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}
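A back-of-the-envelope throughput figure from the wall-clock time above (just 
arithmetic, no claim about per-document cost):

```python
# 1,443,764 documents were posted in 18m39.206s of wall-clock time.
docs = 1_443_764
wall_seconds = 18 * 60 + 39.206  # 1119.206 seconds
rate = docs / wall_seconds
print(round(rate))  # roughly 1290 documents indexed per second
```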

The GC log looks fine with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, likely due to the large heap size.  However, if 
Kuromoji were generating garbage, I'd expect to see it here, since the input in 
XML format is 1.7GB and the Viterbi would generate data many, many times that 
size during tokenization.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:59 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB memory, a Core 
i7 CPU (4 cores), a 500GB SSD and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.  The default search field is 
body.





[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:22 PM:
---

h3. Test 1 - Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup, so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 
documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file would perhaps have been a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, likely due to the large heap size.  However, if 
Kuromoji were generating garbage, I'd expect to see it here, since the input in 
XML format is 1.7GB and the Viterbi would generate data many, many times that 
size during tokenization.

I'm attaching these files

|| Attachment || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:21 PM:
---

h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so -- a small, default heap size to keep memory pressure a bit high, no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds; the test was roughly able 
to keep 80% of its queries within 0.5-second latency while serving a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing 250k-queries-no-highlight-gc.log, so 
I'm not attaching a screenshot for it.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM:
---

h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ but with highlighting turned on; only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and, again, note the small heap size and default GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.
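The full query string, including the highlighting parameters, can be reproduced with 
standard URL encoding.  A small Python illustration -- how the actual test client 
built its URLs isn't shown here, so this is just a consistency check:

```python
from urllib.parse import urlencode

# Build the query string for a highlighted search on the body field.
params = {"q": "無料占い", "hl": "on", "hl.fl": "body"}
query = urlencode(params)  # UTF-8 percent-encodes the term, joins with &
print("/solr/select/?" + query)
```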

We have turned on highlighting, applied to the body field.

The test completed in 1648.1 seconds, 63,200 queries were run, and the 
sustained query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total during the period, and the longest Full GC times 
are given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |

  was (Author: cm):
h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - notice a small heap size and regular GC options.

The queries are on the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84hl=onhl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占いhl=onhl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completes in 1648.1 seconds and 63200 queries were run and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period and the longest Full GC times is 
given below. 

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM:
---

h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.
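
The percent-encoded and plain forms above are related by standard UTF-8 URL encoding, which a quick sketch can confirm (the class name here is just illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Checks that the percent-encoded query string above is simply the
// UTF-8 URL encoding of the plain form 無料占い.
public class QueryEncoding {
    static String encode(String query) {
        return URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("無料占い"));
        // %E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
    }
}
```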

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
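
The "longest Full GC" figures quoted in these comments can be pulled straight out of the {{-verbose:gc}} output. A minimal sketch, assuming the classic HotSpot log line shape {{[Full GC 57558K->36262K(126912K), 0.2926001 secs]}} (the exact format varies across JVM versions and flags):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts Full GC pause times (in seconds) from -verbose:gc output and
// sorts them longest-first, mirroring the tables in these comments.
public class FullGcPauses {
    private static final Pattern FULL_GC =
        Pattern.compile("Full GC \\d+K->\\d+K\\(\\d+K\\), ([0-9.]+) secs");

    static List<Double> longestFirst(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FULL_GC.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        pauses.sort(Collections.reverseOrder());  // longest pause first
        return pauses;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[GC 61184K->37295K(126912K), 0.0098451 secs]",
            "[Full GC 57558K->36262K(126912K), 0.2926001 secs]",
            "[Full GC 120759K->37151K(126912K), 0.2948184 secs]");
        System.out.println(longestFirst(sample));
        // [0.2948184, 0.2926001]
    }
}
```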

  was (Author: cm):
h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | Screenshot from GCViewer |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While 

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:25 PM:
---

h3. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - note the small heap size and regular GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completed in 1648.1 seconds, 63200 queries were run, and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period, and the longest Full GC times are 
given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |

  was (Author: cm):
h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - note the small heap size and regular GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completed in 1648.1 seconds, 63200 queries were run, and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period, and the longest Full GC times are 
given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:46 PM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
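
The load generator itself isn't included in this issue, so the following is only an assumption about how a fixed 10 QPS rate could be driven; the query-sending body is left as a stub:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Drives a task at a fixed rate (here 10 QPS) using scheduleAtFixedRate.
// The task body is a stand-in; in the real test it would issue the Solr
// requests shown above.
public class ConstantRateLoad {
    static long intervalMicros(int qps) {
        return 1_000_000L / qps;  // interval between task starts
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger sent = new AtomicInteger();
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            sent::incrementAndGet,            // stand-in for one query
            0, intervalMicros(10), TimeUnit.MICROSECONDS);
        TimeUnit.SECONDS.sleep(1);            // run briefly for demonstration
        scheduler.shutdownNow();
        System.out.println("queries sent: " + sent.get());  // roughly 10
    }
}
```

{{scheduleAtFixedRate}} keeps the average rate constant even when individual requests are slow, which matches the "constant 10 QPS" description better than a sleep after each request would.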

  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching at a 
constant 10 QPS rate.  The queries in this test are identical to those run above

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/28/12 2:37 AM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |




  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png, 
 long-query-indexing-gc.log, long-search-indexing-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/28/12 2:48 AM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS with highlighting.  The queries in this 
test are identical to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |




  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |



  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png, 
 long-query-indexing-gc.log, long-search-indexing-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * 

[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:44 AM:
--

Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.
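
The emission idea above can be sketched as enumerating every prefix of the greedy unknown-word match and offering each to the lattice; the method name is illustrative and not actual Kuromoji API:

```java
import java.util.ArrayList;
import java.util.List;

// Emits every prefix of an unknown word so the lattice has shorter
// segmentation options available, instead of only the greedy full match.
public class UnknownWordPrefixes {
    static List<String> prefixes(String unknownWord) {
        List<String> candidates = new ArrayList<>();
        for (int end = 1; end <= unknownWord.length(); end++) {
            candidates.add(unknownWord.substring(0, end));
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("トートバッグ"));
        // [ト, トー, トート, トートバ, トートバッ, トートバッグ]
    }
}
```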

  was (Author: cm):
Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a look at this by looking into the lattice details.
  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0
 Environment: CentOS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have the capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but this cannot be applied to 
 every Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ in its entries.  I would like to apply the decompose feature to every 
 Katakana token whose sub-tokens are in the dictionary, or add the capability 
 to force the decompose feature on every Katakana token.




[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:57 AM:
--

Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this might be possible by changing 
how we emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ we'll 
find a known word.  When the Viterbi runs, it's likely to choose トート and バッグ as 
its best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.  We are sort of hacking the model here so we also need 
to consider side-effects.

  was (Author: cm):
Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.
  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0
 Environment: CentOS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have the capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but this cannot be applied to 
 every Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ in its entries.  I would like to apply the decompose feature to every 
 Katakana token whose sub-tokens are in the dictionary, or add the capability 
 to force the decompose feature on every Katakana token.




[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239195#comment-13239195
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/27/12 5:32 AM:
-

I've been experimenting with the idea outlined above and I thought I should 
share some very early results.

The improvement here is basically to give the compound splitting heuristic an 
improved ability to split unknown words that are part of compounds.  
Experiments I've run using our compound splitting test cases suggest that 
the effect is indeed positive.  The improved heuristic is able to handle some 
of the test cases that we couldn't handle earlier, but all of this requires 
further experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also 
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it 
also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or develop a more advanced heuristic 
that remedies this, but I haven't had a chance to look further into this.  
Also, any change here would require extensive testing and validation.  See the 
evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, 
but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting 
katakana words that start with ン、ッ、ー, since we don't want tokens that start 
with these characters, and consider adding this as an option to the tokenizer 
if it works well.

Having said this, there are real limits to what we can achieve by hacking the 
statistical model (and it also affects our karma, you know...).  The approach 
above also has performance and memory impact.  We'd need to introduce a fairly 
short limit on how long unknown words can be, and this could perhaps apply only 
to unknown katakana words. The length restriction will be big enough to not 
have any practical impact on segmentation, though.

An alternative approach to all of this is to build some lexical assets.  I 
think we'd get pretty far for katakana if we apply some of the corpus-based 
compound-splitting algorithms European NLP researchers have developed.  Some of 
these algorithms are pretty simple and quite effective.
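For instance, one simple scheme from that literature, a corpus-frequency splitter that accepts a split when the geometric mean of the parts' frequencies beats the frequency of the whole word, could be sketched as follows; the frequency counts are invented for illustration:

```python
# Hypothetical sketch: split a compound at the position where the geometric
# mean of the parts' corpus frequencies beats the whole word's frequency.
# The counts below are made up; a real splitter would use corpus statistics.
def best_split(word, freq, min_part=2):
    best_parts, best_score = [word], float(freq.get(word, 0))
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = (freq[left] * freq[right]) ** 0.5  # geometric mean
            if score > best_score:
                best_parts, best_score = [left, right], score
    return best_parts

freq = {"ショルダー": 500, "バッグ": 800, "ショルダーバッグ": 20}
```

Here sqrt(500 * 800) is roughly 632, which beats the whole-word count of 20, so ショルダーバッグ is split into ショルダー and バッグ; words with no known parts are left whole.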

Thoughts?


  was (Author: cm):
I've been experimenting with the idea outlined above and I thought I should 
share some very early results.

The improvement here is basically to give the compound splitting heuristic an 
improved ability to split unknown words that are part of compounds.  
Experiments I've run using using our compound splitting test cases suggest that 
the effect is indeed positive.  The improved heuristic is able to handle some 
of the test case that we couldn't do earlier, but all of this requires further 
experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also 
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it 
also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or developer a more advanced heuristic 
that remedies this, but I haven't had a chance to look further into this.  
Also, any change here would require extensive testing and validation.  See the 
evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, 
but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting 
katakana words that start with ン、ッ、ー since we don't want tokens that start with 
these characters and consider adding this as an option to the tokenizer if it 
works well.

Having said this, there are real limits to what we can achieve by hacking the 
statistical model (and it also affects our karma, you know...).  The approach 
above also has performance and memory impact.  We'd need to introduce a fairly 
short limits to how long unknown words can be and this can perhaps only apply 
to unknown katakana words. The length restriction will be big enough to not 
have any practical impact on segmentation, though.

An alternative approach to all of this is to build some lexical assets.  I 
think we'd get pretty far for katakana if we apply some of the corpus-based 
compound-splitting algorithms Europeans NLP researchers have developed.  These 
algorithms are simple and quite effective.

Thoughts?

  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: 

[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants

2012-03-24 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237475#comment-13237475
 ] 

Christian Moen edited comment on LUCENE-3901 at 3/24/12 8:10 AM:
-

Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue in 
1304728.

  was (Author: cm):
Committed revision 1304727 on {{branch_3x}}.
  
 Add katakana stem filter to better deal with certain katakana spelling 
 variants
 ---

 Key: LUCENE-3901
 URL: https://issues.apache.org/jira/browse/LUCENE-3901
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
Assignee: Christian Moen
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3901.patch, LUCENE-3901.patch, LUCENE-3901.patch


 Many Japanese katakana words end in a long sound that is sometimes optional.
 For example, パーティー and パーティ are both perfectly valid for party.  Similarly 
 we have センター and センタ that are variants of center as well as サーバー and サーバ 
 for server.
 I'm proposing that we add a katakana stemmer that removes this long sound if 
 the terms are longer than a configurable length.  It's also possible to add 
 the variant as a synonym, but I think stemming is preferred from a ranking 
 point of view.
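A minimal sketch of the proposed stemming rule (plain Python, not the actual filter; the default minimum length of 3 is an assumption for illustration):

```python
# Hypothetical sketch of the katakana stemmer proposed above: remove a
# trailing long-vowel mark (ー) when the katakana term is longer than a
# configurable minimum length. The default of 3 is an assumed value.
def stem_katakana(term, min_length=3):
    if len(term) > min_length and term.endswith("ー"):
        return term[:-1]  # drop the trailing long sound
    return term
```

With this rule パーティー stems to パーティ and サーバー to サーバ, while short terms like キー are left untouched.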




[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants

2012-03-24 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237475#comment-13237475
 ] 

Christian Moen edited comment on LUCENE-3901 at 3/24/12 10:48 AM:
--

Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue in 
revisions 1304728 and 1304741.

  was (Author: cm):
Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue 
in 1304728.
  
 Add katakana stem filter to better deal with certain katakana spelling 
 variants
 ---

 Key: LUCENE-3901
 URL: https://issues.apache.org/jira/browse/LUCENE-3901
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
Assignee: Christian Moen
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3901.patch, LUCENE-3901.patch, LUCENE-3901.patch






[jira] [Issue Comment Edited] (LUCENE-3819) Clean up what we show in right side bar of website.

2012-02-22 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13213740#comment-13213740
 ] 

Christian Moen edited comment on LUCENE-3819 at 2/22/12 4:21 PM:
-

+1, Mark.  +1, Yonik.  

I also think it might be useful to have download shortcuts to Lucene Core 
(Java) and Solr available in the sidebar from http://lucene.apache.org/.  
Perhaps Download could be made a standard sidebar item for the 
subprojects?

 

  was (Author: cm):
+1, Mark.  +1, Yonik.  

I also think it might be useful to have download shortcuts to Lucene Core 
(Java) and Solr available in the sidebar from http://lucene.apache.org/.  
Perhaps Download could be considered to be a standard sidebar item.  (I quite 
like the red download button, though! :))

 
  
 Clean up what we show in right side bar of website.
 ---

 Key: LUCENE-3819
 URL: https://issues.apache.org/jira/browse/LUCENE-3819
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor

 I'd love to remove a couple things - it's pretty crowded on the right side 
 bar. I find the latest JIRA and email displays are hard to read, tend to 
 format badly, and don't offer much value.
 I'd like to remove them and just leave svn commits and twitter mentions 
 (which are much easier to read and format better). Will help with some info 
 overload on each page.




[jira] [Issue Comment Edited] (SOLR-3115) Improve default Japanese stopwords.txt description

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204459#comment-13204459
 ] 

Christian Moen edited comment on SOLR-3115 at 2/9/12 11:34 AM:
---

A patch for {{trunk}} is attached with an improved description for Lucene 
({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}).  (The latter was synched 
using {{sync-analyzers}} -- useful!)

  was (Author: cm):
A patch is attached with an improved description for Lucene 
({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}).  (The latter was synched 
using {{sync-analyzers}} -- useful!)
  
 Improve default Japanese stopwords.txt description
 --

 Key: SOLR-3115
 URL: https://issues.apache.org/jira/browse/SOLR-3115
 Project: Solr
  Issue Type: Improvement
  Components: Rules
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Priority: Minor
 Attachments: SOLR-3115.patch


 As discussed in SOLR-3056, the description in the default Japanese 
 stopwords.txt should be improved to describe case- and width-handling.




[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464
 ] 

Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM:
-

I've updated the patch to now use a {{StopFilter}} that ignores case.

  was (Author: cm):
Updated patch that now uses a {{StopFilter}} that ignores case.
  
 Align default Japanese configurations for Lucene and Solr
 -

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3751.patch, LUCENE-3751.patch, LUCENE-3751.patch


 The {{KuromojiAnalyzer}} in Lucene should have the same default configuration 
 as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.




[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464
 ] 

Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM:
-

I've updated the patch to now use a {{StopFilter}} that ignores case.  I think 
this is good to go.

  was (Author: cm):
I've updated the patch to now use a {{StopFilter}} that ignores case.
  
 Align default Japanese configurations for Lucene and Solr
 -

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3751.patch, LUCENE-3751.patch, LUCENE-3751.patch


 The {{KuromojiAnalyzer}} in Lucene should have the same default configuration 
 as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.




[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-08 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203616#comment-13203616
 ] 

Christian Moen edited comment on SOLR-3056 at 2/8/12 2:05 PM:
--

Thanks a lot, Robert.

bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configurations.

I created LUCENE-3751 with a patch earlier to make sure the default Lucene and 
Solr configurations are aligned.  Sorry for not pointing this out clearly by 
linking the JIRAs.

  was (Author: cm):
Thanks a lot, Robert.

bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configurations.

I created LUCENE-3751 with a patch earlier make sure the default Lucene and 
Solr configurations are aligned.  Sorry for not pointing this out clearly.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056.patch, SOLR-3056_move.patch, 
 SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch


 Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
 Robert, Uwe and Simon). It would be very good to get a default field type 
 defined for Japanese in {{schema.xml}} so we get good Japanese out-of-the-box 
 support in Solr.
 I've been playing with the below configuration today, which I think is a 
 reasonable starting point for Japanese.  There's a lot to be said about the 
 various considerations necessary when searching Japanese, but perhaps a wiki 
 page is more suitable to cover the wider topic?
 In order to make the below {{text_ja}} field type work, Kuromoji itself and 
 its analyzers need to be visible to the Solr classloader.  However, these are 
 currently in contrib and I'm wondering if we should consider moving them to 
 core to make them directly available.  If there are concerns about additional 
 memory usage, etc. for non-Japanese users, we can make sure resources are 
 loaded lazily and only when needed in factory-land.
 Any thoughts?
 {code:xml}
 <!-- Text field type suitable for Japanese text using morphological analysis
      NOTE: Please copy the files
        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
        dist/apache-solr-analysis-extras-x.y.z.jar
      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
      (x.y.z refers to a version number)
      If you would like to optimize for precision, set the default operator to
      AND with
        <solrQueryParser defaultOperator="AND"/>
      below (this file).  Use OR if you would like to optimize for recall
      (default).
 -->
 <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
            autoGeneratePhraseQueries="false">
   <analyzer>
     <!-- Kuromoji Japanese morphological analyzer/tokenizer
          Use search mode to get a noun-decompounding effect useful for search.
          Example:
            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai)
            国際 (international) 空港 (airport), so we get a match for 空港
            (airport) as we would expect from a good search engine.
          Valid values for mode are:
            normal:   default segmentation
            search:   segmentation useful for search (extra compound splitting)
            extended: search mode with unigramming of unknown words
                      (experimental)
          NOTE: Search mode improves segmentation for search at the expense of
          part-of-speech accuracy.
     -->
     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
     <!-- Reduces inflected verbs and adjectives to their base/dictionary
          forms (辞書形) -->
     <filter class="solr.KuromojiBaseFormFilterFactory"/>
     <!-- Optionally remove tokens with certain parts of speech
     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
             tags="stopTags.txt" enablePositionIncrements="true"/> -->
     <!-- Normalizes full-width romaji to half-width and half-width kana to
          full-width (Unicode NFKC subset) -->
     <filter class="solr.CJKWidthFilterFactory"/>
     <!-- Lower-cases romaji characters -->
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 {code}


[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-08 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203662#comment-13203662
 ] 

Christian Moen edited comment on SOLR-3056 at 2/8/12 3:45 PM:
--

Thanks, Robert.

I was thinking to leave the {{StopFilter}} case-sensitive as I thought not 
having it normalized would give us flexibility, but it's also prone to error 
and surprises.  I think it's reasonable to make the default ignore case to 
support adding English or other romaji terms to the stopset with ease.

However, if we go down this path, we might also want to do 
width-normalization for the Japanese stopset to make sure there's no confusion 
with that, either.  I suggest that we resolve that as a separate issue and just 
document this clearly in the stopset file.

I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, 
though, so that users won't need to reorder the chain in case they want 
case-sensitive stopping.

I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} 
field type to ignore case in their {{StopFilter}} tomorrow.
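To illustrate the width-handling concern: Unicode NFKC normalization (of which the width folding done by {{CJKWidthFilter}} is a subset) folds full-width romaji to half-width and half-width kana to full-width, so a stopset entry written in the "wrong" width would silently miss the normalized token stream. A quick check in Python, using NFKC as a stand-in:

```python
# Width normalization as discussed above, demonstrated via NFKC (a superset
# of what CJKWidthFilter does): full-width romaji folds to half-width, and
# half-width katakana folds to full-width.
import unicodedata

def width_fold(s):
    return unicodedata.normalize("NFKC", s)
```

For example, full-width Ｌｕｃｅｎｅ folds to Lucene, and half-width ｻｰﾊﾞｰ folds to サーバー.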

  was (Author: cm):
Thanks, Robert.

I was thinking to leave the {{StopFilter}} case-sensitive as I thought not 
having it normalized would give us flexibility, but it's also prone to error 
and surprises.  I think it's reasonable to do make the default ignore case to 
support adding English or other romaji terms to the stopset with ease.

However, if we following down this path path, we might also want to do 
width-normalization for the Japanese stopset to make sure there's no confusion 
with that, either.  I suggest that we resolve that as a separate issue.

I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, 
though, so that users won't need to reorder the chain in case they want 
case-sensitive stopping.

I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} 
field type to ignore case in their {{StopFilter}} tomorrow.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056.patch, SOLR-3056_move.patch, 
 SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch



[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-01 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197952#comment-13197952
 ] 

Christian Moen edited comment on SOLR-3056 at 2/1/12 5:06 PM:
--

Robert, let's enable stop-words and stop-tags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always 
used in combination with a stoptags filter.  I'll look into both of these.

Also, if we're using search mode, part-of-speech F-score will decrease, so we 
might want to rely more on stopwords than on stoptags if it goes down by a 
whole lot.  However, since tokens agree in 99.7% of the cases based on the 
tests I did earlier -- and the part-of-speech tags we'd typically use as stop 
tags aren't involved with the token splits done by search mode -- I don't 
expect this to be an issue, but it's something to keep in mind.

I'll run some tests to verify this and follow up by suggesting configuration.

I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr 
and Lucene default configurations.

  was (Author: cm):
Robert, Let's enable stop-words and stop-tags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always 
used in combination with a stoptags filter.  I'll look into both of these.

Also, if we're using search mode, part-of-speech F will decrease so we might 
want to rely more on stopwords rather than stoptags if it goes down by a whole 
lot.  However, since tokens agree in 99.7% of the cases based on the tests I 
did earlier and the part-of-speech tags we'd typically use as stop tags aren't 
involved with tokens split by search mode, I don't expect this to be a real 
issue, but it's something to keep in mind.

I'll do some testing to verify this and I'll follow up with further 
improvements to configuration.

I'll also open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configuration.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056_move.patch, SOLR-3056_schema40.patch, 
 SOLR-3056_schema40.patch

