[jira] [Issue Comment Edited] (LUCENE-3935) Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method

2012-03-29 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241324#comment-13241324
 ] 

Christian Moen edited comment on LUCENE-3935 at 3/29/12 3:51 PM:
-

Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny 
with fancy encoding of data.  Very impressive!

I've attached a patch and verified that segmentation (surface forms only) 
matches exactly that of the two-dimensional array, based on approx. 100,000 
Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on 
{{ConnectionCosts.get()}} after the change.  The method is called very, very 
frequently on indexing, and its total CPU contribution is ~7-8% _after the 
change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the 
change, hoping that all the connection costs would be in very local cache, but 
this number looks correct to me.  It would be great to get your feedback on 
whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?


  was (Author: cm):
Thanks.

Robert has done a great job making the binary version of {{matrix.def}} tiny 
with fancy encoding of data.  Very impressive!

I've attached a patch and verified that segmentation (surface forms only) 
matches exactly that of the two-dimensional array, based on approx. 100,000 
Wikipedia articles with XML markup and all, totaling 880MB of data.

Profiling tells me we get a 13% increase in performance on 
{{ConnectionCosts.get()}} after the change.  The method is called very, very 
frequently on indexing, and its total CPU contribution is ~7-8% _after the 
change_, so the net improvement here is not more than a couple of percent.

I was expecting more than a 13% increase in this method's performance after the 
change, but this number looks correct to me.  It would be great to get your 
feedback on whether this is in line with expectations, Dawid and Robert.

Do we still want to apply this?

  
 Optimize Kuromoji inner loop - rewrite ConnectionCosts.get() method
 ---

 Key: LUCENE-3935
 URL: https://issues.apache.org/jira/browse/LUCENE-3935
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3935.patch


 I've been profiling Kuromoji, and not very surprisingly, the method 
 {{ConnectionCosts.get(int forwardId, int backwardId)}} that looks up costs in 
 the Viterbi lattice is called many, many times and contributes more processing 
 time than I had expected.
 This method is currently backed by a {{short[][]}}.  The data structure stored 
 here is a two-dimensional array with both dimensions fixed at 1,316 elements.  
 (The data is {{matrix.def}} in MeCab-IPADIC.)
 We can rewrite this to use a single one-dimensional array instead, and we 
 will at least save one bounds check, a pointer reference, and we should also 
 get much better cache utilization since this structure is likely to be in 
 very local CPU cache.
 I think this will be a nice optimization.  Working on it... 
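The rewrite described above can be sketched roughly as follows. This is an illustrative version only, not the actual Lucene patch; the class name, constructor, and layout choice are assumptions:

```java
// Illustrative sketch of flattening a short[][] cost matrix into a single
// short[]: one multiply-add index computation and one array dereference per
// lookup instead of two, plus better cache locality.
public class FlatConnectionCosts {
    private final short[] costs;   // laid out as [backwardId * forwardSize + forwardId]
    private final int forwardSize;

    public FlatConnectionCosts(short[][] matrix) {
        this.forwardSize = matrix.length;
        int backwardSize = matrix[0].length;
        this.costs = new short[forwardSize * backwardSize];
        for (int forwardId = 0; forwardId < forwardSize; forwardId++) {
            for (int backwardId = 0; backwardId < backwardSize; backwardId++) {
                costs[backwardId * forwardSize + forwardId] = matrix[forwardId][backwardId];
            }
        }
    }

    // Replaces matrix[forwardId][backwardId] with a single flat-array lookup.
    public short get(int forwardId, int backwardId) {
        return costs[backwardId * forwardSize + forwardId];
    }
}
```

Either row-major or column-major layout works; which one wins depends on the access pattern of the Viterbi inner loop.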

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:00 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 3:59 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h3. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 3:59 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB of memory, a Core 
i7 CPU (4 cores), a 500GB SSD, and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.


  was (Author: cm):
*Setup*

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB of memory, a Core 
i7 CPU (4 cores), a 500GB SSD, and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen

 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:02 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 


  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: jawiki-index-gc.log, jawiki-index-gcviewer.png, 
 jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:21 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine, with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, probably due to the large heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.

  was (Author: cm):
h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 documents 
per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file is perhaps a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine.  There wasn't even a full GC, probably due to the large 
heap size.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: jawiki-index-gc.log, jawiki-index-gcviewer.png, 
 jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:25 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.
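As a quick sanity check of the encoding shown above, the percent-encoded query decodes to the Japanese text via the standard {{java.net.URLDecoder}} API (the helper class below is illustrative):

```java
// Decode a percent-encoded query string such as the one above.
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeQuery {
    public static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e);  // UTF-8 is always supported
        }
    }
    // decode("%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84") returns 無料占い
}
```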

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.
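A pause figure like the maximum above can be pulled out of {{-verbose:gc}} output with a small helper along these lines; this is an illustrative sketch, not part of the test setup described here:

```java
// Extract the pause time (in seconds) from a classic HotSpot -verbose:gc
// log line, e.g. "[Full GC 57558K->36262K(126912K), 0.2926001 secs]".
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPause {
    private static final Pattern PAUSE = Pattern.compile("(\\d+\\.\\d+) secs");

    // Returns the pause in seconds, or -1.0 if the line has no pause entry.
    public static double pauseSeconds(String logLine) {
        Matcher m = PAUSE.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }
}
```

Scanning every line and keeping the running maximum gives the figure quoted above.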

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.

  was (Author: cm):
h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:26 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable (!)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.

  was (Author: cm):
h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size and no fancy GC options (and all of Wikipedia 
searchable)

The queries are on the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K-36262K(126912K) | 0.2926001 secs |
| Full GC 120759K-37151K(126912K) | 0.2948184 secs |
| Full GC 118817K-38305K(126912K) | 0.3726583 secs |
| Full GC 116992K-40203K(126912K) | 0.3688027 secs |
| Full GC 119572K-39070K(126912K) | 0.2896587 secs |
| Full GC 121476K-39257K(126912K) | 0.3034882 secs |
| Full GC 119659K-39451K(126912K) | 0.3078915 secs |
| Full GC 116948K-39770K(126912K) | 0.2407321 secs |
| Full GC 118382K-40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most half or or 
less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | Screenshot from GCViewer |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:37 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB memory, a Core 
i7 CPU (4 cores), a 500GB SSD and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.  The default search field is 
body.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:38 PM:
---

h5. Test 2: Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so -- a small, default heap size to keep memory pressure a bit high, no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds; the test was roughly able 
to keep 80% of its queries within 0.5-second latency while serving a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Filename || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing 250k-queries-no-highlight-gc.log, so 
I'm not attaching a screenshot for it.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:49 PM:
---

h5. Test 1: Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup, so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 
documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file would perhaps have been a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}
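A back-of-the-envelope throughput figure from the wall-clock time above (just 
arithmetic, no claim about per-document cost):

```python
# 1,443,764 documents were posted in 18m39.206s of wall-clock time.
docs = 1_443_764
wall_seconds = 18 * 60 + 39.206  # 1119.206 seconds
rate = docs / wall_seconds
print(round(rate))  # roughly 1290 documents indexed per second
```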

The GC log looks fine with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, likely due to the large heap size.  However, if 
Kuromoji were generating garbage, I'd expect to see it here, since the input in 
XML format is 1.7GB and the Viterbi would generate data many, many times that 
size during tokenization.

I'm attaching these files

|| Filename || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239577#comment-13239577
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 4:59 PM:
---

h5. Test setup

My setup is a MacBook Pro running Mac OS X Lion (10.7) with 8GB memory, a Core 
i7 CPU (4 cores), a 500GB SSD and too many things running.  (The purpose of the 
test is to test stability and not to provide accurate performance numbers, 
although I also hope to do that.)

My java is as follows:

{noformat}
[cm@ayu:~] java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11M3527)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
{noformat}

I've added fields body and title to {{schema.xml}} and they're using the 
default Japanese configuration in {{text_ja}}.  The default search field is 
body.





[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239579#comment-13239579
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:22 PM:
---

h3. Test 1 - Indexing Japanese Wikipedia

In this test I'm only indexing documents -- no searching is being done.

I've extracted text pretty accurately from Japanese Wikipedia and removed all 
the gory markup, so the content is clean.  There are 1,443,764 documents in 
total, a mix of short and very long documents.

These have been converted to files in Solr XML format, with 1,000 
documents per file.

I'm running my Solr simply using

{noformat}
java -verbose:gc -Xmx512m -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I'm not using any fancy GC options.

I'm posting using 

{noformat}
curl -s http://localhost:8983/solr/update -H 'Content-type:text/xml; 
charset=UTF-8' --data-binary @solrxml/SolrXml-171.xml
{noformat}

and committing after all the files have been posted with

{noformat}
curl -s http://localhost:8983/solr/update -F 'stream.body=<commit/>'
{noformat}

Posting the entire Wikipedia in one file would perhaps have been a lot faster.

Posting took

{noformat}
real    18m39.206s
user    0m12.682s
sys     0m11.065s
{noformat}

The GC log looks fine with a maximum GC time of 0.0187319 seconds.  There 
wasn't even a full GC, likely due to the large heap size.  However, if 
Kuromoji were generating garbage, I'd expect to see it here, since the input in 
XML format is 1.7GB and the Viterbi would generate data many, many times that 
size during tokenization.

I'm attaching these files

|| Attachment || Description ||
|jawiki-index-gc.log| GC log |
|jawiki-index-gcviewer.png| Screenshot from GCViewer |
|jawiki-index-visualvm.png| Screenshot from VisualVM | 

Note that GCViewer had problems parsing the log file so the data in the 
screenshot might be off.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:21 PM:
---

h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so -- a small, default heap size to keep memory pressure a bit high, no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds; the test was roughly able 
to keep 80% of its queries within 0.5-second latency while serving a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing 250k-queries-no-highlight-gc.log, so 
I'm not attaching a screenshot for it.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM:
---

h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ but with highlighting turned on; only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and, again, note the small heap size and default GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.
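The full query string, including the highlighting parameters, can be reproduced with 
standard URL encoding.  A small Python illustration -- how the actual test client 
built its URLs isn't shown here, so this is just a consistency check:

```python
from urllib.parse import urlencode

# Build the query string for a highlighted search on the body field.
params = {"q": "無料占い", "hl": "on", "hl.fl": "body"}
query = urlencode(params)  # UTF-8 percent-encodes the term, joins with &
print("/solr/select/?" + query)
```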

We have turned on highlighting, applied to the body field.

The test completed in 1648.1 seconds, 63,200 queries were run, and the 
sustained query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total during the period, and the longest Full GC times 
are given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |

  was (Author: cm):
h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - notice a small heap size and regular GC options.

The queries are on the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84hl=onhl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占いhl=onhl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completes in 1648.1 seconds and 63200 queries were run and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period and the longest Full GC times is 
given below. 

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239597#comment-13239597
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:23 PM:
---

h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.
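
The percent-encoded and plain forms above are related by standard UTF-8 URL encoding, which a quick sketch can confirm (the class name here is just illustrative):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Checks that the percent-encoded query string above is simply the
// UTF-8 URL encoding of the plain form 無料占い.
public class QueryEncoding {
    static String encode(String query) {
        return URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("無料占い"));
        // %E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
    }
}
```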

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | GC log |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
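
The "longest Full GC" figures quoted in these comments can be pulled straight out of the {{-verbose:gc}} output. A minimal sketch, assuming the classic HotSpot log line shape {{[Full GC 57558K->36262K(126912K), 0.2926001 secs]}} (the exact format varies across JVM versions and flags):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts Full GC pause times (in seconds) from -verbose:gc output and
// sorts them longest-first, mirroring the tables in these comments.
public class FullGcPauses {
    private static final Pattern FULL_GC =
        Pattern.compile("Full GC \\d+K->\\d+K\\(\\d+K\\), ([0-9.]+) secs");

    static List<Double> longestFirst(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = FULL_GC.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        pauses.sort(Collections.reverseOrder());  // longest pause first
        return pauses;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[GC 61184K->37295K(126912K), 0.0098451 secs]",
            "[Full GC 57558K->36262K(126912K), 0.2926001 secs]",
            "[Full GC 120759K->37151K(126912K), 0.2948184 secs]");
        System.out.println(longestFirst(sample));
        // [0.2948184, 0.2926001]
    }
}
```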

  was (Author: cm):
h3. Test 2 - Searching without highlighting (no indexing)

After the Wikipedia index was built, I ran 250,000 fairly common Japanese 
queries against the index without highlighting, using simple means.

For this test, I was running Java using

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so - small/normal heap size to keep memory pressure a bit high and no fancy GC 
options -- and all of Wikipedia searchable.  Very nice :)

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84
{noformat}

which is

{noformat}
/solr/select/?q=無料占い
{noformat}

in plain unquoted form.

Running the 250,000 queries took 1838.5 seconds and the test was roughly able 
to keep 80% of its queries within 0.5 second latency and serve a sustained load 
of 142 QPS.

The GC logs have some Full GC entries in them:

|| GC Activity || Time || 
| Full GC 57558K->36262K(126912K) | 0.2926001 secs |
| Full GC 120759K->37151K(126912K) | 0.2948184 secs |
| Full GC 118817K->38305K(126912K) | 0.3726583 secs |
| Full GC 116992K->40203K(126912K) | 0.3688027 secs |
| Full GC 119572K->39070K(126912K) | 0.2896587 secs |
| Full GC 121476K->39257K(126912K) | 0.3034882 secs |
| Full GC 119659K->39451K(126912K) | 0.3078915 secs |
| Full GC 116948K->39770K(126912K) | 0.2407321 secs |
| Full GC 118382K->40442K(126912K) | 0.5224920 secs |

The regular GC entries took a maximum of 0.0731031 seconds, but most took half 
that or less.

|| Attachment || Description ||
| 250k-queries-no-highlight-gc.log | Screenshot from GCViewer |
| 250k-queries-no-highlight-visualvm.png | Screenshot from VisualVM |

GCViewer seems to have problems parsing the 250k-queries-no-highlight-gc.log so 
I'm not attaching a screenshot for this.
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While 

[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239659#comment-13239659
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:25 PM:
---

h3. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - note the small heap size and regular GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completed in 1648.1 seconds, 63200 queries were run, and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period, and the longest Full GC times are 
given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |

  was (Author: cm):
h5. Test 3 - Searching with highlighting (no indexing)

The test is similar to _Test 2_ with highlighting turned on, but only ~62,000 
queries were run.  No indexing was done.

Solr was run as follows

{noformat}
java -verbose:gc -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

and - again - note the small heap size and regular GC options.

The queries are of the form

{noformat}
/solr/select/?q=%E7%84%A1%E6%96%99%E5%8D%A0%E3%81%84&hl=on&hl.fl=body
{noformat}

which is

{noformat}
/solr/select/?q=無料占い&hl=on&hl.fl=body
{noformat}

in unquoted form.

We have turned on highlighting and we are highlighting on the body field.

The test completed in 1648.1 seconds, 63200 queries were run, and the 
sustainable query rate was 47 QPS.

Turning on highlighting has a fairly significant performance penalty if we 
compare QPS to the non-highlighting case where we could sustain 142 QPS.

There is also increased memory pressure with highlighting turned on.  There 
were 652 Full GC events in total in the period, and the longest Full GC times are 
given below.

|| Longest Full GC times (seconds) ||
|0.9769069|
|0.8564934|
|0.7585956|
|0.7084318|
|0.6928327|
|0.6781336|
|0.6358398|
|0.6099899|
|0.5628532|
|0.5540237|
|0.5443075|
|0.5429399|
|0.5423989|
|...|

The extra memory pressure can also be seen in the VisualVM screenshot.  I 
believe the root cause of this is the highlighting.

|| Attachment || Description ||
| 62k-queries-highlight-gc.log|  GC log |
| 62k-queries-highlight-visualvm.png|  Screenshot from VisualVM |
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.


[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/27/12 5:46 PM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
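
The load generator itself isn't included in this issue, so the following is only an assumption about how a fixed 10 QPS rate could be driven; the query-sending body is left as a stub:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Drives a task at a fixed rate (here 10 QPS) using scheduleAtFixedRate.
// The task body is a stand-in; in the real test it would issue the Solr
// requests shown above.
public class ConstantRateLoad {
    static long intervalMicros(int qps) {
        return 1_000_000L / qps;  // interval between task starts
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger sent = new AtomicInteger();
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            sent::incrementAndGet,            // stand-in for one query
            0, intervalMicros(10), TimeUnit.MICROSECONDS);
        TimeUnit.SECONDS.sleep(1);            // run briefly for demonstration
        scheduler.shutdownNow();
        System.out.println("queries sent: " + sent.get());  // roughly 10
    }
}
```

{{scheduleAtFixedRate}} keeps the average rate constant even when individual requests are slow, which matches the "constant 10 QPS" description better than a sleep after each request would.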

  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching at a 
constant 10 QPS rate.  The queries in this test are identical to those run above

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/28/12 2:37 AM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |




  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test has now been running for 15 minutes and I'll let it run for hours.  
I'll post details later. :)
  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png, 
 long-query-indexing-gc.log, long-search-indexing-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * Memory and heap usage looks stable over time
 * Garbage collection is overall low over time -- no Full-GC issues
 I'll post findings and results to this JIRA.




[jira] [Issue Comment Edited] (SOLR-3282) Perform Kuromoji/Japanese stability test before 3.6 freeze

2012-03-27 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239714#comment-13239714
 ] 

Christian Moen edited comment on SOLR-3282 at 3/28/12 2:48 AM:
---

h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS with highlighting.  The queries in this 
test are identical to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |




  was (Author: cm):
h3. Test 4 - Combined search and indexing test

In this test, we are both indexing all of Wikipedia while searching.

The search rate is a constant 10 QPS.  The queries in this test are identical 
to those run above and they are also unique.

Solr is started using

{noformat}
java -verbose:gc -Xmx256m  -Dfile.encoding=UTF-8 -jar start.jar
{noformat}

so I've given it a little more heap because of the memory pressure issue seen 
in _Test 3_.

The indexing posts the XML described in _Test 1_ - each file contains 1,000 
documents - and, unlike in _Test 1_, we now do a commit after each post.  No 
optimize is being done.

The test had been running for 8 hours and 33 minutes before I stopped it and 
312,900 queries were run.  Japanese Wikipedia was indexed 23 times.

Full GC occurred 84 times and the maximum heap-size provided to the VM was 
allocated.  The longest Full GC times are given below.

|| Longest Full GC (seconds) ||
|1.0789668|
|1.0518156|
|1.0288781|
|0.9973905|
|0.9799409|
|0.9582144|
|0.9555027|
|0.9517524|
|0.9456611|
|0.9387380|
|0.9313493|
|0.9117388|
|0.8771426|
|...|


The longest regular (non-Full) GC times are below.

|| Longest non-Full GC (seconds) ||
|0.1375324|
|0.1206866|
|0.1009028|
|0.0952712|
|0.0928364|
|...|

The VisualVM screenshot suggests that the VM is nice and stable.  It might be 
good to provide a little more maximum heap-space than 256MB to index all of 
Japanese Wikipedia and serve 10 QPS to have a little more headroom, but 256MB 
seems quite fine.

|| Attachment || Description ||
| long-query-indexing-gc.log | GC log |
| long-search-indexing-visualvm.png | VisualVM screenshot |



  
 Perform Kuromoji/Japanese stability test before 3.6 freeze
 --

 Key: SOLR-3282
 URL: https://issues.apache.org/jira/browse/SOLR-3282
 Project: Solr
  Issue Type: Task
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Assignee: Christian Moen
 Attachments: 250k-queries-no-highlight-gc.log, 
 250k-queries-no-highlight-visualvm.png, 62k-queries-highlight-gc.log, 
 62k-queries-highlight-visualvm.png, jawiki-index-gc.log, 
 jawiki-index-gcviewer.png, jawiki-index-visualvm.png, 
 long-query-indexing-gc.log, long-search-indexing-visualvm.png


 Kuromoji might be used by many and also in mission critical systems.  I'd 
 like to run a stability test before we freeze 3.6.
 My thinking is to test the out-of-the-box configuration using fieldtype 
 {{text_ja}} as follows:
 # Index all of Japanese Wikipedia documents (approx. 1.4M documents) in a 
 never ending loop
 # Simultaneously run many tens of thousands typical Japanese queries against 
 the index at 3-5 queries per second with highlighting turned on
 While Solr is indexing and searching, I'd like to verify that:
 * Indexing and queries are working as expected
 * 

[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:44 AM:
--

Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.
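
The emission idea above can be sketched as enumerating every prefix of the greedy unknown-word match and offering each to the lattice; the method name is illustrative and not actual Kuromoji API:

```java
import java.util.ArrayList;
import java.util.List;

// Emits every prefix of an unknown word so the lattice has shorter
// segmentation options available, instead of only the greedy full match.
public class UnknownWordPrefixes {
    static List<String> prefixes(String unknownWord) {
        List<String> candidates = new ArrayList<>();
        for (int end = 1; end <= unknownWord.length(); end++) {
            candidates.add(unknownWord.substring(0, end));
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("トートバッグ"));
        // [ト, トー, トート, トートバ, トートバッ, トートバッグ]
    }
}
```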

  was (Author: cm):
Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a look at this by looking into the lattice details.
  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0
 Environment: CentOS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have the capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but this cannot be applied to 
 every Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ in its entries.  I would like to apply the decompose feature to every 
 Katakana token whose sub-tokens are in the dictionary, or add the capability 
 to force the decompose feature on every Katakana token.




[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238253#comment-13238253
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/26/12 10:57 AM:
--

Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this might be possible by changing 
how we emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ we'll 
find a known word.  When the Viterbi runs, it's likely to choose トート and バッグ as 
its best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.  We are sort of hacking the model here so we also need 
to consider side-effects.

  was (Author: cm):
Hello, Kazu.  Long time no see -- I hope things are well!

This is a very good feature request.  I think this is possible by changing how we 
emit unknown words, i.e. by not emitting them as greedily and giving the 
lattice more segmentation options.  For example, if we find an unknown word 
トートバッグ (by regular greedy matching), we can emit

{noformat}
ト
トー
トート
トートバ
トートバッ
トートバッグ
{noformat}

in the current position.  When we reach the position that starts with バッグ, 
we'll find a known word, and when the Viterbi runs, it's likely to choose トート 
and バッグ as the best path.

Let me have a play by looking into the lattice details and see if something 
like this is feasible.
  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0
 Environment: CentOS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have the capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but this cannot be applied to 
 every Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ in its entries.  I would like to apply the decompose feature to every 
 Katakana token whose sub-tokens are in the dictionary, or add the capability 
 to force the decompose feature on every Katakana token.




[jira] [Issue Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-03-26 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239195#comment-13239195
 ] 

Christian Moen edited comment on LUCENE-3921 at 3/27/12 5:32 AM:
-

I've been experimenting with the idea outlined above and I thought I should 
share some very early results.

The improvement here is basically to give the compound splitting heuristic an 
improved ability to split unknown words that are part of compounds.  
Experiments I've run using our compound splitting test cases suggest that 
the effect is indeed positive.  The improved heuristic is able to handle some 
of the test cases that we couldn't handle earlier, but all of this requires 
further experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also 
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it 
also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or develop a more advanced heuristic 
that remedies this, but I haven't had a chance to look further into this.  
Also, any change here would require extensive testing and validation.  See the 
evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, 
but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting 
katakana words that start with ン、ッ、ー, since we don't want tokens that start 
with these characters, and consider adding this as an option to the tokenizer 
if it works well.

Having said this, there are real limits to what we can achieve by hacking the 
statistical model (and it also affects our karma, you know...).  The approach 
above also has performance and memory impact.  We'd need to introduce a fairly 
short limit on how long unknown words can be, and this could perhaps apply only 
to unknown katakana words. The length restriction will be big enough to not 
have any practical impact on segmentation, though.

An alternative approach to all of this is to build some lexical assets.  I 
think we'd get pretty far for katakana if we apply some of the corpus-based 
compound-splitting algorithms European NLP researchers have developed.  Some of 
these algorithms are pretty simple and quite effective.
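For instance, one simple scheme from that literature, a corpus-frequency splitter that accepts a split when the geometric mean of the parts' frequencies beats the frequency of the whole word, could be sketched as follows; the frequency counts are invented for illustration:

```python
# Hypothetical sketch: split a compound at the position where the geometric
# mean of the parts' corpus frequencies beats the whole word's frequency.
# The counts below are made up; a real splitter would use corpus statistics.
def best_split(word, freq, min_part=2):
    best_parts, best_score = [word], float(freq.get(word, 0))
    for i in range(min_part, len(word) - min_part + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = (freq[left] * freq[right]) ** 0.5  # geometric mean
            if score > best_score:
                best_parts, best_score = [left, right], score
    return best_parts

freq = {"ショルダー": 500, "バッグ": 800, "ショルダーバッグ": 20}
```

Here sqrt(500 * 800) is roughly 632, which beats the whole-word count of 20, so ショルダーバッグ is split into ショルダー and バッグ; words with no known parts are left whole.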

Thoughts?


  was (Author: cm):
I've been experimenting with the idea outlined above and I thought I should 
share some very early results.

The improvement here is basically to give the compound splitting heuristic an 
improved ability to split unknown words that are part of compounds.  
Experiments I've run using using our compound splitting test cases suggest that 
the effect is indeed positive.  The improved heuristic is able to handle some 
of the test case that we couldn't do earlier, but all of this requires further 
experimentation and validation.

I've been able to segment トートバッグ (tote bag with トート being unknown) and also 
ショルダーバッグ (shoulder bag) as you would like with some weight tweaks, but then it 
also segmented エンジニアリング (engineering) into エンジニア (engineer) リング (ring).

It might be possible to tune this up or developer a more advanced heuristic 
that remedies this, but I haven't had a chance to look further into this.  
Also, any change here would require extensive testing and validation.  See the 
evaluation attached to LUCENE-3726 that was done on Wikipedia for search mode.

Please note that there will not be time to provide improvements here for 3.6, 
but we can follow up on katakana segmentation for 4.0.

With the above idea for katakana in mind, I'm thinking we can skip emitting 
katakana words that start with ン、ッ、ー since we don't want tokens that start with 
these characters and consider adding this as an option to the tokenizer if it 
works well.

Having said this, there are real limits to what we can achieve by hacking the 
statistical model (and it also affects our karma, you know...).  The approach 
above also has performance and memory impact.  We'd need to introduce a fairly 
short limits to how long unknown words can be and this can perhaps only apply 
to unknown katakana words. The length restriction will be big enough to not 
have any practical impact on segmentation, though.

An alternative approach to all of this is to build some lexical assets.  I 
think we'd get pretty far for katakana if we apply some of the corpus-based 
compound-splitting algorithms Europeans NLP researchers have developed.  These 
algorithms are simple and quite effective.

Thoughts?

  
 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: 

[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants

2012-03-24 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237475#comment-13237475
 ] 

Christian Moen edited comment on LUCENE-3901 at 3/24/12 8:10 AM:
-

Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue in 
1304728.

  was (Author: cm):
Committed revision 1304727 on {{branch_3x}}.
  
 Add katakana stem filter to better deal with certain katakana spelling 
 variants
 ---

 Key: LUCENE-3901
 URL: https://issues.apache.org/jira/browse/LUCENE-3901
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
Assignee: Christian Moen
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3901.patch, LUCENE-3901.patch, LUCENE-3901.patch


 Many Japanese katakana words end in a long sound that is sometimes optional.
 For example, パーティー and パーティ are both perfectly valid for party.  Similarly 
 we have センター and センタ that are variants of center as well as サーバー and サーバ 
 for server.
 I'm proposing that we add a katakana stemmer that removes this long sound if 
 the terms are longer than a configurable length.  It's also possible to add 
 the variant as a synonym, but I think stemming is preferred from a ranking 
 point of view.
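A minimal sketch of the proposed stemming rule (plain Python, not the actual filter; the default minimum length of 3 is an assumption for illustration):

```python
# Hypothetical sketch of the katakana stemmer proposed above: remove a
# trailing long-vowel mark (ー) when the katakana term is longer than a
# configurable minimum length. The default of 3 is an assumed value.
def stem_katakana(term, min_length=3):
    if len(term) > min_length and term.endswith("ー"):
        return term[:-1]  # drop the trailing long sound
    return term
```

With this rule パーティー stems to パーティ and サーバー to サーバ, while short terms like キー are left untouched.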




[jira] [Issue Comment Edited] (LUCENE-3901) Add katakana stem filter to better deal with certain katakana spelling variants

2012-03-24 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237475#comment-13237475
 ] 

Christian Moen edited comment on LUCENE-3901 at 3/24/12 10:48 AM:
--

Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue in 
revisions 1304728 and 1304741.

  was (Author: cm):
Committed revision 1304727 on {{branch_3x}}.  Fixed a small javadoc issue 
in 1304728.
  
 Add katakana stem filter to better deal with certain katakana spelling 
 variants
 ---

 Key: LUCENE-3901
 URL: https://issues.apache.org/jira/browse/LUCENE-3901
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Christian Moen
Assignee: Christian Moen
 Fix For: 3.6, 4.0

 Attachments: LUCENE-3901.patch, LUCENE-3901.patch, LUCENE-3901.patch






[jira] [Issue Comment Edited] (LUCENE-3819) Clean up what we show in right side bar of website.

2012-02-22 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13213740#comment-13213740
 ] 

Christian Moen edited comment on LUCENE-3819 at 2/22/12 4:21 PM:
-

+1, Mark.  +1, Yonik.  

I also think it might be useful to have download shortcuts to Lucene Core 
(Java) and Solr available in the sidebar from http://lucene.apache.org/.  
Perhaps Download could be made a standard sidebar item for the 
subprojects?

 

  was (Author: cm):
+1, Mark.  +1, Yonik.  

I also think it might be useful to have download shortcuts to Lucene Core 
(Java) and Solr available in the sidebar from http://lucene.apache.org/.  
Perhaps Download could be considered to be a standard sidebar item.  (I quite 
like the red download button, though! :))

 
  
 Clean up what we show in right side bar of website.
 ---

 Key: LUCENE-3819
 URL: https://issues.apache.org/jira/browse/LUCENE-3819
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor

 I'd love to remove a couple things - it's pretty crowded on the right side 
 bar. I find the latest JIRA and email displays are hard to read, tend to 
 format badly, and don't offer much value.
 I'd like to remove them and just leave svn commits and twitter mentions 
 (which are much easier to read and format better). Will help with some info 
 overload on each page.




[jira] [Issue Comment Edited] (SOLR-3115) Improve default Japanese stopwords.txt description

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204459#comment-13204459
 ] 

Christian Moen edited comment on SOLR-3115 at 2/9/12 11:34 AM:
---

A patch for {{trunk}} is attached with an improved description for Lucene 
({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}).  (The latter was synched 
using {{sync-analyzers}} -- useful!)

  was (Author: cm):
A patch is attached with an improved description for Lucene 
({{stopwords.txt}}) and Solr ({{stopwords_ja.txt}}).  (The latter was synched 
using {{sync-analyzers}} -- useful!)
  
 Improve default Japanese stopwords.txt description
 --

 Key: SOLR-3115
 URL: https://issues.apache.org/jira/browse/SOLR-3115
 Project: Solr
  Issue Type: Improvement
  Components: Rules
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Priority: Minor
 Attachments: SOLR-3115.patch


 As discussed in SOLR-3056, the description in the default Japanese 
 stopwords.txt should be improved to describe case- and width-handling.




[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464
 ] 

Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM:
-

I've updated the patch to now use a {{StopFilter}} that ignores case.

  was (Author: cm):
Updated patch that now uses a {{StopFilter}} that ignores case.
  
 Align default Japanese configurations for Lucene and Solr
 -

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3751.patch, LUCENE-3751.patch, LUCENE-3751.patch


 The {{KuromojiAnalyzer}} in Lucene should have the same default configuration 
 as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.




[jira] [Issue Comment Edited] (LUCENE-3751) Align default Japanese configurations for Lucene and Solr

2012-02-09 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204464#comment-13204464
 ] 

Christian Moen edited comment on LUCENE-3751 at 2/9/12 11:53 AM:
-

I've updated the patch to now use a {{StopFilter}} that ignores case.  I think 
this is good to go.

  was (Author: cm):
I've updated the patch to now use a {{StopFilter}} that ignores case.
  
 Align default Japanese configurations for Lucene and Solr
 -

 Key: LUCENE-3751
 URL: https://issues.apache.org/jira/browse/LUCENE-3751
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: LUCENE-3751.patch, LUCENE-3751.patch, LUCENE-3751.patch


 The {{KuromojiAnalyzer}} in Lucene should have the same default configuration 
 as the {{text_ja}} field type introduced in {{schema.xml}} by SOLR-3056.




[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-08 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203616#comment-13203616
 ] 

Christian Moen edited comment on SOLR-3056 at 2/8/12 2:05 PM:
--

Thanks a lot, Robert.

bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configurations.

I created LUCENE-3751 with a patch earlier to make sure the default Lucene and 
Solr configurations are aligned.  Sorry for not pointing this out clearly by 
linking the JIRAs.

  was (Author: cm):
Thanks a lot, Robert.

bq. I'll open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configurations.

I created LUCENE-3751 with a patch earlier make sure the default Lucene and 
Solr configurations are aligned.  Sorry for not pointing this out clearly.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056.patch, SOLR-3056_move.patch, 
 SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch


 Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
 Robert, Uwe and Simon). It would be very good to get a default field type 
 defined for Japanese in {{schema.xml}} so we get good Japanese out-of-the-box 
 support in Solr.
 I've been playing with the below configuration today, which I think is a 
 reasonable starting point for Japanese.  There's a lot to be said about the 
 various considerations necessary when searching Japanese, but perhaps a wiki 
 page is more suitable to cover the wider topic?
 In order to make the below {{text_ja}} field type work, Kuromoji itself and 
 its analyzers need to be visible to the Solr classloader.  However, these are 
 currently in contrib and I'm wondering if we should consider moving them to 
 core to make them directly available.  If there are concerns about additional 
 memory usage, etc. for non-Japanese users, we can make sure resources are 
 loaded lazily and only when needed in factory-land.
 Any thoughts?
 {code:xml}
 <!-- Text field type suitable for Japanese text using morphological analysis
      NOTE: Please copy the files
        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
        dist/apache-solr-analysis-extras-x.y.z.jar
      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
      (x.y.z refers to a version number)
      If you would like to optimize for precision, set the default operator to
      AND with
        <solrQueryParser defaultOperator="AND"/>
      below (this file).  Use OR if you would like to optimize for recall
      (default).
 -->
 <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
            autoGeneratePhraseQueries="false">
   <analyzer>
     <!-- Kuromoji Japanese morphological analyzer/tokenizer
          Use search mode to get a noun-decompounding effect useful for search.
          Example:
            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai)
            国際 (international) 空港 (airport), so we get a match for 空港
            (airport) as we would expect from a good search engine.
          Valid values for mode are:
            normal:   default segmentation
            search:   segmentation useful for search (extra compound splitting)
            extended: search mode with unigramming of unknown words
                      (experimental)
          NOTE: Search mode improves segmentation for search at the expense of
          part-of-speech accuracy.
     -->
     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
     <!-- Reduces inflected verbs and adjectives to their base/dictionary
          forms (辞書形) -->
     <filter class="solr.KuromojiBaseFormFilterFactory"/>
     <!-- Optionally remove tokens with certain parts of speech
     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
             tags="stopTags.txt" enablePositionIncrements="true"/> -->
     <!-- Normalizes full-width romaji to half-width and half-width kana to
          full-width (Unicode NFKC subset) -->
     <filter class="solr.CJKWidthFilterFactory"/>
     <!-- Lower-cases romaji characters -->
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
 </fieldType>
 {code}


[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-08 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203662#comment-13203662
 ] 

Christian Moen edited comment on SOLR-3056 at 2/8/12 3:45 PM:
--

Thanks, Robert.

I was thinking to leave the {{StopFilter}} case-sensitive as I thought not 
having it normalized would give us flexibility, but it's also prone to error 
and surprises.  I think it's reasonable to make the default ignore case to 
support adding English or other romaji terms to the stopset with ease.

However, if we go down this path, we might also want to do 
width-normalization for the Japanese stopset to make sure there's no confusion 
with that, either.  I suggest that we resolve that as a separate issue and just 
document this clearly in the stopset file.

I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, 
though, so that users won't need to reorder the chain in case they want 
case-sensitive stopping.

I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} 
field type to ignore case in their {{StopFilter}} tomorrow.
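To illustrate the width-handling concern: Unicode NFKC normalization (of which the width folding done by {{CJKWidthFilter}} is a subset) folds full-width romaji to half-width and half-width kana to full-width, so a stopset entry written in the "wrong" width would silently miss the normalized token stream. A quick check in Python, using NFKC as a stand-in:

```python
# Width normalization as discussed above, demonstrated via NFKC (a superset
# of what CJKWidthFilter does): full-width romaji folds to half-width, and
# half-width katakana folds to full-width.
import unicodedata

def width_fold(s):
    return unicodedata.normalize("NFKC", s)
```

For example, full-width Ｌｕｃｅｎｅ folds to Lucene, and half-width ｻｰﾊﾞｰ folds to サーバー.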

  was (Author: cm):
Thanks, Robert.

I was thinking to leave the {{StopFilter}} case-sensitive as I thought not 
having it normalized would give us flexibility, but it's also prone to error 
and surprises.  I think it's reasonable to do make the default ignore case to 
support adding English or other romaji terms to the stopset with ease.

However, if we following down this path path, we might also want to do 
width-normalization for the Japanese stopset to make sure there's no confusion 
with that, either.  I suggest that we resolve that as a separate issue.

I think it's still reasonable to leave the {{LowerCaseFilter}} last as-is, 
though, so that users won't need to reorder the chain in case they want 
case-sensitive stopping.

I'll update the configuration in both {{KuromojiAnalyzer}} and the {{text_ja}} 
field type to ignore case in their {{StopFilter}} tomorrow.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056.patch, SOLR-3056_move.patch, 
 SOLR-3056_schema40.patch, SOLR-3056_schema40.patch, SOLR-3056_schema40.patch



[jira] [Issue Comment Edited] (SOLR-3056) Introduce Japanese field type in schema.xml

2012-02-01 Thread Christian Moen (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197952#comment-13197952
 ] 

Christian Moen edited comment on SOLR-3056 at 2/1/12 5:06 PM:
--

Robert, let's enable stop-words and stop-tags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always 
used in combination with a stoptags filter.  I'll look into both of these.

Also, if we're using search mode, part-of-speech F-score will decrease, so we 
might want to rely more on stopwords than on stoptags if it goes down by a 
whole lot.  However, since tokens agree in 99.7% of the cases based on the 
tests I did earlier -- and the part-of-speech tags we'd typically use as stop 
tags aren't involved with the token splits done by search mode -- I don't 
expect this to be an issue, but it's something to keep in mind.

I'll run some tests to verify this and follow up by suggesting configuration.

I'll open up a separate JIRA for stopwords and stoptags, and aligning the Solr 
and Lucene default configurations.

  was (Author: cm):
Robert, Let's enable stop-words and stop-tags by default.

The stopwords list in the Lucene analyzer looks too small unless it's always 
used in combination with a stoptags filter.  I'll look into both of these.

Also, if we're using search mode, part-of-speech F will decrease so we might 
want to rely more on stopwords rather than stoptags if it goes down by a whole 
lot.  However, since tokens agree in 99.7% of the cases based on the tests I 
did earlier and the part-of-speech tags we'd typically use as stop tags aren't 
involved with tokens split by search mode, I don't expect this to be a real 
issue, but it's something to keep in mind.

I'll do some testing to verify this and I'll follow up with further 
improvements to configuration.

I'll also open up a separate JIRA for stopwords and stoptags, and aligning the 
Solr and Lucene default configuration.
  
 Introduce Japanese field type in schema.xml
 ---

 Key: SOLR-3056
 URL: https://issues.apache.org/jira/browse/SOLR-3056
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
 Attachments: SOLR-3056_move.patch, SOLR-3056_schema40.patch, 
 SOLR-3056_schema40.patch

