from:"Lance Norskog $JIRA$"

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2018-04-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428665#comment-16428665
 ] 

Lance Norskog commented on LUCENE-2899:
---

I apologize, [~Fatalityap], but I cannot help here. I have not worked with Solr 
for a few years.



> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 7.3, master (8.0)
>
> Attachments: LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2018-04-05 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426706#comment-16426706
 ] 

Lance Norskog commented on LUCENE-2899:
---

No, I think you may need to copy the model files to the right directory on
each SolrCloud server via your own custom script.
Or, put the files on a network share and mount that share on each
SolrCloud server, using the same drive letter on all servers.
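
A custom copy script along these lines could be quite small. The following is only a sketch: it assumes `scp` is available on the machine running it, and the host names, model file name and destination directory in the usage note are hypothetical placeholders, not values from this thread.

```python
import subprocess


def build_scp_commands(model_files, hosts, dest_dir):
    """Return one scp argument list per (host, file) pair; nothing is executed."""
    return [
        ["scp", str(path), f"{host}:{dest_dir}/"]
        for host in hosts
        for path in model_files
    ]


def distribute(model_files, hosts, dest_dir):
    """Run each copy in turn and stop on the first failure."""
    for cmd in build_scp_commands(model_files, hosts, dest_dir):
        subprocess.run(cmd, check=True)
```

For example, `distribute(["en-ner-person.bin"], ["solr1.example.com", "solr2.example.com"], "/var/solr/models")` would attempt one copy per server. Building the commands separately from running them makes it possible to inspect what would be copied first.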

On Thu, Apr 5, 2018 at 1:31 AM, Alexey Ponomarenko (JIRA) 




-- 
Lance Norskog
lance.nors...@gmail.com
Redwood City, CA



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2018-04-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426256#comment-16426256
 ] 

Lance Norskog commented on LUCENE-2899:
---

I'm so cheered up that [~steve_rowe] picked this up and added it to Solr!


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2018-04-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426251#comment-16426251
 ] 

Lance Norskog commented on LUCENE-2899:
---

The last time I read up on ZK, files were limited to 1 MB. The ZK "file system" 
is intended for small configuration files. NLP models can be many megabytes. 
You might need an alternate path (scp) to distribute NLP models. On Windows, 
SMB file sharing would serve the same purpose.
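
To see which files would need the alternate path, a rough size check is enough. This sketch assumes the 1 MB figure quoted above as the cutoff (the real limit depends on how ZooKeeper is configured, for example its `jute.maxbuffer` setting), and the function name is made up for illustration.

```python
from pathlib import Path

# Assumed cutoff: the roughly 1 MB per-file limit mentioned above.
ZK_LIMIT_BYTES = 1024 * 1024


def split_by_zk_limit(paths, limit=ZK_LIMIT_BYTES):
    """Split files into (small enough for ZK, must be distributed another way)."""
    fits, too_big = [], []
    for path in map(Path, paths):
        if path.stat().st_size < limit:
            fits.append(path)
        else:
            too_big.append(path)
    return fits, too_big
```

Small configuration files land in the first list; a multi-megabyte NLP model lands in the second and would go out via scp or a file share instead.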


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2016-10-21 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15596885#comment-15596885
 ] 

Lance Norskog commented on LUCENE-2899:
---

I don't remember if it's always or just seldom. It was just something I noticed 
when testing them. I'm not an NLP researcher, and I've been out of the Solr 
world for years. It sounds like Joern Kottmann knows his way around this stuff.


> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2016-07-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396746#comment-15396746
 ] 

Lance Norskog commented on LUCENE-2899:
---

It's really cool to see someone with clout pick this up & modernize it.

Cheers,

Lance Norskog


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2015-10-10 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952086#comment-14952086
 ] 

Lance Norskog commented on LUCENE-2899:
---

I don't work in this area anymore. Somebody else will have to make an 
up-to-date patch, and you will need to find a committer to champion it.

A tech report of a real-life deployment would be a great way to persuade 
someone.



> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 4.9, Trunk
>
> Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
> OpenNLPFilter.java, OpenNLPTokenizer.java

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-12-30 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859006#comment-13859006
]

Lance Norskog commented on LUCENE-2899:
---

All fair criticisms.

About UIMA: clearly it is much more advanced than this design, but I'm not
smart enough to use it :) I've tried to put together something useful (a few
times) and each time was completely confused. I learn by example, and the
examples are limited. Also there is very little traffic on the mailing lists
etc. about UIMA.

About payloads vs. internal attributes: the examples don't use this feature,
but payloads are stored in the index. This supports a question-answering
system. Add PERSON payloads to all records at index time, then, when someone
asks "who is X?", search for word X AND 'payload PERSON anywhere'. The tagging
happens during indexing, not during searching. A better design would be to add
PERSON as a synonym rather than a payload. I also don't see much traffic about
payloads.
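
The synonym variant can be illustrated with a toy inverted index. This is a plain-Python sketch, not Lucene code: the `PERSON` marker term, the whitespace tokenizer and the hard-coded name set (standing in for the output of a named-entity model) are all assumptions made for illustration.

```python
from collections import defaultdict

PERSON = "PERSON"  # hypothetical marker term; upper case so it cannot collide with lowercased words


def analyze(text, person_names):
    """Yield (term, position); a recognized name also emits PERSON at the same position."""
    for pos, word in enumerate(text.lower().split()):
        yield word, pos
        if word in person_names:
            # Same position as the name, the way a synonym would be indexed.
            yield PERSON, pos


def build_index(docs, person_names):
    """Map term -> doc id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for term, pos in analyze(text, person_names):
            index[term][doc_id].append(pos)
    return index


def search_all(index, *terms):
    """Boolean AND: ids of documents that contain every given term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()
```

With the tag indexed as an ordinary term, "X AND PERSON" becomes a normal boolean query, for example `search_all(index, "patch", PERSON)`, and no payload-aware query is needed at search time.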

About doing this in the analysis pipeline vs. upstream: yes, update request
processors (URPs) are the right place for this in Solr. But URPs don't exist in
ES or in plain Lucene code.

Add OpenNLP Analysis capabilities as a module
-

Key: LUCENE-2899
URL: https://issues.apache.org/jira/browse/LUCENE-2899
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 4.7

Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch,
OpenNLPFilter.java, OpenNLPTokenizer.java


[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-12-30 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859006#comment-13859006
]

Lance Norskog edited comment on LUCENE-2899 at 12/30/13 7:19 PM:
-

All fair criticisms.

About UIMA: clearly it is much more advanced than this design, but I'm not
smart enough to use it :). I've tried to put together something useful (a few
times) and each time was completely confused. I learn by example, and the
examples are limited. Also there is very little traffic on the mailing lists
etc. about UIMA.

About doing this in the analysis pipeline vs. upstream: yes, update request
processors (URPs) are the right place for this in Solr. But URPs don't exist in
ES or in plain Lucene code.

was (Author: lancenorskog):
All fair criticisms.

About doing this in the analysis pipeline vs. upstream: yes, update request
processors (URPs) are the right place for this in Solr. But URPs don't exist in
ES or in plain Lucene code.

Add OpenNLP Analysis capabilities as a module
-

Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch,
OpenNLPFilter.java, OpenNLPTokenizer.java


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-12-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859009#comment-13859009
 ] 

Lance Norskog commented on LUCENE-2899:
---

JWNL is the Java WordNet Library. Lucene has a WordNet parser for use as a synonym filter.
http://lucene.apache.org/core/4_0_0/analyzers-common/index.html?org/apache/lucene/analysis/synonym/SynonymMap.html

I don't know how to use this from a Solr filter factory. Please ask this on the 
Solr mailing list.



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-07 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13816277#comment-13816277
 ] 

Lance Norskog commented on LUCENE-2899:
---

The solrconfig.xml file should have these lines in the library set:
{quote}
  <lib dir="../../../contrib/opennlp/lib" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-opennlp-\d.*\.jar" />
{quote}

Also, you have to copy 
{{lucene/build/analysis/opennlp/lucene-analyzers-opennlp*.jar}} to 
{{solr/contrib/opennlp/lib/}}.
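
To check which jars the second pattern would pick up, the regex can be tried outside Solr. This sketch assumes the regex has to match the whole file name, and the jar names in the usage note are hypothetical examples rather than a listing of a real dist/ directory.

```python
import re

# The pattern from the second lib line above.
LIB_PATTERN = re.compile(r"solr-opennlp-\d.*\.jar")


def matching_jars(file_names, pattern=LIB_PATTERN):
    """Return the file names that the pattern matches in full."""
    return [name for name in file_names if pattern.fullmatch(name)]
```

Under that assumption, `matching_jars(["solr-opennlp-4.5.1.jar", "solr-core-4.5.1.jar"])` keeps only the first name, since the pattern requires the `solr-opennlp-` prefix followed by a digit.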

This last problem was a mess. I have not followed these issues: [SOLR-3664], 
[LUCENE-5249], [LUCENE-5257]. I don't know if they handle the problem I 
described. Shipping this thing as a Lucene/Solr contrib module patch was a 
mistake: it intersects the build code structure in too many places.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java



[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899.patch

This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch 
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-current.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899-x.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch



[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java



[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: opennlp_trunk.patch)


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java



[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-current.patch)


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
 OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: OpenNLPFilter.java)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899.patch, OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812581#comment-13812581
 ] 

Lance Norskog edited comment on LUCENE-2899 at 11/4/13 2:55 AM:


This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch 
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues. To avoid confusion, I have removed all 
old patches.


was (Author: lancenorskog):
This patch includes a fix for the problem where searching twice doesn't work. 
The file is LUCENE-2899.patch 
It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-11-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: (was: LUCENE-2899-x.patch)

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-RJN.patch, LUCENE-2899.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-10-23 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802982#comment-13802982
 ] 

Lance Norskog commented on LUCENE-2899:
---

Hi-

The latest patch is LUCENE-2899-x.patch, pls try that. Also, apply it with:
patch -p0 < patchfile

Lance




 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.6

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp




[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-08-03 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13728617#comment-13728617
 ] 

Lance Norskog commented on LUCENE-2899:
---

Wow! Brat looks bitchin! Looking forward to using it.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 5.0, 4.5

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-07-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725911#comment-13725911
 ] 

Lance Norskog commented on LUCENE-2899:
---

Yup! Another NER is always helpful.  But the big problem with NLP software is 
not the code but the models- do you have a good source of free models? 

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 5.0, 4.5

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-16 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

Fixed the Chunker problem. I switched to the newly released version of the 
OpenNLP packages. The MaxEnt implementation (statistical modeling) for chunking 
changed slightly, and my test data now produces different noun and verb phrase 
chunks for the sample text.

At this point the only problems I know of are that the licenses are slightly 
wrong, and so 'ant validate' fails.

These comments only apply to LUCENE-2899-x.patch, which applies to the current 
4.x and trunk codelines. LUCENE-2899.patch applies to the 4.0-4.3 
releases. It is not upgraded to the new OpenNLP release.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, LUCENE-2899-x.patch, OpenNLPFilter.java, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-10 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679293#comment-13679293
 ] 

Lance Norskog edited comment on LUCENE-2899 at 6/10/13 8:56 AM:


I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of this to this issue. Please try it 
and see if it fixes what you see.



  was (Author: lancenorskog):
I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. Substitute this OpenNLPFilter.java for your version and see if that 
fixes the problem for you.
  
 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Comment Edited] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-10 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679293#comment-13679293
 ] 

Lance Norskog edited comment on LUCENE-2899 at 6/10/13 5:45 PM:


I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of this to this issue. Please try it 
and see if it fixes what you see.

A-a-a-a-a-a-n-n-n-n-d chunking is broken. Oy.



  was (Author: lancenorskog):
I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. I have attached a fixed version of this to this issue. Please try it 
and see if it fixes what you see.


  
 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-09 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: OpenNLPFilter.java

I did not make the right changes to OpenNLPFilter.java to handle the API 
changes. Substitute this OpenNLPFilter.java for your version and see if that 
fixes the problem for you.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-06 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677336#comment-13677336
 ] 

Lance Norskog commented on LUCENE-2899:
---

Yup- upgrading to 1.5.3 is next on the list.

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, 
 opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-05 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676698#comment-13676698
 ] 

Lance Norskog commented on LUCENE-2899:
---

I found the problem with multiple documents. The API for reusing Tokenizers 
changed to something more sensible, but I only noticed and implemented part of 
the change. The result was that when you upload multiple documents, it just 
re-processed the first document.

File LUCENE-2899-x.patch has this fix. It applies against the 4.x branch and 
the trunk. It does not apply against Lucene 4.0, 4.1, 4.2 or 4.3. For all 
released Solr versions you want LUCENE-2899.patch from August 27, 2012. There 
are no new features since that release.
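
To make the failure mode concrete, here is a minimal sketch of how a reused tokenizer can end up replaying the first document. It is an illustration under stated assumptions, not the code in the patch: BufferingTokenizer, setReader, reset and nextToken are hypothetical names that only mirror the idea of one instance being handed a new Reader per document, and they are not Lucene's Tokenizer API. The point is that any per-document state cached by the tokenizer has to be thrown away when the instance is reused.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizerReuseSketch {

    /** Hypothetical tokenizer that reads the whole input before splitting it. */
    static class BufferingTokenizer {
        private Reader input;
        private String[] words; // state cached for the current document
        private int next;

        /** Called once per document when the instance is reused. */
        void setReader(Reader reader) {
            this.input = reader;
        }

        /**
         * Discards everything cached from the previous document. If 'words'
         * stayed populated here, nextToken() would never read the new Reader
         * and the first document would be replayed.
         */
        void reset() {
            this.words = null;
            this.next = 0;
        }

        /** Returns the next token, or null when the document is exhausted. */
        String nextToken() throws IOException {
            if (words == null) {
                String text = readAll(input).trim();
                words = text.isEmpty() ? new String[0] : text.split("\\s+");
            }
            return next < words.length ? words[next++] : null;
        }

        private static String readAll(Reader reader) throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    /** Tokenizes one document with a shared tokenizer instance. */
    static List<String> tokenize(BufferingTokenizer tokenizer, String text) throws IOException {
        tokenizer.setReader(new StringReader(text));
        tokenizer.reset();
        List<String> tokens = new ArrayList<>();
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        BufferingTokenizer shared = new BufferingTokenizer();
        System.out.println(tokenize(shared, "first document"));
        System.out.println(tokenize(shared, "second document"));
    }
}
```

A test that only ever feeds one document through an instance would not catch this kind of bug, which may be why it showed up only when several documents were uploaded.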

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-05 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 LUCENE-2899-x.patch, OpenNLPFilter.java, OpenNLPTokenizer.java, 
 opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-06-03 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-x.patch

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, LUCENE-2899-x.patch, 
 OpenNLPFilter.java, OpenNLPTokenizer.java, opennlp_trunk.patch



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-05-19 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661758#comment-13661758
]

Lance Norskog commented on LUCENE-2899:
---

I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did
not attempt to analyse text that is longer than the fixed-size temp buffer, so
the code for copying successive buffers was never exercised. Kai's fix handles
this problem. I've added a unit test.
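
The failure mode is easy to reproduce outside Lucene. What follows is a minimal
sketch, not the code from the patch: readAll and the 8-char buffer size are
hypothetical, chosen only for illustration. It drains a Reader through a
fixed-size temp buffer and appends every successive fill, which is the path a
test never reaches when its input fits in a single buffer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferedDrain {
    // Hypothetical helper: read the whole input through a small fixed-size buffer.
    static String readAll(Reader input, int bufferSize) {
        char[] temp = new char[bufferSize];
        StringBuilder full = new StringBuilder();
        try {
            int n;
            // read() may return fewer chars than requested; -1 means end of input.
            while ((n = input.read(temp, 0, temp.length)) != -1) {
                full.append(temp, 0, n); // copy only the chars actually read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return full.toString();
    }

    public static void main(String[] args) {
        // The input is several times longer than the buffer, so the loop runs repeatedly.
        String text = "This sentence is longer than the buffer. So is this one.";
        String copy = readAll(new StringReader(text), 8);
        System.out.println(copy.equals(text));
    }
}
```

A unit test for this kind of bug only needs an input longer than the buffer,
plus an assertion that the text comes back intact.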

Em: the Lucene Tokenizer lifecycle is that the Tokenizer is created with a
Reader, and each call to incrementToken() walks the input. When
incrementToken() returns false, that is all: the Tokenizer is finished.
TokenStream can support a 'stateful' token stream: with OpenNLPFilter, you call
incrementToken() until it returns false, and then you can call reset() and it
will start over from the beginning. The unit tests include a check that reset()
works. The changes you made support a feature that is not supported by Lucene.
Also, the changes break most of the unit tests. Please create a unit test that
shows the bug, and fix the existing unit tests. No unit test = no bug report.
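
To make the contract concrete, here is a toy sketch of it. ToyTokenStream is a
made-up stand-in written for this example; it is not Lucene's TokenStream and
shares no code with OpenNLPFilter. It only shows the shape of the consumer
loop: call incrementToken() until it returns false, then reset() and walk the
same tokens again.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a resettable token stream; not Lucene's TokenStream class.
class ToyTokenStream {
    private final String[] tokens;
    private int next = 0;
    private String current;

    ToyTokenStream(String text) {
        String trimmed = text.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Advance to the next token; false means the stream is exhausted.
    boolean incrementToken() {
        if (next >= tokens.length) {
            return false;
        }
        current = tokens[next++];
        return true;
    }

    String term() {
        return current;
    }

    // Rewind so the same tokens can be walked again.
    void reset() {
        next = 0;
        current = null;
    }

    // Walk the stream to exhaustion and collect the terms.
    static List<String> drain(ToyTokenStream ts) {
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(ts.term());
        }
        return out;
    }

    public static void main(String[] args) {
        ToyTokenStream ts = new ToyTokenStream("Kai found a bug");
        List<String> first = drain(ts);
        ts.reset();
        List<String> second = drain(ts);
        System.out.println(first.equals(second));
    }
}
```

In Lucene itself the exact order of calls a consumer must make has varied
between versions, so check the TokenStream javadoc for the branch you build
against.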

I'm posting a patch for the current 4.x and trunk. It includes some changes for
TokenStream/TokenFilter method signatures, some refactoring in the unit tests,
a little tightening in the Tokenizer and Filter, and Kai's fix. There are unit
tests for the problem Kai found, and also a test that has TokenizerFactory
create multiple Tokenizer streams. If there is a bug in this patch, please
write a unit test which demonstrates it.

The patch is called LUCENE-2899-current.patch. It is tested against the current
4.x branch and the current trunk.

Thanks for your interest and hard work- I know it is really tedious to
understand this code :)

Lance Norskog

Add OpenNLP Analysis capabilities as a module
-

Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch,
LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch,
LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java,
OpenNLPTokenizer.java, opennlp_trunk.patch


[jira] [Updated] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-05-19 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2899:
--

Attachment: LUCENE-2899-current.patch

 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.4

 Attachments: LUCENE-2899-current.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-04-25 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641968#comment-13641968
]

Lance Norskog commented on LUCENE-2899:
---

Maciej- This is a good point. This package needs changes in a lot of places and
it might be easier to package it the way you say.

Zack- The churn in the APIs is a major problem in the Lucene code management.
The original patch worked in the 4.x branch and trunk when it was posted. What
Em fixed is in an area which is very very basic to Lucene. The API changed with
no notice and no change in versions or method names.

Everyone- It's great that this has gained some interest. Please create a new
master patch with whatever changes are needed for the current code base.

Lucene grand masters- Please don't say "hey kids, write plugins, they're cool!"
and then make subtle incompatible changes in APIs.

Add OpenNLP Analysis capabilities as a module
-

Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch,
LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch,
LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java,
opennlp_trunk.patch


[jira] [Closed] (SOLR-1413) Add MockSolrServer to SolrJ client tests

2013-04-17 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Lance Norskog closed SOLR-1413.
---

Resolution: Implemented
Fix Version/s: 3.3

The test infrastructure has had a huge upgrade since 3 years ago. This is no
longer a valid thang.

Add MockSolrServer to SolrJ client tests

Key: SOLR-1413
URL: https://issues.apache.org/jira/browse/SOLR-1413
Project: Solr
Issue Type: Test
Components: clients - java
Environment: Any Solr distribution. Uses only the SolrJ client code,
nothing in the Solr core.
Reporter: Lance Norskog
Priority: Minor
Fix For: 3.3

Attachments: SOLR-1413.patch, SOLR-1413.patch

The SolrJ unit test suite has no mock Solr server for HTTP access, and
there are no low-level tests of the SolrJ HTTP wire protocols.
This patch includes org.apache.solr.client.solrj.MockHTTPServer.java and
org.apache.solr.client.solrj.TestHTTP_XML_single.java. The mock server does
not parse its input and responds with pre-configured byte streams. The test
class does a few tests in the XML wire format. Most of the tests do one request
and set up success and failure responses.
Unfortunately, there is a bug: I could not get 2 successive requests to work.
The mock server's TCP socket does not work when reading the second request.
If someone who knows the JDK socket classes could look at the mock server, I
would greatly appreciate it.
The alternative is to steal a bunch of files from the Apache Commons
HttpClient test suite. This is quite a sophisticated bunch of code:
http://svn.apache.org/repos/asf/httpcomponents/oac.hc3x/trunk/src/test/org/apache/commons/httpclient/server/
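
For reference, a canned-response server of the kind described above can be
sketched with the JDK socket classes alone. This is not the MockHTTPServer from
the patch, and it makes no claim about what caused the bug there; CannedServer
and fetch are hypothetical names. The sketch sidesteps connection reuse by
serving exactly one request per connection: it accepts, skips the request
headers without parsing them, writes the pre-configured bytes, and closes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch of a canned-response mock server: one connection per request.
class CannedServer implements AutoCloseable {
    private final ServerSocket listener;
    private final byte[] response;
    private final Thread worker;

    CannedServer(byte[] response) {
        try {
            // Port 0 asks the OS for any free port on the loopback interface.
            this.listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.response = response;
        this.worker = new Thread(this::serve);
        this.worker.setDaemon(true);
        this.worker.start();
    }

    int port() {
        return listener.getLocalPort();
    }

    private void serve() {
        while (!listener.isClosed()) {
            try (Socket s = listener.accept()) {
                skipHeaders(s.getInputStream());
                OutputStream out = s.getOutputStream();
                out.write(response);
                out.flush();
            } catch (IOException e) {
                // accept() throws once close() is called; the loop condition then ends the thread.
            }
        }
    }

    // Consume bytes up to the blank line that ends the request headers; no parsing.
    private static void skipHeaders(InputStream in) throws IOException {
        int matched = 0; // how much of "\r\n\r\n" has been seen so far
        int b;
        while (matched < 4 && (b = in.read()) != -1) {
            boolean expectCr = (matched % 2 == 0);
            if (b == (expectCr ? '\r' : '\n')) {
                matched++;
            } else {
                matched = (b == '\r') ? 1 : 0;
            }
        }
    }

    @Override
    public void close() {
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: send a bodiless request and read until the server closes the socket.
    static String fetch(int port) {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port)) {
            OutputStream out = s.getOutputStream();
            out.write("GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            InputStream in = s.getInputStream();
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling CannedServer.fetch(server.port()) twice in a row exercises the
two-successive-requests case. The sketch ignores request bodies, so a client
that POSTs data would need the server to read Content-Length bytes before
replying.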


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2013-01-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13568150#comment-13568150
 ] 

Lance Norskog commented on LUCENE-2899:
---

Thank you. Have you tried this on the trunk? The Solr components did not work; 
they could not find the OpenNLP jars.


 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.2, 5.0

 Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch



[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2012-12-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541285#comment-13541285
 ] 

Lance Norskog commented on LUCENE-2899:
---

Wow, someone tried it! I apologize for not noticing your question.

bq. I'm able to get the posTagger working, yet I still have not found a way to 
incorporate either the Chunker or the NER Models into my Solr project.

The schema.xml file includes samples for all of the models:

{{/lusolr_4x_opennlp/solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/schema.xml}}

This is for the chunker. The chunker works from parts-of-speech tags, not the 
original words. The chunker needs a parts-of-speech model as well as a chunker 
model. This should throw an error if the parts-of-speech model is not there. I 
will fix this.

{code:xml}
<filter class="solr.OpenNLPFilterFactory"
  posTaggerModel="opennlp/en-test-pos-maxent.bin"
  chunkerModel="opennlp/en-test-chunker.bin"
/>
{code}
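
The missing check mentioned above (fail when a chunker model is configured
without a parts-of-speech model) could look roughly like this. It is a sketch
under assumptions, not the factory's real code: ChunkerConfigCheck and validate
are hypothetical, and only the attribute names posTaggerModel and chunkerModel
come from the configuration shown.

```java
import java.util.HashMap;
import java.util.Map;

class ChunkerConfigCheck {
    // Hypothetical validation: a chunker model is only usable together with a POS model.
    static void validate(Map<String, String> args) {
        String pos = args.get("posTaggerModel");
        String chunker = args.get("chunkerModel");
        if (chunker != null && pos == null) {
            throw new IllegalArgumentException(
                "chunkerModel requires posTaggerModel: the chunker works from parts-of-speech tags");
        }
    }

    public static void main(String[] args) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("chunkerModel", "opennlp/en-test-chunker.bin");
        try {
            validate(cfg);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at configuration time gives a clear message at startup instead of a
confusing failure the first time a document is analyzed.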

Is the NER configuration still not working?


 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 4.1

 Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch



[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-28 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540735#comment-13540735
]

Lance Norskog commented on SOLR-1972:
-

bq. 2) Solr already includes mahout-math 0.3 as a dependency of carrot2.
I did not mean to suggest using the Mahout libraries. I would just copy the
class source code and change the weights. It has no other dependencies inside
the Mahout project.

Need additional query stats in admin interface - median, 95th and 99th
percentile
-

Key: SOLR-1972
URL: https://issues.apache.org/jira/browse/SOLR-1972
Project: Solr
Issue Type: Improvement
Components: web gui
Affects Versions: 1.4
Reporter: Shawn Heisey
Assignee: Alan Woodward
Priority: Minor
Fix For: 4.2, 5.0

Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch,
elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, leak-closeable.patch,
leak.patch, revert-SOLR-1972.patch, SOLR-1972-branch3x-url_pattern.patch,
SOLR-1972-branch4x.patch, SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch,
SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch,
SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch,
SOLR-1972_metrics.patch, solr1972-metricsregistry-branch4x-failure.log,
SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch,
SOLR-1972-url_pattern.patch, stacktraces.tar.gz

I would like to see more detailed query statistics from the admin GUI. This
is what you can get now:
requests : 809
errors : 0
timeouts : 0
totalTime : 70053
avgTimePerRequest : 86.59209
avgRequestsPerSecond : 0.8148785
I'd like to see more data on the time per request - median, 95th percentile,
99th percentile, and any other statistical function that makes sense to
include. In my environment, the first bunch of queries after startup tend to
take several seconds each. I find that the average value tends to be useless
until it has several thousand queries under its belt and the caches are
thoroughly warmed. The statistical functions I have mentioned would quickly
eliminate the influence of those initial slow queries.
The system will have to store individual data about each query. I don't know
if this is something Solr does already. It would be nice to have a
configurable count of how many of the most recent data points are kept, to
control the amount of memory the feature uses. The default value could be
something like 1024 or 4096.
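
As an illustration of the request above, here is a minimal sketch of a bounded
window of recent request times with percentiles computed on demand. It is
neither what Solr implements nor the Mahout class mentioned in this thread;
RecentTimes and its methods are hypothetical, and the nearest-rank percentile
definition is an assumption made for simplicity.

```java
import java.util.Arrays;

// Keeps the most recent `capacity` request times and reports percentiles over them.
class RecentTimes {
    private final long[] ring;
    private int count = 0; // number of valid samples, at most ring.length
    private int next = 0;  // slot the next sample overwrites

    RecentTimes(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be > 0");
        }
        ring = new long[capacity];
    }

    void record(long millis) {
        ring[next] = millis;
        next = (next + 1) % ring.length;
        if (count < ring.length) {
            count++;
        }
    }

    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    long percentile(double p) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = Arrays.copyOf(ring, count); // valid samples occupy slots 0..count-1
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * count / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        RecentTimes times = new RecentTimes(4);
        // The slow first query falls out of the window once four newer samples arrive.
        for (long ms : new long[] {5000, 10, 20, 30, 40}) {
            times.record(ms);
        }
        System.out.println("median=" + times.percentile(50) + " p99=" + times.percentile(99));
    }
}
```

Memory use is fixed by the capacity, which matches the idea of a configurable
count such as 1024 or 4096. The trade-off is a sort per percentile call; a
streaming estimator avoids that at the cost of exactness.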


[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540200#comment-13540200
 ] 

Lance Norskog commented on SOLR-1972:
-

The 25/75 values come from weights, and can be changed to 99/95. I have a patch 
for that but never submitted it.

 Need additional query stats in admin interface - median, 95th and 99th 
 percentile
 -

 Key: SOLR-1972
 URL: https://issues.apache.org/jira/browse/SOLR-1972
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Affects Versions: 1.4
Reporter: Shawn Heisey
Assignee: Alan Woodward
Priority: Minor
 Fix For: 4.2, 5.0

 Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
 elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, leak-closeable.patch, 
 leak.patch, revert-SOLR-1972.patch, SOLR-1972-branch3x-url_pattern.patch, 
 SOLR-1972-branch4x.patch, SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, solr1972-metricsregistry-branch4x-failure.log, 
 SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, 
 SOLR-1972-url_pattern.patch, stacktraces.tar.gz



[jira] [Commented] (LUCENE-3413) CombiningFilter to recombine tokens into a single token for sorting

2012-12-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540206#comment-13540206
 ] 

Lance Norskog commented on LUCENE-3413:
---

For sorting, would you want 'grapes_of_wrath'? This distinguishes the word 
'grapes' from words that might start with 'grapes'. (I don't know of any, but 
you see the problem :)

Also, in this use case numerical canonicalization makes sense for searching and 
sorting. 'Twenty-two' -> 22, and also 'twenty two' -> 22. Or maybe 'twenty two' 
-> 'twenty-two'.
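
A small sketch of why the separator matters for sort order. Everything here is
illustrative: SortKeyDemo and sortKey are hypothetical names, and the second
title is made up to give a word that starts with 'grapes'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortKeyDemo {
    // Join tokens into a single sort key with the given separator ("" for none).
    static String sortKey(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("grapes", "of", "wrath");
        List<String> b = Arrays.asList("grapeseed", "oil"); // made-up second title
        for (String sep : new String[] {"", "_"}) {
            List<String> keys = new ArrayList<>();
            keys.add(sortKey(a, sep));
            keys.add(sortKey(b, sep));
            Collections.sort(keys);
            System.out.println("separator '" + sep + "': " + keys);
        }
    }
}
```

Without a separator, 'grapesofwrath' sorts after 'grapeseedoil' because the
comparison reaches 'o' versus 'e'. With '_' the word boundary takes part in the
comparison, and since '_' sorts before lowercase letters in ASCII, the shorter
word 'grapes' comes first.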



 CombiningFilter to recombine tokens into a single token for sorting
 ---

 Key: LUCENE-3413
 URL: https://issues.apache.org/jira/browse/LUCENE-3413
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 2.9.3
Reporter: Chris A. Mattmann
Priority: Minor
 Attachments: LUCENE-3413.Mattmann.090311.patch.txt, 
 LUCENE-3413.Mattmann.090511.patch.txt


 I whipped up this CombiningFilter for the following use case:
 I've got a bunch of titles of e.g., Books, such as:
 The Grapes of Wrath
 Tommy Tommerson saves the World
 Top of the World
 The Tales of Beedle the Bard
 Born Free
 etc.
 I want to sort these titles using a String field that includes stopword 
 analysis (e.g., to remove The), and synonym filtering (e.g., for grouping), 
 etc. I created an analysis chain in Solr for this that was based off of 
 *alphaOnlySort*, which looks like this:
 {code:xml}
 <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
   <analyzer>
     <!-- KeywordTokenizer does no actual tokenizing, so the entire
          input string is preserved as a single token
       -->
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <!-- The LowerCase TokenFilter does what you expect, which can be
          when you want your sorting to be case insensitive
       -->
     <filter class="solr.LowerCaseFilterFactory" />
     <!-- The TrimFilter removes any leading or trailing whitespace -->
     <filter class="solr.TrimFilterFactory" />
     <!-- The PatternReplaceFilter gives you the flexibility to use
          Java Regular expression to replace any sequence of characters
          matching a pattern with an arbitrary replacement string,
          which may include back references to portions of the original
          string matched by the pattern.

          See the Java Regular Expression documentation for more
          information on pattern and replacement string syntax.

          http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
       -->
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all"
     />
   </analyzer>
 </fieldType>
 {code}
 The issue with alphaOnlySort is that it doesn't support stopword removal or 
 synonyms because those are based on the original token level instead of the 
 full strings produced by the KeywordTokenizer (which does not do 
 tokenization). I needed a filter that would allow me to change alphaOnlySort 
 and its analysis chain from using KeywordTokenizer to using 
 WhitespaceTokenizer, and then a way to recombine the tokens at the end. So, 
 take The Grapes of Wrath. I needed a way for it to get turned into:
 {noformat}
 grapes of wrath
 {noformat}
 And then to combine those tokens into a single token:
 {noformat}
 grapesofwrath
 {noformat}
 The attached CombiningFilter takes care of that. It doesn't do it super 
 efficiently I'm guessing (since I used a StringBuffer), but I'm open to 
 suggestions on how to make it better. 
 One other thing is that apparently this analyzer works fine for analysis 
 (e.g., it produces the desired tokens), however, for sorting in Solr I'm 
 getting null sort tokens. Need to figure out why. 
 Here ya go!
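
For readers without the attachment, the chain described above (split on
whitespace, lowercase, drop stopwords, recombine) can be sketched in plain
Java. This is not the attached CombiningFilter and does not use the Lucene
analysis API; CombineTokens and combine are hypothetical, and the one-word
stopword list is chosen only so the output matches the example in this
description. It uses StringBuilder, the unsynchronized counterpart of
StringBuffer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class CombineTokens {
    // Lowercase, split on whitespace, drop stopwords, then concatenate what is left.
    static String combine(String title, Set<String> stopwords) {
        StringBuilder sb = new StringBuilder(title.length());
        for (String token : title.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue; // skip stopwords and the empty token produced by blank input
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the")); // illustrative stopword list
        System.out.println(combine("The Grapes of Wrath", stop)); // prints "grapesofwrath"
    }
}
```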


[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-12-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539779#comment-13539779
 ] 

Lance Norskog commented on SOLR-1972:
-

The OnlineSummarizer class in Mahout does the calculations you want. One little 
class you can steal. No dependencies necessary.

 Need additional query stats in admin interface - median, 95th and 99th 
 percentile
 -

 Key: SOLR-1972
 URL: https://issues.apache.org/jira/browse/SOLR-1972
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Affects Versions: 1.4
Reporter: Shawn Heisey
Assignee: Alan Woodward
Priority: Minor
 Fix For: 4.1, 5.0

 Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
 elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, leak-closeable.patch, 
 leak.patch, SOLR-1972-branch3x-url_pattern.patch, SOLR-1972-branch4x.patch, 
 SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 solr1972-metricsregistry-branch4x-failure.log, SOLR-1972.patch, 
 SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, 
 SOLR-1972-url_pattern.patch, stacktraces.tar.gz



[jira] [Commented] (SOLR-4164) Result Grouping fails if no hits

2012-12-16 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13533675#comment-13533675
 ] 

Lance Norskog commented on SOLR-4164:
-

I can't recreate it. It may have been another problem I was having: a shard 
server ran out of memory during the query and threw an exception to the 
distributor. Maybe the group query collection code ignores these remote 
exceptions?

 Result Grouping fails if no hits
 

 Key: SOLR-4164
 URL: https://issues.apache.org/jira/browse/SOLR-4164
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other, SolrCloud
Affects Versions: 4.0
Reporter: Lance Norskog

 In SolrCloud, found a result grouping bug in the 4.0 release.
 A distributed result grouping request under SolrCloud got this result:
 {noformat}
 Dec 10, 2012 10:32:07 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.IllegalArgumentException: numHits must be > 0; please use TotalHitCountCollector if you just need the total hit count
     at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1120)
     at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1069)
     at org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector.<init>(AbstractSecondPassGroupingCollector.java:75)
     at org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector.<init>(TermSecondPassGroupingCollector.java:49)
     at org.apache.solr.search.grouping.distributed.command.TopGroupsFieldCommand.create(TopGroupsFieldCommand.java:128)
     at org.apache.solr.search.grouping.CommandHandler.execute(CommandHandler.java:132)
     at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:339)
     at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
 {noformat}


[jira] [Created] (SOLR-4164) Result Grouping fails if no hits

2012-12-10 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4164:
---

 Summary: Result Grouping fails if no hits
 Key: SOLR-4164
 URL: https://issues.apache.org/jira/browse/SOLR-4164
 Project: Solr
  Issue Type: Bug
  Components: SearchComponents - other, SolrCloud
Affects Versions: 4.0
Reporter: Lance Norskog


In SolrCloud, found a result grouping bug in the 4.0 release.
A distributed result grouping request under SolrCloud got this result:

{noformat}
Dec 10, 2012 10:32:07 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.IllegalArgumentException: numHits must be > 0; please use TotalHitCountCollector if you just need the total hit count
    at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1120)
    at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:1069)
    at org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector.<init>(AbstractSecondPassGroupingCollector.java:75)
    at org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector.<init>(TermSecondPassGroupingCollector.java:49)
    at org.apache.solr.search.grouping.distributed.command.TopGroupsFieldCommand.create(TopGroupsFieldCommand.java:128)
    at org.apache.solr.search.grouping.CommandHandler.execute(CommandHandler.java:132)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:339)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:206)
{noformat}


[jira] [Created] (SOLR-4150) NPE in distributed result grouping if group.query has no results

2012-12-05 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4150:
---

 Summary: NPE in distributed result grouping if group.query has no 
results
 Key: SOLR-4150
 URL: https://issues.apache.org/jira/browse/SOLR-4150
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud
Affects Versions: 4.0
Reporter: Lance Norskog


If group.query has no results in a distributed search, there is an NPE in the 
front-end:
{noformat}
Dec 5, 2012 10:40:31 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/select params={debugQuery=true&group.ngroups=true&fl=thing,eventid&indent=true&q=thing:(CODE:20517)&group.field=eventid&group.query=thing:CODE*&group=true&wt=json&fq=source:somewhere} status=500 QTime=745
Dec 5, 2012 10:40:31 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.NullPointerException
    at org.apache.solr.search.grouping.distributed.shardresultserializer.TopGroupsResultTransformer.transformToNative(TopGroupsResultTransformer.java:110)
    at org.apache.solr.search.grouping.distributed.responseprocessor.TopGroupsShardResponseProcessor.process(TopGroupsShardResponseProcessor.java:80)
    at org.apache.solr.handler.component.QueryComponent.handleGroupedResponses(QueryComponent.java:620)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:603)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:309)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
    at org.eclipse.jetty.server.Server.handle(Server.java:351)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
at java.lang.Thread.run(Thread.java:662)
{noformat}

(This is in sharding, maybe not a SolrCloud problem.)

[jira] [Commented] (SOLR-4041) Allow segment merge monitoring in Solr Admin gui

2012-11-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505129#comment-13505129
 ] 

Lance Norskog commented on SOLR-4041:
-

Cool! I have done monitoring of segment sizes with fixed-time polling, and 
post-commit polling of the data/index directory. This makes it easier to chart 
other aspects of merging. Another useful number is the current number of 
segments.
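
A minimal sketch of the directory-polling approach described above, assuming per-segment files are named like `_0.fdt` or `_0_Lucene40_0.tim` and that the index lives under `data/index`. The class and method names are hypothetical and are not taken from the attached patch; files seen on disk may also include segments that are still being merged.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

/** Polls a Lucene index directory once and sums file sizes per segment. */
final class SegmentSizePoller {

    /**
     * Returns the segment name for an index file, or null for files that do
     * not belong to a single segment (for example segments_N or write.lock).
     * Assumption: per-segment files look like _0.fdt or _0_Lucene40_0.tim.
     */
    static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = 1;
        while (end < fileName.length()
                && fileName.charAt(end) != '.'
                && fileName.charAt(end) != '_') {
            end++;
        }
        return end > 1 ? fileName.substring(0, end) : null;
    }

    /** Sums the sizes in bytes of all files of each segment in indexDir. */
    static Map<String, Long> segmentSizes(Path indexDir) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                String segment = segmentOf(file.getFileName().toString());
                if (segment != null && Files.isRegularFile(file)) {
                    sizes.merge(segment, Files.size(file), Long::sum);
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // The default path is an assumption; pass the real index directory as an argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "data/index");
        Map<String, Long> sizes = segmentSizes(indexDir);
        System.out.println("segments: " + sizes.size());
        sizes.forEach((name, bytes) -> System.out.println(name + "\t" + bytes));
    }
}
```

Running this on a fixed timer or after each commit gives both the per-segment sizes and the current segment count as `sizes.size()`.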

 Allow segment merge monitoring in Solr Admin gui
 

 Key: SOLR-4041
 URL: https://issues.apache.org/jira/browse/SOLR-4041
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Radim Kolar
Assignee: Mark Miller
Priority: Minor
  Labels: patch
 Fix For: 4.1, 5.0

 Attachments: solr-monitormerge.txt


 add solrMbean for ConcurrentMergeScheduler

[jira] [Commented] (SOLR-1306) Support pluggable persistence/loading of solr.xml details

2012-11-13 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496844#comment-13496844
 ] 

Lance Norskog commented on SOLR-1306:
-

bq.  think we should drop the top level config (eg solr.xml). Instead, we 
should auto load folders 
+1 

There are often groups of cores with the same schema- shards in the same solr, 
for example. How would this dynamic discovery support groups of collections?



 Support pluggable persistence/loading of solr.xml details
 -

 Key: SOLR-1306
 URL: https://issues.apache.org/jira/browse/SOLR-1306
 Project: Solr
  Issue Type: New Feature
  Components: multicore
Reporter: Noble Paul
Assignee: Erick Erickson
 Fix For: 4.1

 Attachments: SOLR-1306.patch, SOLR-1306.patch, SOLR-1306.patch, 
 SOLR-1306.patch


 Persisting and loading details from one XML file is fine if the number of 
 cores is small and fixed. If there are tens of thousands of cores on a 
 single box, adding a new core (with persistent=true) becomes very expensive 
 because every core creation has to rewrite this huge XML file. 
 Moreover, there is a good chance that the file gets corrupted and all the 
 cores become unusable. In that case I would prefer it to be stored in a 
 centralized DB which is backed up/replicated, so that all the information is 
 available in a centralized location. 
 We may need to refactor CoreContainer to have a pluggable implementation 
 which can load/persist the details. The default implementation should 
 read from and write to solr.xml. The class should be pluggable as follows in 
 solr.xml:
 {code:xml}
 <solr>
   <dataProvider class="com.foo.FooDataProvider" attr1="val1" attr2="val2"/>
 </solr>
 {code}
 There will be a new interface (or abstract class ) called SolrDataProvider 
 which this class must implement
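
A hedged sketch of what such a pluggable provider could look like. The issue only names `SolrDataProvider`; the method names (`init`, `loadCores`, `persistCore`), the in-memory implementation, and the reflective loader below are all hypothetical and only illustrate the idea of instantiating the class named in solr.xml and passing it the extra attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract; the issue names the interface but not its methods. */
interface SolrDataProvider {
    /** Receives the extra attributes from the dataProvider element (attr1, attr2, ...). */
    void init(Map<String, String> attrs);

    /** Returns the properties of every known core, keyed by core name. */
    Map<String, Map<String, String>> loadCores();

    /** Stores or replaces one core without rewriting the others. */
    void persistCore(String name, Map<String, String> props);
}

/** Toy implementation that keeps everything in memory (stand-in for a DB-backed one). */
final class InMemoryDataProvider implements SolrDataProvider {
    private final Map<String, Map<String, String>> cores = new LinkedHashMap<>();
    private Map<String, String> attrs = new LinkedHashMap<>();

    @Override
    public void init(Map<String, String> attrs) {
        this.attrs = new LinkedHashMap<>(attrs);
    }

    @Override
    public Map<String, Map<String, String>> loadCores() {
        return new LinkedHashMap<>(cores);
    }

    @Override
    public void persistCore(String name, Map<String, String> props) {
        cores.put(name, new LinkedHashMap<>(props));
    }
}

/** Instantiates the provider class named in the configuration. */
final class DataProviderLoader {
    static SolrDataProvider load(String className, Map<String, String> attrs)
            throws ReflectiveOperationException {
        Class<? extends SolrDataProvider> type =
                Class.forName(className).asSubclass(SolrDataProvider.class);
        SolrDataProvider provider = type.getDeclaredConstructor().newInstance();
        provider.init(attrs);
        return provider;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("attr1", "val1");
        SolrDataProvider provider = load(InMemoryDataProvider.class.getName(), attrs);
        Map<String, String> props = new LinkedHashMap<>();
        props.put("instanceDir", "core0/");
        provider.persistCore("core0", props);
        System.out.println(provider.loadCores());
    }
}
```

The point of the per-core `persistCore` call in this sketch is that adding one core touches only that core's record, which is the cost problem the description raises for a single large solr.xml.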

[jira] [Commented] (SOLR-4007) Morfologik dictionaries not available in Solr field type

2012-10-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487627#comment-13487627
 ] 

Lance Norskog commented on SOLR-4007:
-

What is the change? I would like to change my OpenNLP patch to work in the same 
directory/jar structure.

 Morfologik dictionaries not available in Solr field type
 

 Key: SOLR-4007
 URL: https://issues.apache.org/jira/browse/SOLR-4007
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 4.1
Reporter: Lance Norskog
Assignee: Dawid Weiss
Priority: Minor
 Fix For: 4.1


 The Polish Morfologik type does not find its dictionaries when used in Solr. 
 To demonstrate:
 1) Add this to example/solr/collection1/conf/schema.xml:
 {noformat}
 <!-- Polish -->
 <fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.MorfologikFilterFactory" dictionary="MORFOLOGIK"/>
   </analyzer>
 </fieldType>
 {noformat}
 2) Add this to example/solr/collection1/conf/solrconfig.xml:
 {noformat}
   <lib dir="../../../../lucene/build/analysis/morfologik/" regex=".*\.jar" />
   <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
   <lib dir="../../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar" />
 {noformat}
 3) Test 'text_pl' in the analysis page. You will get an exception.
 {noformat}
 Oct 28, 2012 8:27:19 PM org.apache.solr.core.SolrCore execute
 INFO: [collection1] webapp=/solr path=/analysis/field 
 params={analysis.showmatch=true&analysis.query=&wt=json&analysis.fieldvalue=blah+blah&analysis.fieldtype=text_pl}
  status=500 QTime=26 
 Oct 28, 2012 8:27:19 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.RuntimeException: Default dictionary resource for 
 language 'plnot found.
   at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:163)
   at morfologik.stemming.PolishStemmer.<init>(PolishStemmer.java:64)
   at 
 org.apache.lucene.analysis.morfologik.MorfologikFilter.<init>(MorfologikFilter.java:70)
   at 
 org.apache.lucene.analysis.morfologik.MorfologikFilterFactory.create(MorfologikFilterFactory.java:63)
   at 
 org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:125)
   at 
 org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
   at 
 org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
   at 
 org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:100)
   at 
 [...]
 Caused by: java.io.IOException: Could not locate resource: 
 morfologik/dictionaries/pl.dict
   at morfologik.util.ResourceUtils.openInputStream(ResourceUtils.java:56)
   at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:156)
   ... 38 more
 {noformat}
 {{morfologik-polish-1.5.3.jar}} has {{morfologik/dictionaries/pl.dict}}.

[jira] [Commented] (SOLR-1487) Add expungeDelete to SolrJ's SolrServer.commit(..)

2012-10-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488406#comment-13488406
 ] 

Lance Norskog commented on SOLR-1487:
-

a) They both require changes to the same path, so should probably be in one 
commit.
b) SOLR-3938 has a unit test, while this does not. It is really easy for this 
kind of feature to stop working. The SolrJ code paths for 
commit/rollback/prepareCommit etc. need unit tests. 



 Add  expungeDelete to SolrJ's SolrServer.commit(..)
 ---

 Key: SOLR-1487
 URL: https://issues.apache.org/jira/browse/SOLR-1487
 Project: Solr
  Issue Type: Improvement
  Components: clients - java
Affects Versions: 1.3
 Environment: N/A
Reporter: Jibo John
 Attachments: expunge-patch.txt


 Add  expungeDelete to SolrJ's SolrServer.commit(..).
 Currently, this can be done only through the update handler (curl update -F 
 stream.body='<commit expungeDeletes="true"/>').

[jira] [Commented] (SOLR-1487) Add expungeDelete to SolrJ's SolrServer.commit(..)

2012-10-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487421#comment-13487421
 ] 

Lance Norskog commented on SOLR-1487:
-

[SOLR-3938]-unit.patch adds prepareCommit().

A question to the experts: what is a good unit test to enhance for this? It 
needs to check numDocs vs. maxDoc, so the test would be one that adds docs and 
then reads back the stats.


 Add  expungeDelete to SolrJ's SolrServer.commit(..)
 ---

 Key: SOLR-1487
 URL: https://issues.apache.org/jira/browse/SOLR-1487
 Project: Solr
  Issue Type: Improvement
  Components: clients - java
Affects Versions: 1.3
 Environment: N/A
Reporter: Jibo John
 Attachments: expunge-patch.txt


 Add  expungeDelete to SolrJ's SolrServer.commit(..).
 Currently, this can be done only through the update handler (curl update -F 
 stream.body='<commit expungeDeletes="true"/>').

[jira] [Commented] (SOLR-1972) Need additional query stats in admin interface - median, 95th and 99th percentile

2012-10-29 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486272#comment-13486272
 ] 

Lance Norskog commented on SOLR-1972:
-

The 5th percentile is really useful. There is always a maximum query time of 
30s just because of a garbage collection failure, and people look at that 
number and freak out. For query times, the 5th percentile shows what is 
repeatedly too slow. 
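
A minimal sketch of percentile reporting over a bounded window of recent query times, in the spirit of the issue's request for a median, 95th and 99th percentile with a configurable sample count. The class name, the ring-buffer design and the nearest-rank definition are assumptions for illustration; this is not the code in the attached patches.

```java
import java.util.Arrays;

/** Keeps the most recent query times in a fixed-size ring buffer. */
final class RecentQueryTimes {
    private final long[] samples;
    private int next;   // slot that the next sample overwrites
    private int count;  // number of valid samples, at most samples.length

    RecentQueryTimes(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        samples = new long[capacity];
    }

    /** Records one query time, dropping the oldest sample once the buffer is full. */
    synchronized void record(long millis) {
        samples[next] = millis;
        next = (next + 1) % samples.length;
        if (count < samples.length) {
            count++;
        }
    }

    /** Nearest-rank percentile for p in (0, 100]; returns -1 when no samples exist. */
    synchronized long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        if (count == 0) {
            return -1;
        }
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * count / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        RecentQueryTimes times = new RecentQueryTimes(1024);
        times.record(30_000);                 // one pause-induced outlier
        for (int i = 0; i < 500; i++) {
            times.record(20 + (i % 50));      // typical queries
        }
        System.out.println("median = " + times.percentile(50));
        System.out.println("p95    = " + times.percentile(95));
        System.out.println("p99    = " + times.percentile(99));
        System.out.println("max    = " + times.percentile(100));
    }
}
```

With a window like this, a single 30s outlier shows up in the maximum but barely moves the median or the high percentiles, and it ages out of the window after `capacity` further queries.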

 Need additional query stats in admin interface - median, 95th and 99th 
 percentile
 -

 Key: SOLR-1972
 URL: https://issues.apache.org/jira/browse/SOLR-1972
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Affects Versions: 1.4
Reporter: Shawn Heisey
Assignee: Alan Woodward
Priority: Minor
 Fix For: 4.1

 Attachments: elyograg-1972-3.2.patch, elyograg-1972-3.2.patch, 
 elyograg-1972-trunk.patch, elyograg-1972-trunk.patch, 
 SOLR-1972-branch3x-url_pattern.patch, SOLR-1972-branch4x.patch, 
 SOLR-1972-branch4x.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 SOLR-1972_metrics.patch, SOLR-1972_metrics.patch, 
 solr1972-metricsregistry-branch4x-failure.log, SOLR-1972.patch, 
 SOLR-1972.patch, SOLR-1972.patch, SOLR-1972.patch, SOLR-1972-url_pattern.patch


 I would like to see more detailed query statistics from the admin GUI.  This 
 is what you can get now:
 requests : 809
 errors : 0
 timeouts : 0
 totalTime : 70053
 avgTimePerRequest : 86.59209
 avgRequestsPerSecond : 0.8148785 
 I'd like to see more data on the time per request - median, 95th percentile, 
 99th percentile, and any other statistical function that makes sense to 
 include.  In my environment, the first bunch of queries after startup tend to 
 take several seconds each.  I find that the average value tends to be useless 
 until it has several thousand queries under its belt and the caches are 
 thoroughly warmed.  The statistical functions I have mentioned would quickly 
 eliminate the influence of those initial slow queries.
 The system will have to store individual data about each query.  I don't know 
 if this is something Solr does already.  It would be nice to have a 
 configurable count of how many of the most recent data points are kept, to 
 control the amount of memory the feature uses.  The default value could be 
 something like 1024 or 4096.

[jira] [Created] (SOLR-4007) Morfologik dictionaries not available in Solr field type

2012-10-28 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-4007:
---

 Summary: Morfologik dictionaries not available in Solr field type
 Key: SOLR-4007
 URL: https://issues.apache.org/jira/browse/SOLR-4007
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 4.1
Reporter: Lance Norskog
Priority: Minor


The Polish Morfologik type does not find its dictionaries when used in Solr. To 
demonstrate:

1) Add this to example/solr/collection1/conf/schema.xml:
{noformat}
<!-- Polish -->
<fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.MorfologikFilterFactory" dictionary="MORFOLOGIK"/>
  </analyzer>
</fieldType>
{noformat}

2) Add this to example/solr/collection1/conf/solrconfig.xml:

{noformat}
  <lib dir="../../../../lucene/build/analysis/morfologik/" regex=".*\.jar" />
  <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar" />
{noformat}

3) Test 'text_pl' in the analysis page. You will get an exception.
{noformat}
Oct 28, 2012 8:27:19 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/analysis/field 
params={analysis.showmatch=true&analysis.query=&wt=json&analysis.fieldvalue=blah+blah&analysis.fieldtype=text_pl}
 status=500 QTime=26 
Oct 28, 2012 8:27:19 PM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.RuntimeException: Default dictionary resource for 
language 'plnot found.
at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:163)
at morfologik.stemming.PolishStemmer.<init>(PolishStemmer.java:64)
at 
org.apache.lucene.analysis.morfologik.MorfologikFilter.<init>(MorfologikFilter.java:70)
at 
org.apache.lucene.analysis.morfologik.MorfologikFilterFactory.create(MorfologikFilterFactory.java:63)
at 
org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:125)
at 
org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:220)
at 
org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:181)
at 
org.apache.solr.handler.FieldAnalysisRequestHandler.doAnalysis(FieldAnalysisRequestHandler.java:100)
at 

[...]

Caused by: java.io.IOException: Could not locate resource: 
morfologik/dictionaries/pl.dict
at morfologik.util.ResourceUtils.openInputStream(ResourceUtils.java:56)
at morfologik.stemming.Dictionary.getForLanguage(Dictionary.java:156)
... 38 more

{noformat}

{{morfologik-polish-1.5.3.jar}} has {{morfologik/dictionaries/pl.dict}}.
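
The report does not state the root cause. One plausible mechanism, sketched below under that assumption, is class loader visibility: a resource that sits in a jar known only to a child class loader is invisible to a lookup that goes through the parent. The demo uses a temporary directory with an empty stand-in file instead of the real jar, and all class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

/** Shows that a resource is only found through a loader that can see its location. */
final class DictionaryLookupDemo {
    static final String RESOURCE = "morfologik/dictionaries/pl.dict";

    /** True if the given loader (or one of its parents) can locate the resource. */
    static boolean isVisible(ClassLoader loader, String resource) {
        return loader.getResource(resource) != null;
    }

    /** Creates a throwaway directory holding an empty stand-in for the dictionary. */
    static Path fakeDictionaryRoot() throws IOException {
        Path root = Files.createTempDirectory("morfologik-demo");
        Path dict = root.resolve(RESOURCE);
        Files.createDirectories(dict.getParent());
        Files.createFile(dict);
        return root;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader parent = DictionaryLookupDemo.class.getClassLoader();
        Path root = fakeDictionaryRoot();
        // The child loader plays the role of a loader built from extra lib directories.
        try (URLClassLoader child =
                new URLClassLoader(new URL[] {root.toUri().toURL()}, parent)) {
            System.out.println("parent sees dictionary: " + isVisible(parent, RESOURCE));
            System.out.println("child sees dictionary:  " + isVisible(child, RESOURCE));
        }
    }
}
```

If this is what happens here, the jar can contain the dictionary and the lookup can still fail, depending on which loader the stemmer library uses to open it.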

[jira] [Commented] (SOLR-2216) Highlighter query exceeds maxBooleanClause limit due to range query

2012-10-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485241#comment-13485241
 ] 

Lance Norskog commented on SOLR-2216:
-

Is this still a problem in 3.6, 4.0 or the trunk?

 Highlighter query exceeds maxBooleanClause limit due to range query
 ---

 Key: SOLR-2216
 URL: https://issues.apache.org/jira/browse/SOLR-2216
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.4.1
 Environment: Linux solr-2.bizjournals.int 2.6.18-194.3.1.el5 #1 SMP 
 Thu May 13 13:08:30 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
 java version 1.6.0_21
 Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
 Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
 JAVA_OPTS=-client -Dcom.sun.management.jmxremote=true 
 -Dcom.sun.management.jmxremote.port= 
 -Dcom.sun.management.jmxremote.authenticate=true 
 -Dcom.sun.management.jmxremote.access.file=/root/.jmxaccess 
 -Dcom.sun.management.jmxremote.password.file=/root/.jmxpasswd 
 -Dcom.sun.management.jmxremote.ssl=false -XX:+UseCompressedOops 
 -XX:MaxPermSize=512M -Xms10240M -Xmx15360M -XX:+UseParallelGC 
 -XX:+AggressiveOpts -XX:NewRatio=5
 top - 11:38:49 up 124 days, 22:37,  1 user,  load average: 5.20, 4.35, 3.90
 Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
 Cpu(s): 47.5%us,  2.9%sy,  0.0%ni, 49.5%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:  24679008k total, 18179980k used,  6499028k free,   125424k buffers
 Swap: 26738680k total,29276k used, 26709404k free,  8187444k cached
Reporter: Ken Stanley

 For a full detail of the issue, please see the mailing list: 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201011.mbox/%3CAANLkTimE8z8yOni+u0Nsbgct1=ef7e+su0_waku2c...@mail.gmail.com%3E
 The nutshell version of the issue is that when I have a query that contains 
 ranges on a specific (non-highlighted) field, the highlighter component is 
 attempting to create a query that exceeds the value of maxBooleanClauses set 
 from solrconfig.xml. This is despite my explicit setting of hl.field, 
 hl.requireFieldMatch, and various other highlight options in the query. 
 As suggested by Koji in the follow-up response, I removed the range queries 
 from my main query, and SOLR and highlighting were happy to fulfill my 
 request. It was suggested that if removing the range queries worked that this 
 might potentially be a bug, hence my filing this JIRA ticket. For what it is 
 worth, if I move my range queries into an fq, I do not get the exception 
 about exceeding maxBooleanClauses, and I get the effect that I was looking 
 for. 
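
A toy model of why a range can trip the clause limit, assuming the highlighter expands the range into one boolean clause per indexed term that falls inside it. This is not Lucene or Solr code; the class is hypothetical, and only the default limit of 1024 comes from Solr's stock configuration.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Models a range query that is rewritten into one clause per matching term. */
final class RangeExpansionModel {
    static final int MAX_BOOLEAN_CLAUSES = 1024; // default maxBooleanClauses

    /** Number of indexed terms in [lower, upper], i.e. clauses after expansion. */
    static int clausesFor(NavigableSet<String> indexedTerms, String lower, String upper) {
        return indexedTerms.subSet(lower, true, upper, true).size();
    }

    /** True if the expanded query would need more clauses than the limit allows. */
    static boolean exceedsLimit(NavigableSet<String> indexedTerms, String lower, String upper) {
        return clausesFor(indexedTerms, lower, upper) > MAX_BOOLEAN_CLAUSES;
    }

    public static void main(String[] args) {
        // 2000 distinct zero-padded values stand in for the terms of a range field.
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add(String.format("%04d", i));
        }
        System.out.println("narrow range clauses: " + clausesFor(terms, "0100", "0199"));
        System.out.println("wide range exceeds limit: " + exceedsLimit(terms, "0000", "1999"));
    }
}
```

Under this model the failure depends on how many distinct terms the range covers, not on how many documents match, which is consistent with the workaround of moving the range into an fq that the highlighter does not process.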

[jira] [Updated] (SOLR-3938) prepareCommit command omits commitData

2012-10-26 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3938:


Attachment: SOLR-3938-unit.patch

Add unit test to TestReplicationHandler. This requires solrj support for 
prepareCommit, and thus includes that. 

 prepareCommit command omits commitData
 --

 Key: SOLR-3938
 URL: https://issues.apache.org/jira/browse/SOLR-3938
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.0
Reporter: Yonik Seeley
  Labels: 4.0.1_Candidate
 Fix For: 4.1

 Attachments: SOLR-3938.patch, SOLR-3938-unit.patch


 Solr's prepareCommit doesn't set any commitData, and then when a commit is 
 done, it's too late.

[jira] [Commented] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-25 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484559#comment-13484559
]

Lance Norskog commented on SOLR-3975:
-

It's a first draft, not ready for committing. It needs strategies for
controlling processing time, and code cleanups. I wanted to get it out for
review before sinking even more time into it.

Document Summarization toolkit, using LSA techniques

Key: SOLR-3975
URL: https://issues.apache.org/jira/browse/SOLR-3975
Project: Solr
Issue Type: New Feature
Reporter: Lance Norskog
Priority: Minor
Attachments: 4.1.summary.patch, reuters.sh

This package analyzes sentences and words as used across sentences to rank
the most important sentences and words. The general topic is called document
summarization and is a popular research topic in textual analysis.
How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.
Now go to http://localhost:8983/solr/collection1/browse?summary=true and look
at the large gray box marked 'Document Summary'. This has a table of
statistics about the analysis, the three most important sentences, and
several of the most important words in the documents. The sentences have the
important words in italics.
The code is packaged as a search component and as an analysis handler. The
/browse demo uses the search component, and you can also post raw text to
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
command:
{code}
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'
{code}
This is an implementation of LSA-based document summarization. A short
explanation and a long evaluation are described in my blog, [Uncle Lance's
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here:
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]

[jira] [Created] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-3975:
---

 Summary: Document Summarization toolkit, using LSA techniques
 Key: SOLR-3975
 URL: https://issues.apache.org/jira/browse/SOLR-3975
 Project: Solr
  Issue Type: New Feature
Reporter: Lance Norskog
Priority: Minor
 Attachments: 4.1.summary.patch, reuters.sh

This package analyzes sentences and words as used across sentences to rank the 
most important sentences and words. The general topic is called document 
summarization and is a popular research topic in textual analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
at the large gray box marked 'Document Summary'. This has a table of statistics 
about the analysis, the three most important sentences, and several of the most 
important words in the documents. The sentences have the important tags in 
italics.

The code is packaged as a search component and as an analysis handler. The 
/browse demo uses the search component, and you can also post raw text to  
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
command:
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'

This is an implementation of LSA-based document summarization. A short 
explanation and a long evaluation are described in my blog, [Uncle Lance's 
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
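
As a rough illustration of the general idea, and not the code in the attached patch, the sketch below ranks sentences by their weight in the first right singular vector of a term-by-sentence count matrix. It is a rank-1 simplification: a full LSA summarizer would compute an SVD and use several singular vectors. The tokenization, the power iteration, and all names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Rank-1 LSA-style sentence ranking via power iteration on A^T A. */
final class Rank1LsaSummarizer {

    /** Builds a term-by-sentence count matrix: rows are terms, columns are sentences. */
    static double[][] termSentenceMatrix(List<String> sentences) {
        Map<String, Integer> termIds = new LinkedHashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        for (String sentence : sentences) {
            String[] tokens = sentence.toLowerCase().split("\\W+");
            tokenized.add(tokens);
            for (String token : tokens) {
                if (!token.isEmpty()) {
                    termIds.putIfAbsent(token, termIds.size());
                }
            }
        }
        double[][] a = new double[termIds.size()][sentences.size()];
        for (int j = 0; j < tokenized.size(); j++) {
            for (String token : tokenized.get(j)) {
                if (!token.isEmpty()) {
                    a[termIds.get(token)][j] += 1.0;
                }
            }
        }
        return a;
    }

    /** Approximates the dominant right singular vector of A (one entry per sentence). */
    static double[] dominantSentenceVector(double[][] a, int iterations) {
        int n = a.length == 0 ? 0 : a[0].length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0);
        for (int it = 0; it < iterations; it++) {
            double[] u = new double[a.length];        // u = A v
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < n; j++) {
                    u[i] += a[i][j] * v[j];
                }
            }
            double[] w = new double[n];               // w = A^T u
            for (int i = 0; i < a.length; i++) {
                for (int j = 0; j < n; j++) {
                    w[j] += a[i][j] * u[i];
                }
            }
            double norm = 0.0;
            for (double x : w) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            if (norm == 0.0) {
                return w;
            }
            for (int j = 0; j < n; j++) {
                v[j] = w[j] / norm;
            }
        }
        return v;
    }

    /** Returns sentence indexes ordered from most to least important. */
    static List<Integer> rankSentences(List<String> sentences) {
        double[] v = dominantSentenceVector(termSentenceMatrix(sentences), 100);
        List<Integer> order = new ArrayList<>();
        for (int j = 0; j < v.length; j++) {
            order.add(j);
        }
        order.sort(Comparator.comparingDouble((Integer j) -> -Math.abs(v[j])));
        return order;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Solr indexes documents.",
                "Solr searches documents quickly.",
                "Bananas are yellow.");
        for (int index : rankSentences(sentences)) {
            System.out.println(sentences.get(index));
        }
    }
}
```

Sentences that share vocabulary with many other sentences get large weights in this vector, while a sentence on an unrelated topic gets a weight near zero.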



[jira] [Updated] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3975:


Attachment: reuters.sh
4.1.summary.patch

 Document Summarization toolkit, using LSA techniques
 

 Key: SOLR-3975
 URL: https://issues.apache.org/jira/browse/SOLR-3975
 Project: Solr
  Issue Type: New Feature
Reporter: Lance Norskog
Priority: Minor
 Attachments: 4.1.summary.patch, reuters.sh


 This package analyzes sentences and words as used across sentences to rank 
 the most important sentences and words. The general topic is called document 
 summarization and is a popular research topic in textual analysis. 
 How to use:
 1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
 instance.
 2) Download the first Reuters article corpus from:
 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
 3) Unpack this into a directory.
 4) Run the attached 'reuters.sh' script:
 sh reuters.sh directory http://localhost:8983/solr/collection1
 5) Wait several minutes.
 Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
 at the large gray box marked 'Document Summary'. This has a table of 
 statistics about the analysis, the three most important sentences, and 
 several of the most important words in the documents. The sentences have the 
 important tags in italics.
 The code is packaged as a search component and as an analysis handler. The 
 /browse demo uses the search component, and you can also post raw text to  
 http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
 command:
 curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
   --data-binary @$FILE -H 'Content-type:application/xml'
 This is an implementation of LSA-based document summarization. A short 
 explanation and a long evaluation are described in my blog, [Uncle Lance's 
 Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
 [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]

[jira] [Updated] (SOLR-3975) Document Summarization toolkit, using LSA techniques

2012-10-22 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Lance Norskog updated SOLR-3975:

Description:
This package analyzes sentences and words as used across sentences to rank the
most important sentences and words. The general topic is called document
summarization and is a popular research topic in textual analysis.

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look
at the large gray box marked 'Document Summary'. This has a table of statistics
about the analysis, the three most important sentences, and several of the most
important words in the documents. The sentences have the important words in
italics.

The code is packaged as a search component and as an analysis handler. The
/browse demo uses the search component, and you can also post raw text to
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
command:
{code}
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'
{code}

This is an implementation of LSA-based document summarization. A short
explanation and a long evaluation are described in my blog, [Uncle Lance's
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here:
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]

was:
This package analyzes sentences and words as used across sentences to rank the
most important sentences and words. The general topic is called document
summarization and is a popular research topic in textual analysis.

The code is packaged as a search component and as an analysis handler. The
/browse demo uses the search component, and you can also post raw text to
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
command:
curl -s \
  "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'

Document Summarization toolkit, using LSA techniques

Key: SOLR-3975
URL: https://issues.apache.org/jira/browse/SOLR-3975
Project: Solr
Issue Type: New Feature
Reporter: Lance Norskog
Priority: Minor
Attachments: 4.1.summary.patch, reuters.sh

[jira] [Commented] (LUCENE-4494) Add phoenetic algorithm Match Rating approach to lucene

2012-10-19 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480615#comment-13480615
 ] 

Lance Norskog commented on LUCENE-4494:
---

Cool! Is it this algorithm? 
[http://en.wikipedia.org/wiki/Match_rating_approach]



 Add phoenetic algorithm Match Rating approach to lucene
 ---

 Key: LUCENE-4494
 URL: https://issues.apache.org/jira/browse/LUCENE-4494
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Colm Rice
Priority: Minor
 Fix For: 4.1

   Original Estimate: 168h
  Remaining Estimate: 168h

 I want to add MatchRatingApproach algorithm to the Lucene project. 
 What I have at the moment is a class called 
 org.apache.lucene.analysis.phoenetic.MatchRatingApproach implementing 
 StringEncoder
 I have a pretty comprehensive test file located at: 
 org.apache.lucene.analysis.phonetic.MatchRatingApproachTests
 It doesn't exactly follow an existing pattern, so I'm going to need a bit of advice here. 
 Thanks! Feel free to email.
 FYI: It's my first contribution, so be gentle :-) C# is my native language.
 Reference: http://en.wikipedia.org/wiki/Match_rating_approach
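
 For readers unfamiliar with the encoding, the following is a minimal Python sketch of the three encoding rules as summarized on the Wikipedia page referenced above: drop vowels that do not start the name, collapse doubled consonants, and keep only the first three and last three letters of long codes. It is not the MatchRatingApproach class proposed in this issue. The function name `mra_encode` is hypothetical, and the comparison (rating) half of the algorithm is left out.

```python
def mra_encode(name):
    """Encode a name with the Match Rating Approach encoding rules (sketch)."""
    # Work on upper-case letters only; this normalization is an assumption.
    letters = "".join(c for c in name.upper() if c.isalpha())
    if not letters:
        return ""
    # Rule 1: delete all vowels unless the vowel begins the word.
    kept = letters[0] + "".join(c for c in letters[1:] if c not in "AEIOU")
    # Rule 2: remove the second letter of any doubled pair.
    collapsed = [kept[0]]
    for c in kept[1:]:
        if c != collapsed[-1]:
            collapsed.append(c)
    code = "".join(collapsed)
    # Rule 3: reduce codes longer than six letters to first 3 + last 3.
    if len(code) > 6:
        code = code[:3] + code[-3:]
    return code


print(mra_encode("Byrne"))      # BYRN
print(mra_encode("Catherine"))  # CTHRN
```

 Note that Y is treated as a consonant here, which is why "Byrne" keeps its Y.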

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471117#comment-13471117
]

Lance Norskog commented on LUCENE-3922:
---

bq. On the other hand, I agree with Christian about not preserving leading zeros.
So, ◯◯七 doesn't need to become 007.
This example shows why leading zeros should be preserved :)

There are different kinds of text search. Searching for media titles like James
Bond movies is a very different thing from searching newspaper articles. You
might want to find ◯◯七 as the Japanese-language release and 007 as the
English-language release. These numbers are brands, not numbers.

Add Japanese Kanji number normalization to Kuromoji
---

Key: LUCENE-3922
URL: https://issues.apache.org/jira/browse/LUCENE-3922
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
Labels: features
Attachments: LUCENE-3922.patch

Japanese people use Kanji numerals instead of Arabic numerals for writing
prices, addresses, and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho), and
十二月 (December). So, we would like to normalize those Kanji numerals to Arabic
numerals (I don't think we need to have a capability to normalize to Kanji
numerals).
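
To make the requested normalization concrete, here is a small Python sketch that converts one numeral string (Kanji digits, Arabic digits, or a mix such as 12万4800) to an integer. It is an illustration only and not the approach taken in the attached patch; the function name `kanji_to_int` and the limited set of supported units (十, 百, 千, 万, 億) are assumptions. It also does not decide which characters in running text are numerals, which matters for place names like 二番町.

```python
KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}


def kanji_to_int(text):
    """Convert a Kanji/Arabic numeral string to an int (sketch)."""
    total = 0      # value of completed 万/億 groups
    section = 0    # value accumulated below the next large unit
    number = None  # digit run currently being read, if any
    for ch in text:
        if ch in "0123456789":
            number = (number or 0) * 10 + int(ch)
        elif ch in KANJI_DIGITS:
            number = (number or 0) * 10 + KANJI_DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (1 if number is None else number) * SMALL_UNITS[ch]
            number = None
        elif ch in LARGE_UNITS:
            total += (section + (number or 0)) * LARGE_UNITS[ch]
            section, number = 0, None
        else:
            raise ValueError("not a numeral character: %r" % ch)
    return total + section + (number or 0)


print(kanji_to_int("12万4800"))  # 124800
print(kanji_to_int("十二"))      # 12
```

Because the result is an integer, leading zeros are dropped, which is the behavior debated in the comments on this issue.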


[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121
]

Lance Norskog commented on LUCENE-3921:
---

Another way to look at this is that Smart Chinese and Kuromoji are systems for
minimizing bogus bigrams. This allows phrase queries to function without
finding bogus results. The CJK bigram creator generates bogus bigrams, which
cause phrase queries to find bogus results. [SOLR-3653] is the result of my
experience in supporting searching Chinese legal documents. I have some useful
numbers at the end of the page.
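
To illustrate what a "bogus bigram" is, here is a minimal Python sketch of overlapping bigram generation, which is the general idea behind a CJK bigram analyzer. It is a simplification and not Lucene's actual filter; the function name `overlapping_bigrams` is hypothetical, and Latin letters stand in for ideograms.

```python
def overlapping_bigrams(text):
    """Return every adjacent pair of characters in text (sketch)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# If the real words are AB and CD, the pair BC straddles a word boundary.
print(overlapping_bigrams("ABCD"))  # ['AB', 'BC', 'CD']
```

A phrase query that matches on BC finds documents where B and C merely happen to be adjacent, which is the kind of bogus result described above.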

Add decompose compound Japanese Katakana token capability to Kuromoji
-

Key: LUCENE-3921
URL: https://issues.apache.org/jira/browse/LUCENE-3921
Project: Lucene - Core
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 4.0-ALPHA
Environment: Cent OS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
Labels: features

The Japanese morphological analyzer Kuromoji doesn't have a capability to
decompose every Japanese Katakana compound token into sub-tokens. It seems
that some Katakana tokens can be decomposed, but this cannot be applied to every
Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ
don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary
has バッグ in its entries. I would like to apply the decompose feature to every
Katakana token if the sub-tokens are in the dictionary, or add the capability
to force the decompose feature on every Katakana token.


[jira] [Comment Edited] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-06 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13471121#comment-13471121
]

Lance Norskog edited comment on LUCENE-3921 at 10/7/12 12:33 AM:
-

Statistical models and rule-based models always have a failure rate. When you
use them you have to decide what to do about the failures. Attacking the
failures with another model drives toward Zeno's Paradox. For Chinese language
search, breaking the failures into bigrams makes a lot of sense. The CJK bigram
generator creates a massive number of bogus bigrams. Bogus bigrams cause bogus
results from sloppy phrase searches.

Smart Chinese and Kuromoji are not systems for doing natural-language
processing. They are systems for minimizing bogus bigrams. This allows sloppy
phrase queries to find fewer bogus results. In my use case, Smart Chinese
created only 2% (40k/1.8m) of the possible bigrams. [SOLR-3653] is the result
of my experience in supporting searching Chinese legal documents. I have some
useful numbers at the end of the page.

was (Author: lancenorskog):
Statistical models and rule-based models always have a failure rate. When
you use them you have to decide what to do about the failures. Attacking the
failures with another model drives toward Zeno's Paradox. For Chinese language
search, breaking the failures into bigrams makes a lot of sense.

Add decompose compound Japanese Katakana token capability to Kuromoji
-


[jira] [Closed] (SOLR-3760) Build packaging of complex contrib packages just plain does not work

2012-10-04 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Lance Norskog closed SOLR-3760.
---

Resolution: Fixed
Fix Version/s: 4.0

The Solr factories have all been moved into Lucene, and so the zig-zag
dependency problem no longer exists. For the rest of the topic, some other time.

Build packaging of complex contrib packages just plain does not work

Key: SOLR-3760
URL: https://issues.apache.org/jira/browse/SOLR-3760
Project: Solr
Issue Type: Improvement
Components: Build
Reporter: Lance Norskog
Fix For: 4.0

The build system packages Lucene libraries in the Solr war, but it does not
pack libraries required by the Lucene libraries. The UIMA and analysis-extras
contrib packages have factories for the Lucene libraries.
The net effect is that when solrconfig.xml includes lib directives for
dist/xxx-contribX-xxx.jar and solr/contrib/contribX/lib, this fails because
the lucene analyzer file inside the solr war cannot find the library files in
solr/contrib/contribX/lib because the classloader for the war does not find
the libraries from the lib directives.
Two alternative fixes are presented below.


[jira] [Commented] (LUCENE-3921) Add decompose compound Japanese Katakana token capability to Kuromoji

2012-10-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469934#comment-13469934
 ] 

Lance Norskog commented on LUCENE-3921:
---

I have discovered a similar problem with the Smart Chinese toolkit. Would the 
same approach work for both languages? Would it be worth solving this problem 
with a generic tool rather than a language-specific one?

 Add decompose compound Japanese Katakana token capability to Kuromoji
 -

 Key: LUCENE-3921
 URL: https://issues.apache.org/jira/browse/LUCENE-3921
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
 Environment: Cent OS 5, IPA Dictionary, Run with Search mode
Reporter: Kazuaki Hiraga
  Labels: features

 The Japanese morphological analyzer Kuromoji doesn't have a capability to 
 decompose every Japanese Katakana compound token into sub-tokens. It seems 
 that some Katakana tokens can be decomposed, but this cannot be applied to every 
 Katakana compound token. For instance, トートバッグ (tote bag) and ショルダーバッグ 
 don't decompose into トート バッグ and ショルダー バッグ although the IPA dictionary 
 has バッグ in its entries. I would like to apply the decompose feature to every 
 Katakana token if the sub-tokens are in the dictionary, or add the capability 
 to force the decompose feature on every Katakana token.


[jira] [Commented] (LUCENE-3922) Add Japanese Kanji number normalization to Kuromoji

2012-10-04 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469936#comment-13469936
 ] 

Lance Norskog commented on LUCENE-3922:
---

Kazuaki, do you have any comment on this fix?

 Add Japanese Kanji number normalization to Kuromoji
 ---

 Key: LUCENE-3922
 URL: https://issues.apache.org/jira/browse/LUCENE-3922
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Kazuaki Hiraga
  Labels: features
 Attachments: LUCENE-3922.patch


 Japanese people use Kanji numerals instead of Arabic numerals for writing 
 prices, addresses, and so on, e.g. 12万4800円 (124,800 JPY), 二番町三ノ二 (3-2 Nibancho), and 
 十二月 (December). So, we would like to normalize those Kanji numerals to Arabic 
 numerals (I don't think we need to have a capability to normalize to Kanji 
 numerals).
  


[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2012-09-30 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466565#comment-13466565
 ] 

Lance Norskog commented on LUCENE-2899:
---

Thank you!

This worked when I posted it. There have been many changes in 4.x and trunk 
since then. For example, all of the tokenizer and filter factories moved to 
Lucene from Solr. I'm waiting until 4.0 is finished before I redo this patch. 




 Add OpenNLP Analysis capabilities as a module
 -

 Key: LUCENE-2899
 URL: https://issues.apache.org/jira/browse/LUCENE-2899
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
 LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
 OpenNLPTokenizer.java, opennlp_trunk.patch


 Now that OpenNLP is an ASF project and has a nice license, it would be nice 
 to have a submodule (under analysis) that exposed capabilities for it. Drew 
 Farris, Tom Morton and I have code that does:
 * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
 would have to change slightly to buffer tokens)
 * NamedEntity recognition as a TokenFilter
 We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
 either payloads (PartOfSpeechAttribute?) on a token or at the same position.
 I'd propose it go under:
 modules/analysis/opennlp


[jira] [Commented] (SOLR-3898) Mouse-over help in Analysis Browser does nothing

2012-09-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464581#comment-13464581
 ] 

Lance Norskog commented on SOLR-3898:
-

This does not work in Safari. Oh well.

 Mouse-over help in Analysis Browser does nothing
 

 Key: SOLR-3898
 URL: https://issues.apache.org/jira/browse/SOLR-3898
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Lance Norskog
Priority: Minor
 Attachments: Screen-Shot-2012-09-27-at-9.55.22-AM.png


 The Analysis UI has a mouse-over question mark for the acronyms shown for 
 every stage in the analysis pipeline. Clicking on this does nothing.
 I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3898) Mouse-over help in Analysis Browser does nothing

2012-09-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465080#comment-13465080
 ] 

Lance Norskog commented on SOLR-3898:
-

I'm on Snow Leopard: Version 5.1.7 (6534.57.2). I do not know the versions or 
updates. I will close this.

 Mouse-over help in Analysis Browser does nothing
 

 Key: SOLR-3898
 URL: https://issues.apache.org/jira/browse/SOLR-3898
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Lance Norskog
Priority: Minor
 Attachments: Screen-Shot-2012-09-27-at-12.06.58-PM.png, 
 Screen-Shot-2012-09-27-at-9.55.22-AM.png


 The Analysis UI has a mouse-over question mark for the acronyms shown for 
 every stage in the analysis pipeline. Clicking on this does nothing.
 I guess this is on the 'todo' list?


[jira] [Resolved] (SOLR-3898) Mouse-over help in Analysis Browser does nothing

2012-09-27 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog resolved SOLR-3898.
-

Resolution: Invalid

 Mouse-over help in Analysis Browser does nothing
 

 Key: SOLR-3898
 URL: https://issues.apache.org/jira/browse/SOLR-3898
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Lance Norskog
Priority: Minor
 Attachments: Screen-Shot-2012-09-27-at-12.06.58-PM.png, 
 Screen-Shot-2012-09-27-at-9.55.22-AM.png


 The Analysis UI has a mouse-over question mark for the acronyms shown for 
 every stage in the analysis pipeline. Clicking on this does nothing.
 I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3218) Range faceting support for CurrencyField

2012-09-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13465322#comment-13465322
 ] 

Lance Norskog commented on SOLR-3218:
-

+1 for this feature. It makes the currency type 10x more compelling.

 Range faceting support for CurrencyField
 

 Key: SOLR-3218
 URL: https://issues.apache.org/jira/browse/SOLR-3218
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Jan Høydahl
 Fix For: 4.1

 Attachments: SOLR-3218-1.patch, SOLR-3218-2.patch, SOLR-3218.patch, 
 SOLR-3218.patch, SOLR-3218.patch


 Spinoff from SOLR-2202. Need to add range faceting capabilities for 
 CurrencyField


[jira] [Commented] (SOLR-3734) Forever loop in schema browser

2012-09-26 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464413#comment-13464413
 ] 

Lance Norskog commented on SOLR-3734:
-

Hi-

Yes, this works in branch_4x, using the schema I submitted. I do not have the 
ability to test whether it handles exceptions well. When you are writing new 
analyzer components, it is helpful for the UI to say your code blew up.


 Forever loop in schema browser
 --

 Key: SOLR-3734
 URL: https://issues.apache.org/jira/browse/SOLR-3734
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis, web gui
Reporter: Lance Norskog
Assignee: Stefan Matheis (steffkes)
 Attachments: SOLR-3734.patch, SOLR-3734.patch, 
 SOLR-3734_schema_browser_blocks_solr_conf_dir.zip


 When I start Solr with the attached conf directory, and hit the Schema 
 Browser, the loading circle spins permanently. 
 I don't know if the problem is in the UI or in Solr. The UI does not display 
 the Ajax solr calls, and I don't have a debugging proxy.


[jira] [Created] (SOLR-3898) Mouse-over help in Analysis Browser does nothing

2012-09-26 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-3898:
---

 Summary: Mouse-over help in Analysis Browser does nothing
 Key: SOLR-3898
 URL: https://issues.apache.org/jira/browse/SOLR-3898
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Lance Norskog
Priority: Minor


The Analysis UI has a mouse-over question mark for the acronyms shown for every 
stage in the analysis pipeline. Clicking on this does nothing.

I guess this is on the 'todo' list?


[jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases

2012-09-23 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461572#comment-13461572
]

Lance Norskog commented on SOLR-3653:
-

I ran some counts on a database of 300k Chinese legal documents. The index has
a unigram field based on the StandardAnalyzer, a bigram field based on the CJK
analyzer, and a Smart Chinese field. I pulled the terms for all of them and
filtered for Chinese ideograms only. These are text unigrams, with

* The unigram field had 55k terms.
* The bigram field had 1.8 million terms.
* The Smart Chinese field had 417k terms:
** unigrams: 9.6k
** bigrams: 40k
** trigrams: 14.6k
** four: 5.6k
** five: 300
** six: 70
** seven: 51
** eight: 19
** nine: 7
** ten: 2
** eleven: 3
** twelve: 2
** thirteen: 3
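
A breakdown like the one above can be produced with a few lines of Python once the terms of a field have been dumped to a list. This is only a sketch of the counting step, not the script used for these numbers; the function names are hypothetical, and "Chinese ideogram" is approximated here by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
from collections import Counter


def is_han(ch):
    """True if ch is in the basic CJK Unified Ideographs block (assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def length_histogram(terms):
    """Count ideogram-only terms by their length in characters."""
    return Counter(
        len(term) for term in terms
        if term and all(is_han(ch) for ch in term)
    )


terms = ["中国", "中", "abc", "人民", "共和国"]
print(length_histogram(terms))  # two bigrams, one unigram, one trigram
```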

The 4+ ngrams are essentially parsing failures by the Smart Chinese tokenizer.
I have attached three Google Translate versions of the longer ngrams.
'translations_first_500.trigrams.txt' and 'translations_first_500.quad.txt' are
the most common 3-ideogram and 4-ideogram terms. They have a lot of phrases
which should have been split. 'translations_450.five2thirteen.txt' are 450
ngrams which are 5 ideograms or longer. The longer ones have a lot of formal
geographical names, government organization names and official propaganda
phrases, more as the length increases.

For this corpus, based on the above breakdown and on other experience:
# CJK is a waste of disk space. Bigrams introduce a ton of noise.
# Unigrams might work well if you only do strict phrase searches. But searching
for A, B, and C separately when given ABC is useless.
# If you search for raw country names, Smart Chinese lets you down when the
document uses the formal name.

Smart Chinese really does need to be split into bigrams. To cut bigram noise, I
would take the database of bigrams that it generates, and then use these to
guide splitting 3+ grams into bigrams. That is, if it ever generates AB, then
the splitter turns ABCD into (AB CD). BC would be considered 'bigram noise'.
Similarly, if Smart Chinese generates EF, then DEFG would become (D EF G).
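
The splitting rule in the paragraph above can be sketched in Python as follows. This is one possible reading of the two examples given (ABCD and DEFG), not the code in the attached patch; the function name `split_with_known_bigrams` and the greedy left-to-right strategy are assumptions, and Latin letters stand in for ideograms.

```python
def split_with_known_bigrams(token, known_bigrams):
    """Split a long token into bigrams, guided by bigrams already seen (sketch)."""
    pieces = []
    i = 0
    while i < len(token):
        pair = token[i:i + 2]
        if len(pair) < 2:
            # One trailing character is left over: emit it alone.
            pieces.append(pair)
            i += 1
        elif pair in known_bigrams:
            # A bigram the tokenizer has produced elsewhere: trust it.
            pieces.append(pair)
            i += 2
        elif token[i + 1:i + 3] in known_bigrams:
            # A known bigram starts one character later, so emit a unigram
            # now to keep that bigram intact.
            pieces.append(token[i])
            i += 1
        else:
            # No guidance available: fall back to a plain bigram.
            pieces.append(pair)
            i += 2
    return pieces


print(split_with_known_bigrams("ABCD", {"AB"}))  # ['AB', 'CD']
print(split_with_known_bigrams("DEFG", {"EF"}))  # ['D', 'EF', 'G']
```

In both cases the straddling pairs (BC, DE, FG) are never emitted, which is the reduction in bigram noise the comment is after.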

However, a good fallback would be to have two fields, Smart Chinese and
unigrams, with Smart Chinese boosted upwards and unigrams only with strict
phrase search. With a high term count, bigrams are not helpful. You might even
want to search Smart Chinese first, and then do unigram loose phrase search
only if the recall is too low or the user is unhappy with the Smart Chinese
results.

Custom bigramming filter for to handle Smart Chinese edge cases
---

The Smart Simplified Chinese toolkit in lucene/analysis/smartcn does not
work in some edge cases. It fails to split certain words which were not part
of the dictionary or training corpus.
This patch supplies a bigramming class to handle these occasional mistakes.
The algorithm creates bigrams out of all words longer than two ideograms.


[jira] [Updated] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases

2012-09-23 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated SOLR-3653:


Attachment: translations_450.five2thirteen.txt
translations_first_500.trigrams.txt
translations_first_500.quad.txt

 Custom bigramming filter for to handle Smart Chinese edge cases
 ---

 Key: SOLR-3653
 URL: https://issues.apache.org/jira/browse/SOLR-3653
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Reporter: Lance Norskog
 Attachments: SmartChineseType.pdf, SOLR-3653.patch, 
 translations_450.five2thirteen.txt, translations_first_500.quad.txt, 
 translations_first_500.trigrams.txt


 The Smart Simplified Chinese toolkit in lucene/analysis/smartcn does not 
 work in some edge cases. It fails to split certain words which were not part 
 of the dictionary or training corpus. 
 This patch supplies a bigramming class to handle these occasional mistakes. 
 The algorithm creates bigrams out of all words longer than two ideograms.


[jira] [Commented] (SOLR-3653) Custom bigramming filter for to handle Smart Chinese edge cases

2012-09-23 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461573#comment-13461573
]

Lance Norskog commented on SOLR-3653:
-

Another note: one trigram is the number 15. There are several conventions for
representing integers, including regional quirks. There is no 'number
canonicalizer' in the Smart Chinese toolkit. This could be a problem with
formal documents: historical, government docs, treaties and the like.

[http://en.wikipedia.org/wiki/Chinese_numerals#Whole_numbers]

Custom bigramming filter for to handle Smart Chinese edge cases
---

Key: SOLR-3653
URL: https://issues.apache.org/jira/browse/SOLR-3653
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Reporter: Lance Norskog
Attachments: SmartChineseType.pdf, SOLR-3653.patch,
translations_450.five2thirteen.txt, translations_first_500.quad.txt,
translations_first_500.trigrams.txt


[jira] [Commented] (LUCENE-2510) migrate solr analysis factories to analyzers module

2012-09-23 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461598#comment-13461598
 ] 

Lance Norskog commented on LUCENE-2510:
---

bq. We should open new issues for:
* Update the goddamn wiki
* Add support to solr.class for classes under org.apache.lucene

If you're going to move the walls, please update the blueprints :)

 migrate solr analysis factories to analyzers module
 ---

 Key: LUCENE-2510
 URL: https://issues.apache.org/jira/browse/LUCENE-2510
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Assignee: Uwe Schindler
 Fix For: 4.0-BETA, 5.0

 Attachments: LUCENE-2510-movefactories.sh, 
 LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
 LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
 LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch


 In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
 This is a good step, but I think the next step is to put the Solr factories 
 into the analyzers module, too.
 This would make analyzers artifacts plugins to both lucene and solr, with 
 benefits such as:
 * users could use the old analyzers module with solr, too. This is a good 
 step to use real library versions instead of Version for backwards compat.
 * analyzers modules such as smartcn and icu, that aren't currently available 
 to solr users due to large file sizes or dependencies, would be simple 
 optional plugins to solr and easily available to users that want them.
 Rough sketch in this thread: 
 http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
 Practically, I haven't looked much and don't really have a plan for how this 
 will work yet, so ideas are very welcome.


[jira] [Comment Edited] (LUCENE-2510) migrate solr analysis factories to analyzers module

2012-09-23 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461598#comment-13461598
 ] 

Lance Norskog edited comment on LUCENE-2510 at 9/24/12 3:47 PM:


bq. We should open new issues for:
* Update the goddamn wiki

If you're going to move the walls, please update the blueprints :)

  was (Author: lancenorskog):
bq. We should open new issues for:
* Update the goddamn wiki
* Add support to solr.class for classes under org.apache.lucene

If you're going to move the walls, please update the blueprints :)
  
 migrate solr analysis factories to analyzers module
 ---

 Key: LUCENE-2510
 URL: https://issues.apache.org/jira/browse/LUCENE-2510
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Assignee: Uwe Schindler
 Fix For: 4.0-BETA, 5.0

 Attachments: LUCENE-2510-movefactories.sh, 
 LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
 LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
 LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch


 In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
 This is a good step, but I think the next step is to put the Solr factories 
 into the analyzers module, too.
 This would make analyzers artifacts plugins to both lucene and solr, with 
 benefits such as:
 * users could use the old analyzers module with solr, too. This is a good 
 step to use real library versions instead of Version for backwards compat.
 * analyzers modules such as smartcn and icu, that aren't currently available 
 to solr users due to large file sizes or dependencies, would be simple 
 optional plugins to solr and easily available to users that want them.
 Rough sketch in this thread: 
 http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
 Practically, I haven't looked much and don't really have a plan for how this 
 will work yet, so ideas are very welcome.


[jira] [Updated] (LUCENE-2510) migrate solr analysis factories to analyzers module

2012-09-23 Thread Lance Norskog (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Norskog updated LUCENE-2510:
--

Comment: was deleted

(was: bq. We should open new issues for:
* Update the goddamn wiki

If you're going to move the walls, please update the blueprints :))

 migrate solr analysis factories to analyzers module
 ---

 Key: LUCENE-2510
 URL: https://issues.apache.org/jira/browse/LUCENE-2510
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Affects Versions: 4.0-ALPHA
Reporter: Robert Muir
Assignee: Uwe Schindler
 Fix For: 4.0-BETA, 5.0

 Attachments: LUCENE-2510-movefactories.sh, 
 LUCENE-2510-movefactories.sh, LUCENE-2510-multitermcomponent.patch, 
 LUCENE-2510-multitermcomponent.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510-parent-classes.patch, LUCENE-2510-parent-classes.patch, 
 LUCENE-2510.patch, LUCENE-2510.patch, LUCENE-2510.patch, 
 LUCENE-2510-resourceloader-bw.patch, LUCENE-2510-simplify-tests.patch


 In LUCENE-2413 all TokenStreams were consolidated into the analyzers module.
 This is a good step, but I think the next step is to put the Solr factories 
 into the analyzers module, too.
 This would make analyzers artifacts plugins to both lucene and solr, with 
 benefits such as:
 * users could use the old analyzers module with solr, too. This is a good 
 step to use real library versions instead of Version for backwards compat.
 * analyzers modules such as smartcn and icu, that aren't currently available 
 to solr users due to large file sizes or dependencies, would be simple 
 optional plugins to solr and easily available to users that want them.
 Rough sketch in this thread: 
 http://www.lucidimagination.com/search/document/3465a0e55ba94d58/solr_and_analyzers_module
 Practically, I haven't looked much and don't really have a plan for how this 
 will work yet, so ideas are very welcome.


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-16 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456734#comment-13456734
]

Lance Norskog commented on LUCENE-4345:
---

bq. I don't think this should be using payloads to pull POS tags: the purpose
of payloads is when you need something stored in the actual index (and should
be limited to e.g. a single byte), it's not type-safe but application-specific.

Yes, some NLP applications want actual payloads. For entity resolution you can
have a UI add little icons for person, place, etc. In the OpenNLP patch it just
seemed silly to add another Attribute type.

bq. If we think it's useful for classifiers to limit the analysis to certain POS
categories, then instead we should factor out a minimal POSAttribute
sub-interface with something very generic like isNominal()/isVerbal() that can
actually be implemented by different taggers with different tag sets across
different languages.

There is a generic subset with mapping lists for most common tagsets for
different languages. They map these tags down to 12 POS tags. Adding this
mapper to the OpenNLP patch is on my large TODO list. They even have a mapping
set for the Twitter Parts-of-Speech tagger.

bq. This is currently how Kuromoji works, it has a POS-based stopfilter. these
are trivial to write. I also added a filter to remove payloads. If you use a
different Attribute for the analysis chain, then you need a 'change
POSAttribute to PayloadAttribute' at the bottom of the analysis chain.
Yes, I added one also. Some of the Kuromoji Attributes should be pulled up into
the generic set.
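
A rough sketch of the kind of tag mapper discussed above, assuming Penn Treebank tags as input. The class name CoarsePosMapper and the Coarse enum are made up for illustration, only a handful of categories are covered (not the full 12-tag set), and this is not code from the OpenNLP patch:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: collapses Penn Treebank tags into a few coarse categories. */
final class CoarsePosMapper {
    enum Coarse { NOUN, VERB, ADJ, ADV, OTHER }

    private static final Map<String, Coarse> PENN = new HashMap<>();
    static {
        // Illustrative subset only; a real mapping list would cover every tag in the tagset.
        for (String t : new String[] {"NN", "NNS", "NNP", "NNPS"}) PENN.put(t, Coarse.NOUN);
        for (String t : new String[] {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}) PENN.put(t, Coarse.VERB);
        for (String t : new String[] {"JJ", "JJR", "JJS"}) PENN.put(t, Coarse.ADJ);
        for (String t : new String[] {"RB", "RBR", "RBS"}) PENN.put(t, Coarse.ADV);
    }

    /** Unknown tags fall through to OTHER instead of failing. */
    static Coarse map(String pennTag) {
        return PENN.getOrDefault(pennTag, Coarse.OTHER);
    }

    static boolean isNominal(String pennTag) { return map(pennTag) == Coarse.NOUN; }

    static boolean isVerbal(String pennTag) { return map(pennTag) == Coarse.VERB; }

    public static void main(String[] args) {
        System.out.println(map("NNS") + " " + map("VBD") + " " + map("DT")); // NOUN VERB OTHER
    }
}
```

A tagger for another language would only need its own table; the isNominal()/isVerbal() checks stay the same.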

Create a Classification module
--

Key: LUCENE-4345
URL: https://issues.apache.org/jira/browse/LUCENE-4345
Project: Lucene - Core
Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch,
SOLR-3700_2.patch, SOLR-3700.patch

Lucene/Solr can host huge sets of documents containing lots of information in
fields so that these can be used as training examples (w/ features) in order
to very quickly create classifiers algorithms to use on new documents and /
or to provide an additional service.
So the idea is to create a contrib module (called 'classification') to host a
ClassificationComponent that will use already seen data (the indexed
documents / fields) to classify new documents / text fragments.
The first version will contain a (simplistic) Lucene based Naive Bayes
classifier but more implementations should be added in the future.


[jira] [Commented] (SOLR-3760) Build packaging of complex contrib packages just plain does not work

2012-09-14 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/SOLR-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455683#comment-13455683
]

Lance Norskog commented on SOLR-3760:
-

A more detailed explanation of the problem:

* lucene/analysis/X code in a jar.
* Lucene jar depends on third-party jar.
* lucene/analysis/X/ivy.xml downloads third-party jar.
* lucene/analysis/X build works.
* Solr factory in contrib/X depends on Lucene jar which depends on third-party
jar.
* Third-party jar is downloaded in Solr contrib/X/ivy.xml
** Thus, contrib/X build works because classpath is factory -> lucene jar ->
third-party jar.

However!

* Lucene jar is packed into Solr war.
* Lucene third-party jar is not packed into Solr war.
* Solr factory jar is not packed into Solr war.
* example/solr/collection1/conf/solrconfig.xml refers to
** ../contrib/X/lib
** ../dist/apache-solr-X-.
*** both are in the same classloader- order of lib declarations is not a
problem.
* solrconfig.xml classloader can find Solr factory in apache-solr-X-.jar
* Solr factory classloader supplied by solrconfig.xml can find lucene jar in
solr.war
* Lucene jar uses solr.war classloader
** solr.war classloader cannot see third-party jar
** _solr/contrib/X factory for lucene/analysis/X fails to load_

(I think I got that right.)
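
The visibility rule behind the failure can be shown with plain JDK classloaders. This is only a sketch: LoaderVisibility is a made-up name, the loader that loaded the demo class stands in for the solrconfig.xml lib loader, and its parent stands in for the solr.war loader. No Solr classes are involved.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical demo: a child loader sees its parent's classes, never the reverse. */
final class LoaderVisibility {
    /** True if the given loader (or one of its ancestors) can resolve the class name. */
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String self = LoaderVisibility.class.getName();
        // Loader that knows this class: plays the role of the lib-directive loader.
        ClassLoader own = LoaderVisibility.class.getClassLoader();
        // Its parent: plays the role of the war loader, which cannot look downward.
        ClassLoader parent = own.getParent();
        try (URLClassLoader child = new URLClassLoader(new URL[0], own)) {
            System.out.println("child sees class:  " + canLoad(self, child));
            System.out.println("parent sees class: " + canLoad(self, parent));
        }
    }
}
```

The Lucene jar in the war is in the position of the parent here: whatever it references has to be resolvable from its own loader or above.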

Build packaging of complex contrib packages just plain does not work

Key: SOLR-3760
URL: https://issues.apache.org/jira/browse/SOLR-3760
Project: Solr
Issue Type: Improvement
Components: Build
Reporter: Lance Norskog


[jira] [Commented] (SOLR-3625) Solr conf class loader does not find indirect jars - regression

2012-09-14 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455684#comment-13455684
 ] 

Lance Norskog commented on SOLR-3625:
-

I did not find a problem with the order of lib directives inside 
solrconfig.xml. All lib directives in solrconfig.xml seem to have one 
classloader. The problem happens when a Lucene jar refers to a third-party jar 
and Solr code outside the war tries to load the factory.

I have added a [detailed explanation of the 
problem|https://issues.apache.org/jira/browse/SOLR-3760?focusedCommentId=13455683&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13455683]
 to [SOLR-3760]

 Solr conf class loader does not find indirect jars - regression
 ---

 Key: SOLR-3625
 URL: https://issues.apache.org/jira/browse/SOLR-3625
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Reporter: Lance Norskog
Assignee: Hoss Man
 Fix For: 4.0, 5.0


 The SolrConf class loader does not find indirectly used jars from external 
 lib directories. This is a regression. It worked as of July 2, 2012, when I 
 posted the most recent OpenNLP patch ([LUCENE-2899]). Something has broken 
 since then.
 This regression is true in both 4.x and the trunk. Both worked on July 2, 
 2012.
 I have a project (the OpenNLP plugin) which uses jars from three places: 
 # solr/contrib/opennlp/src 
 ** tokenizer and filter factories
 # solr/contrib/opennlp/lib 
 ** OpenNLP project libraries
 # lucene/analysis/opennlp/src 
 ** tokenizer and filter
 SolrConf can only find the OpenNLP project jars when I add them to the 
 solr.war libraries. It cannot find them from any of these directories: 
 {code}
 solr/example/solr/lib
 solr/example/solr/collection1/lib
 solr/contrib/opennlp/lib
 {code}
 Here are the relevant config file entries. From solrconfig.xml:
 {code}
   <lib dir="../../../dist/" regex="apache-solr-opennlp-\d.*\.jar" />
   <lib dir="../../../contrib/opennlp/lib" regex=".*\.jar" />
 {code}
 (yes, it needs to be three dot-dot-slash, not two. See [SOLR-3624].)
 From schema.xml:
 {code}
 <!-- OpenNLP all-in-one analyzer -->
 <fieldType name="text_opennlp" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.OpenNLPTokenizerFactory"
       sentenceModel="opennlp/en-test-sent.bin"
       tokenizerModel="opennlp/en-test-tokenizer.bin"
     />
   </analyzer>
 </fieldType>
 {code}
 Here is the log. {{opennlp.tools.sentdetect.SentenceModel}} is a class in the 
 OpenNLP jar.
 {code}
 INFO: Adding 
 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/dist/apache-solr-opennlp-5.0-SNAPSHOT.jar'
  to classloader
 Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
 replaceClassLoader
 INFO: Adding 
 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/jwnl-1.3.3.jar'
  to classloader
 Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
 replaceClassLoader
 INFO: Adding 
 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/opennlp-maxent-3.0.2-incubating.jar'
  to classloader
 Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrResourceLoader 
 replaceClassLoader
 INFO: Adding 
 'file:/Users/lancenorskog/Documents/open/solr/trunk/solr/contrib/opennlp/lib/opennlp-tools-1.5.2-incubating.jar'
  to classloader
 Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrConfig init
 INFO: Using Lucene MatchVersion: LUCENE_50
 Jul 15, 2012 6:17:37 PM org.apache.solr.core.SolrConfig init
 INFO: Loaded SolrConfig: solrconfig.xml
 Jul 15, 2012 6:17:37 PM org.apache.solr.schema.IndexSchema readSchema
 INFO: Reading Solr Schema
 Jul 15, 2012 6:17:37 PM org.apache.solr.schema.IndexSchema readSchema
 INFO: Schema name=example
 Jul 15, 2012 6:17:38 PM org.apache.solr.schema.IndexSchema readSchema
 INFO: unique key field: id
 Jul 15, 2012 6:17:38 PM org.apache.solr.schema.FileExchangeRateProvider reload
 INFO: Reloading exchange rates from file currency.xml
 Jul 15, 2012 6:17:38 PM org.apache.solr.schema.FileExchangeRateProvider reload
 INFO: Reloading exchange rates from file currency.xml
 Jul 15, 2012 6:17:38 PM org.apache.solr.common.SolrException log
 SEVERE: null:java.lang.NoClassDefFoundError: 
 opennlp/tools/sentdetect/SentenceModel
   at 
 org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory.getSentenceModel(OpenNLPOpsFactory.java:60)
   at 
 org.apache.solr.analysis.OpenNLPTokenizerFactory.inform(OpenNLPTokenizerFactory.java:90)
   at 
 org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:584)
   at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:112)
   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:816)
   at

[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-14 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog commented on LUCENE-4345:
---

I recently did some related research in text analysis and found that limiting 
terms to nouns and verbs gave a 10-15% increase in all variations of the test.

So, filtering terms from Parts-of-Speech annotation will be very helpful. In my 
OpenNLP patch is a FilterPayloadsFilter which keeps or rips out from a list of 
text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
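
The keep-or-remove idea can be sketched without any Lucene classes. Everything here (PayloadTagFilter, Tagged, the keep flag) is hypothetical naming for illustration and is not the FilterPayloadsFilter from the patch; the text tag on each term stands in for a payload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: keep or drop terms according to a list of text tags. */
final class PayloadTagFilter {
    /** A term plus the tag a tagger attached to it (stand-in for a token payload). */
    static final class Tagged {
        final String term;
        final String tag;
        Tagged(String term, String tag) { this.term = term; this.tag = tag; }
    }

    /** keep=true retains only terms whose tag is listed; keep=false removes them instead. */
    static List<String> filter(List<Tagged> tokens, Set<String> tags, boolean keep) {
        List<String> out = new ArrayList<>();
        for (Tagged t : tokens) {
            if (tags.contains(t.tag) == keep) {
                out.add(t.term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> tokens = Arrays.asList(
            new Tagged("the", "DT"), new Tagged("dog", "NN"), new Tagged("barks", "VBZ"));
        Set<String> nounsAndVerbs = new HashSet<>(Arrays.asList("NN", "VBZ"));
        System.out.println(filter(tokens, nounsAndVerbs, true));  // [dog, barks]
        System.out.println(filter(tokens, nounsAndVerbs, false)); // [the]
    }
}
```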

 Create a Classification module
 --

 Key: LUCENE-4345
 URL: https://issues.apache.org/jira/browse/LUCENE-4345
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
 Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
 SOLR-3700_2.patch, SOLR-3700.patch


 Lucene/Solr can host huge sets of documents containing lots of information in 
 fields so that these can be used as training examples (w/ features) in order 
 to very quickly create classifiers algorithms to use on new documents and / 
 or to provide an additional service.
 So the idea is to create a contrib module (called 'classification') to host a 
 ClassificationComponent that will use already seen data (the indexed 
 documents / fields) to classify new documents / text fragments.
 The first version will contain a (simplistic) Lucene based Naive Bayes 
 classifier but more implementations should be added in the future.


[jira] [Comment Edited] (LUCENE-4345) Create a Classification module

2012-09-14 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456053#comment-13456053
 ] 

Lance Norskog edited comment on LUCENE-4345 at 9/15/12 10:20 AM:
-

I recently did some related research in text analysis and found that limiting 
terms to nouns and verbs gave a 10-15% increase in all variations of the test.

So, filtering terms from Parts-of-Speech annotation will be very helpful. In my 
OpenNLP patch is a FilterPayloadsFilter which keeps or rips out terms based on 
a list of text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]

  was (Author: lancenorskog):
I recently did some related research in text analysis and found that 
limiting terms to nouns and verbs was a 10-15% increase in all variations of the 
test.

So, filtering terms from Parts-of-Speech annotation will be very helpful. In my 
OpenNLP patch is a FilterPayloadsFilter which keeps or rips out from a list of 
text payloads.

[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
  
 Create a Classification module
 --

 Key: LUCENE-4345
 URL: https://issues.apache.org/jira/browse/LUCENE-4345
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
 Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
 SOLR-3700_2.patch, SOLR-3700.patch


 Lucene/Solr can host huge sets of documents containing lots of information in 
 fields so that these can be used as training examples (w/ features) in order 
 to very quickly create classifiers algorithms to use on new documents and / 
 or to provide an additional service.
 So the idea is to create a contrib module (called 'classification') to host a 
 ClassificationComponent that will use already seen data (the indexed 
 documents / fields) to classify new documents / text fragments.
 The first version will contain a (simplistic) Lucene based Naive Bayes 
 classifier but more implementations should be added in the future.


[jira] [Commented] (SOLR-2592) Pluggable shard lookup mechanism for SolrCloud

2012-09-13 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455183#comment-13455183
 ] 

Lance Norskog commented on SOLR-2592:
-

-1
Naming the shard inside the shard makes it impossible to split or merge shards. 

 Pluggable shard lookup mechanism for SolrCloud
 --

 Key: SOLR-2592
 URL: https://issues.apache.org/jira/browse/SOLR-2592
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Affects Versions: 4.0-ALPHA
Reporter: Noble Paul
Assignee: Mark Miller
 Attachments: dbq_fix.patch, pluggable_sharding.patch, 
 pluggable_sharding_V2.patch, SOLR-2592.patch, SOLR-2592_r1373086.patch, 
 SOLR-2592_rev_2.patch, SOLR_2592_solr_4_0_0_BETA_ShardPartitioner.patch


 If the data in a cloud can be partitioned on some criteria (say range, hash, 
 attribute value, etc.), it will be easy to narrow down the search to a smaller 
 subset of shards and, in effect, achieve more efficient search.


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-11 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13453694#comment-13453694
 ] 

Lance Norskog commented on LUCENE-4345:
---

What is the scale that you expect this Bayesian classifier to handle? How many 
training documents does it need? 

 Create a Classification module
 --

 Key: LUCENE-4345
 URL: https://issues.apache.org/jira/browse/LUCENE-4345
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
 Attachments: LUCENE-4345_2.patch, LUCENE-4345.patch, 
 SOLR-3700_2.patch, SOLR-3700.patch


 Lucene/Solr can host huge sets of documents containing lots of information in 
 fields so that these can be used as training examples (w/ features) in order 
 to very quickly create classifiers algorithms to use on new documents and / 
 or to provide an additional service.
 So the idea is to create a contrib module (called 'classification') to host a 
 ClassificationComponent that will use already seen data (the indexed 
 documents / fields) to classify new documents / text fragments.
 The first version will contain a (simplistic) Lucene based Naive Bayes 
 classifier but more implementations should be added in the future.


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-03 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447429#comment-13447429
]

Lance Norskog commented on LUCENE-4345:
---

bq. would make training slower but it could be useful to improve accuracy
If you use index data which is already analyzed with the same analyzer as your
test (unseen) documents, you can use a lot more documents as input. More is
better. As the training data increases, signal drives out noise. Once you add
the ability to store and load models, training speed becomes less important.

Look at the Mahout project for ideas about text classifiers. The
ConfusionMatrix class and the html page it prints are really handy for
summarizing and probing the classifier's performance.
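
For readers who have not used one, a confusion matrix is just a table of (actual label, predicted label) counts. The sketch below is a minimal stand-alone tally, not Mahout's ConfusionMatrix class; the name ConfusionTally and its methods are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical minimal confusion matrix: rows are actual labels, columns are predictions. */
final class ConfusionTally {
    // counts.get(actual).get(predicted) = number of test documents seen with that pair
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    void add(String actual, String predicted) {
        counts.computeIfAbsent(actual, k -> new TreeMap<>()).merge(predicted, 1, Integer::sum);
    }

    int count(String actual, String predicted) {
        Map<String, Integer> row = counts.get(actual);
        return row == null ? 0 : row.getOrDefault(predicted, 0);
    }

    /** Fraction of documents on the diagonal (actual == predicted). */
    double accuracy() {
        int total = 0;
        int correct = 0;
        for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                total += cell.getValue();
                if (row.getKey().equals(cell.getKey())) {
                    correct += cell.getValue();
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        ConfusionTally m = new ConfusionTally();
        m.add("sports", "sports");
        m.add("sports", "politics");
        m.add("politics", "politics");
        m.add("politics", "politics");
        System.out.println(m.count("sports", "politics")); // 1
        System.out.println(m.accuracy());                  // 0.75
    }
}
```

The off-diagonal cells are the interesting part: they show which categories the classifier mixes up.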

Create a Classification module
--


[jira] [Commented] (LUCENE-4345) Create a Classification module

2012-09-02 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447052#comment-13447052
 ] 

Lance Norskog commented on LUCENE-4345:
---

Nice! I've found that filtering for nouns and verbs makes another NLP task 
(latent semantic indexing) work much better. This will benefit from 
parts-of-speech filtering.

 Create a Classification module
 --

 Key: LUCENE-4345
 URL: https://issues.apache.org/jira/browse/LUCENE-4345
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Minor
 Attachments: LUCENE-4345.patch, SOLR-3700_2.patch, SOLR-3700.patch


 Lucene/Solr can host huge sets of documents containing lots of information in 
 fields so that these can be used as training examples (w/ features) in order 
 to very quickly create classifiers algorithms to use on new documents and / 
 or to provide an additional service.
 So the idea is to create a contrib module (called 'classification') to host a 
 ClassificationComponent that will use already seen data (the indexed 
 documents / fields) to classify new documents / text fragments.
 The first version will contain a (simplistic) Lucene based Naive Bayes 
 classifier but more implementations should be added in the future.


[jira] [Commented] (SOLR-3585) processing updates in multiple threads

2012-08-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445775#comment-13445775
 ] 

Lance Norskog commented on SOLR-3585:
-

These seem to be small records. Try indexing large PDF files with the 
ExtractingRequestHandler; these spend a much longer time in the analysis phase 
and have more data to copy around. Try them with and without storing the field: 
stored fields have to be copied during merges.

 processing updates in multiple threads
 --

 Key: SOLR-3585
 URL: https://issues.apache.org/jira/browse/SOLR-3585
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 4.0-ALPHA
Reporter: Mikhail Khludnev
Priority: Minor
 Attachments: multithreadupd.patch, report.tar.gz, SOLR-3585.patch, 
 SOLR-3585.patch


 Hello,
 I'd like to contribute an update processor that forks many threads, which 
 concurrently process the stream of commands. It may be beneficial for users 
 who stream many docs through a single request. 
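
The general shape of such a processor can be sketched with a plain thread pool. This is not the attached patch: ParallelUpdates and processAll are hypothetical names, documents are reduced to strings, and the per-document work is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: fan one stream of documents out to a fixed pool of workers. */
final class ParallelUpdates {
    /** Processes every document on the pool and returns results in input order. */
    static List<String> processAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String doc : docs) {
                // Placeholder for the expensive per-document step (analysis, extraction, ...).
                futures.add(pool.submit(() -> doc.trim().toLowerCase(Locale.ROOT)));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                out.add(f.get()); // blocks until that document is done
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("update processing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList(" Doc One ", "DOC TWO"), 2)); // [doc one, doc two]
    }
}
```

Collecting the futures in submission order keeps the output ordered even though the work itself runs concurrently.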


[jira] [Commented] (SOLR-3734) Forever loop in schema browser

2012-08-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446358#comment-13446358
 ] 

Lance Norskog commented on SOLR-3734:
-

If there is an exception or a timeout, the UI should show the problem.

 Forever loop in schema browser
 --

 Key: SOLR-3734
 URL: https://issues.apache.org/jira/browse/SOLR-3734
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis, web gui
Reporter: Lance Norskog
 Attachments: SOLR-3734_schema_browser_blocks_solr_conf_dir.zip


 When I start Solr with the attached conf directory, and hit the Schema 
 Browser, the loading circle spins permanently. 
 I don't know if the problem is in the UI or in Solr. The UI does not display 
 the Ajax solr calls, and I don't have a debugging proxy.


[jira] [Commented] (SOLR-3473) Distributed deduplication broken

2012-08-31 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446539#comment-13446539
 ] 

Lance Norskog commented on SOLR-3473:
-

It would be great to have this work in some form, even if it does not have the 
same API as before.


 Distributed deduplication broken
 

 Key: SOLR-3473
 URL: https://issues.apache.org/jira/browse/SOLR-3473
 Project: Solr
  Issue Type: Bug
  Components: SolrCloud, update
Affects Versions: 4.0-ALPHA
Reporter: Markus Jelsma
 Fix For: 4.0

 Attachments: SOLR-3473.patch, SOLR-3473.patch, SOLR-3473-trunk-2.patch


 Solr's deduplication via the SignatureUpdateProcessor is broken for 
 distributed updates on SolrCloud.
 Mark Miller:
 {quote}
 Looking again at the SignatureUpdateProcessor code, I think that indeed this 
 won't currently work with distrib updates. Could you file a JIRA issue for 
 that? The problem is that we convert update commands into solr documents - 
 and that can cause a loss of info if an update proc modifies the update 
 command.
 I think the reason that you see a multiple values error when you try the 
 other order is because of the lack of a document clone (the other issue I 
 mentioned a few emails back). Addressing that won't solve your issue though - 
 we have to come up with a way to propagate the currently lost info on the 
 update command.
 {quote}
 Please see the ML thread for the full discussion: 
 http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
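
For context, the signature in question is a hash computed from selected fields, so two documents with the same content get the same value. The sketch below shows only that idea, using MD5 from the JDK; DocSignature, the map-based document, and the field names are assumptions for illustration and not Solr's SignatureUpdateProcessor code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: content signature over chosen fields, for duplicate detection. */
final class DocSignature {
    /** Hex MD5 over the listed fields, in the order given; absent fields are skipped. */
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.get(field);
                if (value != null) {
                    md5.update(value.getBytes(StandardCharsets.UTF_8));
                    md5.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required JDK algorithm
        }
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("id", "1");
        a.put("title", "hello");
        Map<String, String> b = new HashMap<>(a);
        b.put("id", "2"); // different id, same content
        System.out.println(signature(a, "title").equals(signature(b, "title"))); // true
    }
}
```

Whatever form the fix takes, the computed value has to travel with the document when the update is forwarded to another node.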


[jira] [Commented] (SOLR-3488) Create a Collections API for SolrCloud

2012-08-27 Thread Lance Norskog (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442757#comment-13442757
 ] 

Lance Norskog commented on SOLR-3488:
-

bq. Yeah, a work queue in ZK makes perfect sense. 

[http://zookeeper-user.578899.n2.nabble.com/Announcing-KeptCollections-distributed-Java-Collections-for-ZooKeeper-td5816709.html]

[https://github.com/anthonyu/KeptCollections]

Distributed Java Collections implementations. Apache licensed. Years of use.

 Create a Collections API for SolrCloud
 --

 Key: SOLR-3488
 URL: https://issues.apache.org/jira/browse/SOLR-3488
 Project: Solr
  Issue Type: New Feature
  Components: SolrCloud
Reporter: Mark Miller
Assignee: Mark Miller
 Fix For: 4.0

 Attachments: SOLR-3488_2.patch, SOLR-3488.patch, SOLR-3488.patch, 
 SOLR-3488.patch





[jira] [Commented] (SOLR-3664) risk of inconsistency in solr(contrib)-module-thirdparty dependencies

2012-08-26 Thread Lance Norskog (JIRA)

Lance Norskog commented on SOLR-3664:

risk of inconsistency in solr(contrib)-module-thirdparty dependencies

Here is another way: repacking dependent jars into all contrib dist/ jars. I've done an experiment with analysis-extras, and it takes 4 seconds to repack the classes from the dependent jars into apache-solr-analysis-extras-5.0-SNAPSHOT.jar.

Add this to solr/contrib/analysis-extras/build.xml:

<target name="addjars">
<zip destfile="../../dist/apache-solr-analysis-extras-4.0-SNAPSHOT.jar" update="true">
	<zipfileset src="../../build/contrib/solr-analysis-extras/lucene-libs/lucene-analyzers-icu-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="../../build/contrib/solr-analysis-extras/lucene-libs/lucene-analyzers-morfologik-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="../../build/contrib/solr-analysis-extras/lucene-libs/lucene-analyzers-smartcn-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="../../build/contrib/solr-analysis-extras/lucene-libs/lucene-analyzers-stempel-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />

	<zipfileset src="lib/icu4j-49.1.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="lib/morfologik-fsa-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="lib/morfologik-polish-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
	<zipfileset src="lib/morfologik-stemming-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
</zip>
</target>


Run 'ant dist addjars'. The dist jar goes from 20k to 15M. But, it's 15M in one deployable file. I don't know what to do about META-INF files in the absorbed libraries. This approach just preserves the manifest file.

This approach needs a little rearranging of the order of the build steps. There is no place visible to the contrib build.xml where the solr/build dist jar is finished, but not yet copied to dist/.

[jira] [Updated] (SOLR-3664) risk of inconsistency in solr(contrib)-module-thirdparty dependencies

2012-08-26 Thread Lance Norskog (JIRA)















































Lance Norskog
 updated  SOLR-3664


risk of inconsistency in solr(contrib)-module-thirdparty dependencies
















Change By:


Lance Norskog
(26/Aug/12 11:03)




Comment:


Hereisanotherway:repackingdependentjarsintoallcontribdist/jars.Ivedoneanexperimentwithanalysis-extras,andittakes4secondstorepacktheclassesfromthedependentjarsintoapache-solr-analysis-extras-5.0-SNAPSHOT.jar.Addthistosolr/contrib/analysis-extras/build.xml:{code:xml}targetname=addjarszipdestfile=../../dist/apache-solr-analysis-extras-4.0-SNAPSHOT.jarupdate=true	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">	zipfilesetsrc="">/zip/target{code}Runantdistaddjars.Thedistjargoesfrom20kto15M.But,its15Minonedeployablefile.IdontknowwhattodoaboutMETA-INFfilesintheabsorbedlibraries.Thisapproachjustpreservesthemanifestfile.Thisapproachneedsalittlerearrangingoftheorderofthebuildsteps.Thereisnoplacevisibletothecontribbuild.xmlwherethesolr/builddistjarisfinished,butnotyetcopiedtodist/.


[jira] [Reopened] (SOLR-3623) inconsistent treatment of lucene jars third-party deps in analysis-extras uima (in war and in lucene-libs)

2012-08-26 Thread Lance Norskog (JIRA)

Lance Norskog
reopened SOLR-3623

inconsistent treatment of lucene jars third-party deps in analysis-extras uima (in war and in lucene-libs)

I opened this issue to fix jar arrangements so that the OpenNLP integration could work. analysis-extras, opennlp, and uima share the same problem: they use lucene libraries and third-party dependencies.

Fixing license file problems is certainly helpful, but does not make deployment any easier. This issue was essentially hijacked.

Here is one way to make it easy to deploy items outside of the solr war file: repack dependent jars into all contrib dist/ jars. Just pack everything about analysis-extras into dist/analysis-extras.jar. Remove the contrib lucene libraries from the war file.

Add this to solr/contrib/analysis-extras/build.xml:

<!-- Repack the Lucene analyzer jars and the third-party libraries into the contrib dist jar. -->
<target name="addjars">
  <zip destfile="../../dist/apache-solr-analysis-extras-4.0-SNAPSHOT.jar" update="true">
    <!-- Lucene analysis modules; each absorbed jar's own manifest is excluded -->
    <zipfileset src="../../../lucene/build/analysis/common/lucene-analyzers-common-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/icu/lucene-analyzers-icu-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/kuromoji/lucene-analyzers-kuromoji-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/morfologik/lucene-analyzers-morfologik-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/phonetic/lucene-analyzers-phonetic-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/smartcn/lucene-analyzers-smartcn-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="../../../lucene/build/analysis/stempel/lucene-analyzers-stempel-4.0-SNAPSHOT.jar" excludes="META-INF/MANIFEST.MF" />

    <!-- Third-party dependencies from this contrib's lib/ directory -->
    <zipfileset src="lib/icu4j-49.1.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="lib/morfologik-fsa-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="lib/morfologik-polish-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
    <zipfileset src="lib/morfologik-stemming-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
  </zip>
</target>

Run 'ant dist addjars'. The dist jar goes from 20k (one file) to 21M (4035 files). But, it is 21M in one deployable file. Everything is in one place!

Caveats:

* This approach needs a little rearranging of the order of the build steps. There is no place visible to the contrib build.xml where the solr/build dist jar is finished, but not yet copied to dist/.
* I don't know what to do about META-INF files in the absorbed libraries. This approach just preserves the manifest file.
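
One possible way to approach the META-INF question, as a sketch only: widen the excludes pattern so that signature files from an absorbed jar are dropped along with its manifest, since signature entries copied into a repacked jar no longer match its contents. Whether any of these jars is actually signed has not been checked here; the pattern list is an assumption, not something the build currently does.

```xml
<!-- Sketch: drop the absorbed jar's manifest and any signature files.
     The extra patterns are an assumption, not part of the experiment above. -->
<zipfileset src="lib/icu4j-49.1.jar"
            excludes="META-INF/MANIFEST.MF,META-INF/*.SF,META-INF/*.DSA,META-INF/*.RSA" />
```

Other META-INF content, such as files under META-INF/services, would still be copied as-is, and same-named entries from different jars would not be merged.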

Redundant dependencies:

analysis-extras and extraction both use icu4j, which is a huge jar. Too bad.
dataimporter wants all of extraction. Stick with the current arrangement.

This design is appropriate for analysis-extras, uima and opennlp. All of these have lucene libraries and lib/ directories, and the current build arrangement just plain does not work. It is a convenience for clustering, dataimporthandler (-extras), extraction, langid, and velocity.

The build.xml file above needs macro-izing, and as mentioned the build sequence needs a point where the contrib build can repack the dist file inside solr/build.
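
A rough idea of what macro-izing could look like, using Ant's macrodef task. The macro name "repack-into-dist", its "destfile" attribute, and the "jars" element are all hypothetical names chosen for illustration; nothing like this exists in the build today.

```xml
<!-- Hypothetical macro: repacks the given zipfilesets into an existing dist jar. -->
<macrodef name="repack-into-dist">
  <attribute name="destfile"/>
  <element name="jars"/>
  <sequential>
    <zip destfile="@{destfile}" update="true">
      <jars/>
    </zip>
  </sequential>
</macrodef>

<!-- Usage in a contrib build.xml; only two of the jars are shown. -->
<target name="addjars">
  <repack-into-dist destfile="../../dist/apache-solr-analysis-extras-4.0-SNAPSHOT.jar">
    <jars>
      <zipfileset src="lib/icu4j-49.1.jar" excludes="META-INF/MANIFEST.MF" />
      <zipfileset src="lib/morfologik-fsa-1.5.3.jar" excludes="META-INF/MANIFEST.MF" />
    </jars>
  </repack-into-dist>
</target>
```

Each contrib would then list only its own jars, and the zip/update logic would live in one shared place. This does not address the build-sequence problem, which still needs a hook between finishing the jar in solr/build and copying it to dist/.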

Change By:

Lance Norskog
(26/Aug/12 11:30)

Status: Resolved -> Reopened

Resolution:

Fixed


[jira] [Commented] (SOLR-3759) mistakes about example-DIH

2012-08-26 Thread Lance Norskog (JIRA)

Lance Norskog
 commented on  SOLR-3759


mistakes about example-DIH

Cool! This is an illustration of why "one big example" is better: people test it!

The convention in solr/ is to add solr.xml and collection1:
solr/solr.xml
solr/collection1
solr/collection1/conf
... 

Please change example-DIH/solr to match this.


[jira] [Created] (SOLR-3760) Build packaging of complex contrib packages just plain does not work

2012-08-26 Thread Lance Norskog (JIRA)

Lance Norskog created SOLR-3760:
---

         Summary: Build packaging of complex contrib packages just plain does not work
             Key: SOLR-3760
             URL: https://issues.apache.org/jira/browse/SOLR-3760
         Project: Solr
      Issue Type: Improvement
      Components: Build
        Reporter: Lance Norskog


The build system packages Lucene libraries in the Solr war, but it does not 
pack the libraries that those Lucene libraries require. The UIMA and 
analysis-extras contrib packages have factories for the Lucene libraries.

The net effect: when solrconfig.xml includes lib directives for 
dist/xxx-contribX-xxx and solr/contrib/contribX/lib, loading fails. The 
Lucene analyzer classes inside the Solr war cannot find the library files in 
solr/contrib/contribX/lib, because the classloader for the war does not see 
the libraries added by the lib directives.
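
For concreteness, the kind of solrconfig.xml lib directives described above might look like this for analysis-extras. The relative paths and the regex are assumptions modeled on the stock example configuration, not taken from this issue.

```xml
<!-- solrconfig.xml: illustrative lib directives for one contrib (paths are assumptions) -->
<config>
  <!-- the contrib's own dist jar -->
  <lib dir="../../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar" />
  <!-- the third-party jars the Lucene analyzers need -->
  <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
</config>
```

With a setup like this, the third-party jars are visible to the contrib code loaded through the directives, but not to the Lucene analyzer classes that were loaded from inside the war.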


