[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-06-01 Thread Tanguy Moal (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287408#comment-13287408
 ] 

Tanguy Moal commented on LUCENE-4063:
-

I agree with both of you; it sounds like a design change.

I think Jacques Savoy's algorithm was intended to be used on words, not on 
numbers or mixes of both (as in 22h00).

That is true of any stemmer, I think. That's why, on the mailing list, I also 
suggested that each stemmer could share a common interface that filters 
non-stemmable literals out of the way. That could prevent the same issue from 
arising in a different stemming implementation.

I'm just saying this as I think about it.
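
The "common interface" idea above could be sketched roughly as follows. This is only an illustration in plain Java with no Lucene dependency: the class StemGuard, its methods, and the rule "a token is stemmable only if every character is a letter" are assumptions made for this example, not anything taken from Lucene or from the patch.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper, not part of Lucene: wraps any stemmer with a "stemmable token" check. */
final class StemGuard {
    private StemGuard() {}

    /** Assumption for this sketch: a token is stemmable only if it is non-empty and all letters. */
    static boolean isStemmable(String token) {
        return !token.isEmpty() && token.codePoints().allMatch(Character::isLetter);
    }

    /** Returns a stemmer that passes non-stemmable tokens (numbers, mixes like 22h00) through unchanged. */
    static UnaryOperator<String> guard(UnaryOperator<String> stemmer) {
        return token -> isStemmable(token) ? stemmer.apply(token) : token;
    }
}
```

Any stemming function could then be wrapped once, e.g. StemGuard.guard(myStemmer), so that every implementation skips tokens such as 22h00 in the same way.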

 FrenchLightStemmer performs abusive compression of (arbitrary) repeated 
 characters in long tokens
 -

 Key: LUCENE-4063
 URL: https://issues.apache.org/jira/browse/LUCENE-4063
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.4, 4.0
Reporter: Tanguy Moal
Assignee: Steven Rowe
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-4063.patch, SOLR-3463.patch, SOLR-3463.patch, 
 SOLR-3463.patch


 FrenchLightStemmer performs aggressive deletions on repeated character 
 sequences, even on numbers.
 This might be unexpected during full text search.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-06-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287447#comment-13287447
 ] 

Robert Muir commented on LUCENE-4063:
-

{quote}
That's why, on the mailing list, I also suggested that each stemmer could share 
a common interface that filters non-stemmable literals out of the way
{quote}

We actually have this in place, but it's too limited. It's called 
KeywordAttribute. When this is set, the stemmer will not touch the word.

Currently the only way to set this out of the box is to use KeywordMarkerFilter, 
which takes a Set of protected words.
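
To make the contract concrete, here is a minimal dependency-free sketch of the behaviour described above: a per-token keyword flag, a step that sets it from a set of protected words, and a stemming step that honours it. FlaggedToken and KeywordSketch are hypothetical stand-ins written for this example; they are not Lucene's KeywordAttribute or KeywordMarkerFilter classes and do not reproduce their signatures.

```java
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a token carrying a keyword flag; not a Lucene class. */
final class FlaggedToken {
    String text;
    boolean keyword; // plays the role the comment ascribes to KeywordAttribute

    FlaggedToken(String text) {
        this.text = text;
    }
}

/** Hypothetical sketch of "mark protected words, then let the stemmer skip them". */
final class KeywordSketch {
    private KeywordSketch() {}

    /** Flags every token whose text appears in the protected set. */
    static void markProtected(List<FlaggedToken> tokens, Set<String> protectedWords) {
        for (FlaggedToken t : tokens) {
            if (protectedWords.contains(t.text)) {
                t.keyword = true;
            }
        }
    }

    /** Applies the stemmer only to tokens that are not flagged as keywords. */
    static void stemAll(List<FlaggedToken> tokens, UnaryOperator<String> stemmer) {
        for (FlaggedToken t : tokens) {
            if (!t.keyword) {
                t.text = stemmer.apply(t.text);
            }
        }
    }
}
```

The limitation mentioned above is visible here: the only input to the marking step is a fixed set of words, so numbers or other open-ended classes of tokens cannot be protected this way.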

But to make your idea more flexible, I could imagine a couple more filters:
* one that marks tokens as keywords based on a set of types. In this case you 
  would just add NUM to that set, and no stemmer would touch any numbers. Of 
  course, for French this is solved already, but imagine you are using the 
  URLEmail tokenizer: I think a set like { URL, EMAIL } would be very useful; 
  otherwise stemmers will probably muck with them.
* one that marks tokens as keywords based on a regular expression. This could 
  be good for fine-tuning stemmers for a lot of general-purpose needs: e.g. on 
  the mailing list, someone was previously unhappy about how Russian stemmers 
  treated Russian place names, and they had a certain set of suffixes they 
  didn't want stemmed.

Anyway, I would really like to see these filters; I think they would be pretty 
simple to implement as well.
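
The two proposed filters could look roughly like this in a dependency-free sketch. TypedToken and KeywordMarkers are hypothetical names invented for this example, and the type labels are plain strings chosen for illustration; none of this is an existing Lucene API.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical token with a type label (e.g. "NUM", "URL") and a keyword flag; not a Lucene class. */
final class TypedToken {
    final String text;
    final String type;
    boolean keyword;

    TypedToken(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

/** Hypothetical sketches of the two marking strategies proposed in the comment. */
final class KeywordMarkers {
    private KeywordMarkers() {}

    /** Marks the token as a keyword if its type is in the protected set, e.g. { NUM } or { URL, EMAIL }. */
    static void markByType(TypedToken token, Set<String> protectedTypes) {
        if (protectedTypes.contains(token.type)) {
            token.keyword = true;
        }
    }

    /** Marks the token as a keyword if its whole text matches the regular expression. */
    static void markByPattern(TypedToken token, Pattern pattern) {
        if (pattern.matcher(token.text).matches()) {
            token.keyword = true;
        }
    }
}
```

With a protected type set containing NUM, a token such as 1234555 typed as NUM would be flagged and left alone by any stemmer that honours the flag. A pattern matching a chosen suffix would do the same for the place-name case; the actual suffixes are not given in this thread, so any pattern used with this sketch is the caller's assumption.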


[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-05-17 Thread Tanguy Moal (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277635#comment-13277635
 ] 

Tanguy Moal commented on LUCENE-4063:
-

I'd be glad to see this on 3.x (x >= 4), since that's the version I used to 
spot the issue. Maybe I should have marked this issue as a bug rather than an 
improvement? :-)

I have a custom filter factory that marks numbers as keywords anyway, as I 
needed a quick fix.
So from my point of view it doesn't really matter... I could just drop that 
filter from my analysis if the patch finds its way to 3.x.

Thank you very much for your quick responses on this issue.


[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-05-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277728#comment-13277728
 ] 

Robert Muir commented on LUCENE-4063:
-

As for whether this is a bug: the original code implements the algorithm it 
claims to implement, and undoubling any repeated character was its heuristic. See 
http://members.unine.ch/jacques.savoy/clef/frenchStemmerPlus.txt
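
For readers who want to see why such a heuristic affects numbers, here is a small sketch of "undoubling" in plain Java. It is written from the description in this thread and is not the Lucene source or Savoy's reference code; the class name Undoubler and the lettersOnly switch are assumptions made for illustration, and any length threshold the real stemmer may apply is left out.

```java
/** Illustrative sketch of an "undoubling" heuristic; not the Lucene source. */
final class Undoubler {
    private Undoubler() {}

    /**
     * Collapses each run of identical adjacent characters to a single character.
     * When lettersOnly is true, only runs of letters are collapsed, so digits
     * and other non-letter characters are left untouched.
     */
    static String undouble(String token, boolean lettersOnly) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean repeatsPrevious = out.length() > 0 && out.charAt(out.length() - 1) == c;
            boolean mayCollapse = !lettersOnly || Character.isLetter(c);
            if (repeatsPrevious && mayCollapse) {
                continue; // drop the repeated character
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With lettersOnly set to false, undouble("22h00", false) yields "2h0" and undouble("1234555", false) yields "12345", which is the kind of surprise reported in this issue. With lettersOnly set to true, both tokens come back unchanged, while a run of repeated letters is still collapsed.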



[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-05-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277051#comment-13277051
 ] 

Steven Rowe commented on LUCENE-4063:
-

Tanguy, since this is entirely a Lucene change, I've moved the issue's project 
from Solr to Lucene.


[jira] [Commented] (LUCENE-4063) FrenchLightStemmer performs abusive compression of (arbitrary) repeated characters in long tokens

2012-05-16 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277066#comment-13277066
 ] 

Steven Rowe commented on LUCENE-4063:
-

Committed to trunk.  Thanks Tanguy!

I'm not sure if this should be committed on the 3.6 branch, since that branch 
is bug-fix only, and this issue is marked as an improvement.  Thoughts?
