[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-07-02 Thread JIRA


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059139#comment-13059139
 ] 

Michał Dybizbański commented on LUCENE-2341:


Thanks :)

> explore morfologik integration
> --
>
> Key: LUCENE-2341
> URL: https://issues.apache.org/jira/browse/LUCENE-2341
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Robert Muir
>Assignee: Dawid Weiss
> Fix For: 4.0
>
> Attachments: LUCENE-2341.diff, LUCENE-2341.diff, LUCENE-2341.diff, 
> LUCENE-2341.diff, LUCENE-2341.patch, LUCENE-2341.patch, 
> morfologik-fsa-1.5.2.jar, morfologik-polish-1.5.2.jar, 
> morfologik-stemming-1.5.2.jar
>
>
> Dawid Weiss mentioned on LUCENE-2298 that there is another Polish stemmer 
> available:
> http://sourceforge.net/projects/morfologik/
> This works differently than LUCENE-2298, and ideally would be another option 
> for users.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-29 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057119#comment-13057119
 ] 

Dawid Weiss commented on LUCENE-2341:
-

You do like those pesky .toString() calls, don't you? :) I changed the code 
slightly to keep character sequences only; there is no need to create new 
objects. I also changed the implementation a bit to iterate from the start of 
the returned list: theoretically, lemmas should be ordered by probability (in 
practice that's not the case, but it may be in the future).

All looks good, committed in. Thanks!


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-28 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057034#comment-13057034
 ] 

Dawid Weiss commented on LUCENE-2341:
-

Thanks Michał. I'll review it later today and commit in if there are no 
objections. As for the deleted line -- yes, it was intentional; we'll piggyback 
in this patch unless somebody fixes it earlier, no problem.

Dawid


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-28 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056442#comment-13056442
 ] 

Dawid Weiss commented on LUCENE-2341:
-

I've cleaned up the patch, but I'd still address the two TODOs that I left in 
the code:

- lowercasing should be done not at the external filter level, but inside the 
filter as a fallback IF AND ONLY IF the original sequence is not found in the 
dictionary. Morfeusz and Morfologik do have uppercase surface forms and do 
treat them differently (returning uppercase lemmas, for example). A test for 
this would be nice as well. An example of an uppercase/mixed surface form: AGD, 
Aaron, Poznania.

- I'd expose another attribute with morphosyntactic annotations -- this is 
something that is there anyway, so why not expose it.
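
The first TODO (exact lookup first, lowercase only as a fallback) can be sketched roughly as follows. This is a JDK-only toy: `CaseFallbackLookup`, its methods, and the three dictionary entries are hypothetical illustrations, not Morfologik's API or data.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy dictionary; it only models the lookup order discussed above. */
final class CaseFallbackLookup {
    private static final Locale POLISH = Locale.forLanguageTag("pl");
    private final Map<String, List<String>> lemmasBySurfaceForm = new HashMap<>();

    void put(String surfaceForm, String... lemmas) {
        lemmasBySurfaceForm.put(surfaceForm, Arrays.asList(lemmas));
    }

    /** Tries the form exactly as written; lowercases IF AND ONLY IF that finds nothing. */
    List<String> lookup(String surfaceForm) {
        List<String> lemmas = lemmasBySurfaceForm.get(surfaceForm);
        if (lemmas == null) {
            lemmas = lemmasBySurfaceForm.get(surfaceForm.toLowerCase(POLISH));
        }
        return lemmas != null ? lemmas : Collections.<String>emptyList();
    }

    /** Builds illustrative entries (assumed for this sketch, not taken from a real dictionary). */
    static CaseFallbackLookup sample() {
        CaseFallbackLookup d = new CaseFallbackLookup();
        d.put("Poznania", "Pozna\u0144"); // uppercase surface form, uppercase lemma
        d.put("poznania", "poznanie");    // lowercase form with a different lemma
        d.put("domu", "dom");             // only the lowercase form is stored
        return d;
    }

    public static void main(String[] args) {
        CaseFallbackLookup d = sample();
        System.out.println(d.lookup("Poznania")); // exact hit, no lowercasing
        System.out.println(d.lookup("poznania")); // exact hit on the lowercase entry
        System.out.println(d.lookup("Domu"));     // no exact hit, falls back to "domu"
    }
}
```

Lowercasing in an external filter ahead of the lookup would make the first entry unreachable, which is the reason for moving the fallback inside the filter.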

I attached a git diff, but it should apply with patch -p1 < ... too. Michał, 
will you have the time to polish this off?


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-28 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056430#comment-13056430
 ] 

Dawid Weiss commented on LUCENE-2341:
-

Working on the integration, will provide a final patch before committing. 
Thanks Michał.


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-27 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056261#comment-13056261
 ] 

Robert Muir commented on LUCENE-2341:
-

{quote}
provided each thread obtains its own TokenStreamComponents through 
ReusableAnalyzerBase.createComponents (is this always the case? Looking at 
other filters, they don't look thread-safe either ...)
{quote}

Yes, it's the case that Analyzer/ReusableAnalyzerBase take care of this with a 
thread-local, as long as each thread only needs to use one token stream at a 
time (which is true for all Lucene consumers); see:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/analysis/Analyzer.java
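
The thread-local reuse described here can be modelled with the JDK alone. The sketch below is not Lucene's Analyzer code; `PerThreadReuse` and `Components` are hypothetical names standing in for the analyzer and its per-thread token stream components.

```java
/** Hypothetical model of per-thread component reuse; not Lucene's actual Analyzer code. */
final class PerThreadReuse {
    /** Stand-in for the tokenizer/filter chain that one thread reuses between documents. */
    static final class Components {
        final StringBuilder scratch = new StringBuilder();
    }

    private final ThreadLocal<Components> perThread = new ThreadLocal<Components>() {
        @Override
        protected Components initialValue() {
            return new Components(); // created lazily, once per thread
        }
    };

    /** Returns the calling thread's own components, creating them on first use. */
    Components components() {
        return perThread.get();
    }

    /** Fetches the components as seen from a freshly started thread. */
    Components componentsOnNewThread() {
        final Components[] seen = new Components[1];
        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                seen[0] = components();
            }
        });
        t.start();
        try {
            t.join(); // join() makes the write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        PerThreadReuse analyzer = new PerThreadReuse();
        // Same thread: the same instance is handed back on every call.
        System.out.println(analyzer.components() == analyzer.components());
        // Different thread: a separate instance, so non-thread-safe state is never shared.
        System.out.println(analyzer.components() == analyzer.componentsOnNewThread());
    }
}
```

Under this scheme a filter only has to be safe for use by one thread at a time, which matches the "one token stream at a time per thread" condition above.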



[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-21 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053079#comment-13053079
 ] 

Dawid Weiss commented on LUCENE-2341:
-

bq. Dawid, do you think it's reasonable to optimize further and use directly a 
list returned by IStemmer.lookup (instead of copying with addAll) ? My concern 
is that (at least in current DictionaryLookup implementation) that list seems 
to be shared by distinct invocations of the lookup method, which would make the 
use of a specific IStemmer not applicable in thread-safe code.

IStemmer implementations are not thread safe anyway, so there is no problem in 
reusing that list. In fact, the returned WordData objects are reused internally 
as well, so you can't store them either (this is done to avoid GC overhead). 

So yes: I missed that, but you'll need to ensure IStemmer instances are not 
shared. This can be done in various ways (thread local, etc), but I think the 
simplest way to do it would be to instantiate PolishStemmer at the 
MorfologikFilter level. This is cheap (the dictionary is loaded once anyway). 

You can then create two constructors in the analyzer -- one with 
PolishStemmer.DICTIONARY and one with the default (I'd suggest MORFOLOGIK). 
Exposing an IStemmer constructor will do more harm than good -- thinking ahead 
is good, but in this case I don't think there will be that many people 
interested in subclassing IStemmer (if anything, they'll plug into Lucene's 
infrastructure directly).

A simple test case spawning 5 or 10 threads in a parallel executor and 
crunching stems on the same analyzer would also be nice to ensure we have 
everything correct wrt multithreading, but it's not that crucial if you don't 
have the time to write it.
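
A test of that shape might look like the sketch below. It uses only the JDK: `ReusingStemmer` is a hypothetical stand-in for a non-thread-safe stemmer that reuses its result list, and each worker creates its own instance, mirroring the "one stemmer per filter" suggestion. It is not the test that was committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a stemmer that reuses one result list across calls. */
final class ReusingStemmer {
    private final List<String> shared = new ArrayList<>(); // reused to avoid garbage

    /** Not thread safe: the returned list is overwritten by the next call. */
    List<String> lookup(String word) {
        shared.clear();
        shared.add(word.toLowerCase(Locale.ROOT));
        return shared;
    }
}

final class PerWorkerStemmerCheck {
    /** Runs the given number of workers, each with its own stemmer; returns the count of wrong results. */
    static int run(int threads, final int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String word = "Word" + t;
                results.add(pool.submit(new Callable<Integer>() {
                    @Override
                    public Integer call() {
                        ReusingStemmer stemmer = new ReusingStemmer(); // never shared
                        String expected = word.toLowerCase(Locale.ROOT);
                        int errors = 0;
                        for (int i = 0; i < iterations; i++) {
                            List<String> stems = stemmer.lookup(word);
                            if (stems.size() != 1 || !stems.get(0).equals(expected)) {
                                errors++;
                            }
                        }
                        return errors;
                    }
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("errors: " + run(10, 100_000));
    }
}
```

If the workers shared a single `ReusingStemmer` instead, one thread could observe a list that another thread is in the middle of rewriting, so the check would no longer be reliable.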

Thanks!


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-21 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052483#comment-13052483
 ] 

Dawid Weiss commented on LUCENE-2341:
-

I've just published morfologik 1.5.2, Michał. This comes with two dictionaries 
(morfologik and morfeusz) that can be used as one (fallback for missing words) 
or separately, but I would stick to using morfologik as the default dictionary 
(possibly with an option of using morfeusz?). POS tags have a different 
notation in these two resources, so mixing both is probably not a good idea.

Will you update the patch? Thanks.


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052451#comment-13052451
 ] 

Robert Muir commented on LUCENE-2341:
-

{quote}
Eventually it would be probably sensible to limit the automaton for use in 
Lucene to store surface forms and lemmas only (no POS tags) and merge both 
dictionaries into a single automaton... but this can be a future improvement.
{quote}

Or alternatively, you can expose the POS tags for each stem to Lucene, right? 
The easiest way would be to put them into TypeAttribute (a string), but you 
could make your own strongly-typed attribute if that's a better fit.

This could be useful for downstream processing.



[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-21 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052423#comment-13052423
 ] 

Dawid Weiss commented on LUCENE-2341:
-

One note wrt patch: I would use an explicit pointer over a list of returned 
WordData entries instead of adding them to a local list:

private List<WordData> stemsAcc = new ArrayList<WordData>();

Right now you're shifting the internal array on each call unnecessarily (just 
increase an int ptr instead):

+  termAtt.setEmpty().append(stemsAcc.remove(0).getStem().toString());

getStem() should also be enough since it's a CharSequence, right? No need for 
an intermediate String.
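
The difference between `remove(0)` and an explicit pointer can be shown with a small JDK-only sketch. `StemCursor` and its method names are hypothetical and are not part of the patch or of Morfologik; the list is copied here only to keep the toy self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: walks a list of stems with an int pointer instead of remove(0). */
final class StemCursor {
    private final List<CharSequence> stems = new ArrayList<>();
    private int ptr; // index of the next stem to hand out

    /** Replaces the buffered stems and rewinds the pointer. */
    void fill(List<? extends CharSequence> lookupResult) {
        stems.clear();
        stems.addAll(lookupResult);
        ptr = 0;
    }

    boolean hasNext() {
        return ptr < stems.size();
    }

    /** O(1) per call; ArrayList.remove(0) would shift every remaining element instead. */
    CharSequence next() {
        return stems.get(ptr++);
    }

    public static void main(String[] args) {
        StemCursor cursor = new StemCursor();
        cursor.fill(Arrays.asList("hak", "hakowanie"));
        StringBuilder term = new StringBuilder();
        while (cursor.hasNext()) {
            term.setLength(0);
            term.append(cursor.next()); // append(CharSequence): no intermediate String
            System.out.println(term);
        }
    }
}
```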


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-21 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052421#comment-13052421
 ] 

Dawid Weiss commented on LUCENE-2341:
-

I did some analyses on both dictionaries.
{noformat}
Number of lines (distinct surface forms):

  3.662.366 morfologik.utf8
  5.086.141 sgjp.utf8

Distinct words (not in both):

  2.729.334 unique.utf8

  - upper/lower case (morfologik has upper case forms, morfeusz only lower case 
surface forms)

acerze
Acerze

  - very rare or jargon;

abszminka
abszytowałem
acetobakteria
acetarsolowi
niebombiasto
hakatystce
hakatystycznościach
warzże

  - differences in spelling;

abelard
abélard

  - acronyms and super-short stuff

aap
aar

Distinct normalized (lowercase):

  2.564.366 lowered.utf8

  Most of these are very infrequent words or inflection forms. There are minor 
differences or
  missing surface forms in both dictionaries, as in here (mz - morfeusz, mk - 
morfologik):

mz> hakersko
mz> hakerskość
mz> hakerskości
mz> hakerskością
mz> hakerskościach
mz> hakerskościami
mz> hakerskościom
mk> hakerstw
mk> hakerstwa
...
mk> hakowałyśmy
mk> hakowań
mk> hakowaniach
mk> hakowaniami
mk> hakowaniom
mz> hakowatość
mz> hakowatości
mz> hakowatością
mz> hakowatościach
mz> hakowatościami
mz> hakowatościom
{noformat}

So... the conclusion is pretty consistent with Zipf's law: both dictionaries 
have a fairly different coverage, even if they're quite large. We don't have a 
frequency dictionary for Polish, but I assume most of these surface forms are 
purely theoretical and occur super-rarely in practice. This said, I think we 
should use BOTH dictionaries -- after all there's no harm done if we overdo the 
lemmatization process a little bit, is there?
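
The two counts above ("not in both", then the same after lowercasing) amount to a symmetric difference of the two word lists. A minimal sketch of that comparison follows; `DictionaryDiff` is a hypothetical name, the sample sets reuse a few forms quoted above purely as toy input, and this is not the tooling that produced the real numbers.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper comparing two sets of surface forms. */
final class DictionaryDiff {
    /** Forms that occur in exactly one of the two sets (symmetric difference). */
    static Set<String> notInBoth(Set<String> a, Set<String> b) {
        Set<String> union = new TreeSet<>(a);
        union.addAll(b);
        Set<String> common = new TreeSet<>(a);
        common.retainAll(b);
        union.removeAll(common);
        return union;
    }

    /** Lowercased copy, so that case-only differences disappear. */
    static Set<String> lowered(Set<String> forms) {
        Set<String> result = new TreeSet<>();
        for (String form : forms) {
            result.add(form.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy input only; the real dictionaries hold millions of forms.
        Set<String> mk = new TreeSet<>(Arrays.asList("acerze", "Acerze", "hakerstwa"));
        Set<String> mz = new TreeSet<>(Arrays.asList("acerze", "hakersko"));
        System.out.println(notInBoth(mk, mz));
        System.out.println(notInBoth(lowered(mk), lowered(mz)));
    }
}
```

On the toy input the case-only difference ("Acerze") drops out after normalization, which is the effect behind the smaller lowercase count.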

So... my proposal would be this: I'll integrate Morfeusz's dictionary in 
Morfologik (as an alternative dictionary one can load and use). 

Eventually it would be probably sensible to limit the automaton for use in 
Lucene to store surface forms and lemmas only (no POS tags) and merge both 
dictionaries into a single automaton... but this can be a future improvement.




[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-20 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052380#comment-13052380
 ] 

Dawid Weiss commented on LUCENE-2341:
-

I'll take a look at the differences between Morfologik and Morfeusz right now, 
actually. I'll post the results once I have something.


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-20 Thread Dawid Weiss (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052376#comment-13052376
 ] 

Dawid Weiss commented on LUCENE-2341:
-

Thanks for the contribution, Michał. 

Robert: the dictionary is licensed under MPL or CC-SA (to be selected by the 
user depending on one's needs). Do you know which one is preferable?

Michał: there is also another (much larger) dictionary that has been released 
recently and comes from the Morfeusz project: 
http://sgjp.pl/morfeusz/dopobrania.html
This dictionary is actually licensed under the BSD license, so no legal worries 
at all. Both dictionaries are nearly identical (they differ slightly in their 
convention of morphosyntactic annotations) and Morfeusz's dictionary could be 
compiled into an automaton for use with Morfologik.

Which way should we go? What do you think?


[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-20 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052251#comment-13052251
 ] 

Robert Muir commented on LUCENE-2341:
-

Sorry, in my second comment I was confusing this with what you have for 
the morfologik jar itself, which is correct :)

What I should have said was: I think we should include this information in the 
top-level modules/analysis/LICENSE.txt and modules/analysis/NOTICE.txt.






[jira] [Commented] (LUCENE-2341) explore morfologik integration

2011-06-20 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052246#comment-13052246
 ] 

Robert Muir commented on LUCENE-2341:
-

Hi Michał,

This patch looks great!

I took a quick glance, here are a couple suggestions:
* In the MorfologikFilter, I think we should implement reset(), first calling 
the superclass reset(), then clearing the stemsAcc list. This ensures that all 
of the filter's state is cleared before it is reused. Under normal operation 
this should not be necessary, but some consumers in Lucene (e.g. 
LimitTokenCountFilter, and some similar code in the Highlighter) will only 
consume the stream up to some point, then suddenly stop. By clearing this list 
in reset() we ensure that there is no chance any leftover stems will appear in 
the next stream.
* Because the data is licensed under MPL, I think we should, if possible, 
explicitly link to the source code in the NOTICE.txt. I saw you included some 
wording in LICENSE.txt, but I think this should only say 'XYZ data is under 
this license', with the actual MPL license text. In the NOTICE.txt we should 
link to the source code, I think... there is some more information on this 
under the section "Category B: Reciprocal Licenses" at 
http://www.apache.org/legal/3party.html
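
The reset() concern in the first point can be reproduced without Lucene. In the sketch below, `BufferedStemFilterModel` is a hypothetical model of a filter that buffers several stems per token; in the real filter, reset() would call the superclass reset() first and then clear the stemsAcc list, as described above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, Lucene-free model of a filter that buffers several stems per input token. */
final class BufferedStemFilterModel {
    private final Deque<String> pendingStems = new ArrayDeque<>();
    private Iterator<String> input = Collections.<String>emptyIterator();

    /** Points the filter at a new token source, as happens when a stream is reused. */
    void setInput(List<String> tokens) {
        input = tokens.iterator();
    }

    /** Clears buffered state so stems from a previous stream cannot leak into the next one. */
    void reset() {
        pendingStems.clear();
    }

    /** Returns the next stem, or null when the input is exhausted. */
    String next() {
        if (pendingStems.isEmpty()) {
            if (!input.hasNext()) {
                return null;
            }
            // Toy "stemmer": every token yields two outputs.
            String token = input.next();
            pendingStems.add(token + "#1");
            pendingStems.add(token + "#2");
        }
        return pendingStems.poll();
    }

    public static void main(String[] args) {
        BufferedStemFilterModel filter = new BufferedStemFilterModel();
        filter.setInput(Arrays.asList("a"));
        System.out.println(filter.next()); // consumer stops after the first stem

        filter.setInput(Arrays.asList("b"));
        filter.reset();                    // without this call, the leftover stem of "a" comes out first
        System.out.println(filter.next());
    }
}
```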


