[jira] Commented: (LUCENE-1166) A tokenfilter to decompose compound words

Thomas Peuss (JIRA) Wed, 06 Feb 2008 09:35:33 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220
 ]


Thomas Peuss commented on LUCENE-1166:
--------------------------------------

bq. Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which 
seems to be the same information as in the off-hyphenation.zip file you 
attached to this issue, the license issue may be a problem - the hyphenation 
data is covered by different licenses on a per-language basis. For example, 
there are two German data files, and both are licensed under a LaTeX license, 
as is the Danish file, and these two languages are the most likely targets for 
your TokenFilter. IANAL, but unless Apache licenses can be secured for this 
data, I don't think the files can be incorporated directly into an Apache 
project.

This is true, and that's why I uploaded the two files without the ASF license 
grant. The FOP project does not keep these files in its code base either, 
because of the same licensing problem.

bq. Also, I don't see Swedish among the hyphenation data licenses - is it 
covered in some other way?
OFFO has no Swedish grammar file. We can generate one from the LaTeX grammar 
files. I will look into this tonight.

All other hyphenation implementations I have found so far use the LaTeX files 
either directly or in a converted variant, as the FOP code does. What we can do, 
of course, is ask the authors of the LaTeX files whether they are willing to 
license their work under the ASF license as well. It is worth a try, but I 
suspect that many of the email addresses in the LaTeX files are no longer in 
use. I will try to contact the authors of the German grammar files tomorrow.

BTW: here is an example for those who don't want to try the patch:
+Input token stream:+
Rindfleischüberwachungsgesetz Drahtschere abba

+Output token stream:+
(Rindfleischüberwachungsgesetz,0,29)
(Rind,0,4,posIncr=0)
(fleisch,4,11,posIncr=0)
(überwachung,11,22,posIncr=0)
(gesetz,23,29,posIncr=0)
(Drahtschere,30,41)
(Draht,30,35,posIncr=0)
(schere,35,41,posIncr=0)
(abba,42,46)
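
To make the output format above concrete, here is a minimal standalone sketch 
that produces the same tuples. It is not the code from the patch: it uses a 
greedy lookup in a tiny hard-coded word list instead of the FOP hyphenation 
classes, and the class name CompoundSketch, the DICT set and the decompose 
method are hypothetical names chosen for illustration only. The original token 
is emitted first, and each recognised part follows with posIncr=0 and its own 
offsets. A character where no dictionary word starts (such as the linking "s" 
before "gesetz") is simply skipped, which is why "gesetz" starts at offset 23.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class CompoundSketch {
    // Hypothetical toy word list; a real filter would use a dictionary
    // or a hyphenation grammar instead.
    static final Set<String> DICT =
        Set.of("rind", "fleisch", "überwachung", "gesetz", "draht", "schere");

    /** Returns the original token followed by the parts found inside it. */
    static List<String> decompose(String token, int offset) {
        List<String> out = new ArrayList<>();
        out.add("(" + token + "," + offset + "," + (offset + token.length()) + ")");
        int pos = 0;
        while (pos < token.length()) {
            // Find the longest dictionary word that starts at pos (case-insensitive).
            String best = null;
            for (String word : DICT) {
                if (token.regionMatches(true, pos, word, 0, word.length())
                        && (best == null || word.length() > best.length())) {
                    best = word;
                }
            }
            if (best == null) {
                pos++; // no word starts here, e.g. a linking "s"
                continue;
            }
            int end = pos + best.length();
            if (best.length() < token.length()) { // do not repeat a non-compound token
                out.add("(" + token.substring(pos, end) + "," + (offset + pos) + ","
                        + (offset + end) + ",posIncr=0)");
            }
            pos = end;
        }
        return out;
    }

    public static void main(String[] args) {
        int offset = 0;
        for (String token : "Rindfleischüberwachungsgesetz Drahtschere abba".split(" ")) {
            decompose(token, offset).forEach(System.out::println);
            offset += token.length() + 1; // +1 for the separating space
        }
    }
}
```

Skipping unmatched characters one at a time is the simplest possible recovery 
strategy and can produce false matches inside unknown words; it is only meant 
to show how the offsets and position increments line up.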

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose the compound words found in many Germanic languages 
> (like German, Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so 
> that you can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP 
> (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. 
> Currently I use the FOP jars directly. I only use a handful of classes from 
> the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the 
> packages of course) or should I stick with the dependency on the FOP jars? 
> The FOP code uses the ASF V2 license as well.
> What do you think?


