[ https://issues.apache.org/jira/browse/LUCENE-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566220#action_12566220 ]
Thomas Peuss commented on LUCENE-1166:
--------------------------------------

bq. Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the same information as in the off-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

This is true, and that's why I uploaded the two files without the ASF license grant. The FOP project does not keep the files in its code base either, because of the licensing problem.

bq. Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

OFFO has no Swedish grammar file. We can generate a Swedish grammar file from the LaTeX grammar files; I will look into this tonight. All other hyphenation implementations I have found so far use the LaTeX files either directly or in a converted variant, as the FOP code does.

What we can do, of course, is ask the authors of the LaTeX files whether they are willing to license their work under the ASF license as well. It is worth a try, but I suspect that many of the email addresses in the LaTeX files are no longer in use. I will try to contact the authors of the German grammar files tomorrow.
BTW: an example for those who don't want to try the patch:

+Input token stream:+
Rindfleischüberwachungsgesetz Drahtschere abba

+Output token stream:+
(Rindfleischüberwachungsgesetz,0,29) (Rind,0,4,posIncr=0) (fleisch,4,11,posIncr=0) (überwachung,11,22,posIncr=0) (gesetz,23,29,posIncr=0)
(Drahtschere,30,41) (Draht,30,35,posIncr=0) (schere,35,41,posIncr=0)
(abba,42,46)

> A tokenfilter to decompose compound words
> -----------------------------------------
>
>                 Key: LUCENE-1166
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1166
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Thomas Peuss
>         Attachments: CompoundTokenFilter.patch, de.xml, hyphenation.dtd
>
>
> A tokenfilter to decompose compound words you find in many Germanic languages
> (like German, Swedish, ...) into single tokens.
> An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so
> that you can find the word even when you only enter "Schiff".
> I use the hyphenation code from the Apache XML project FOP
> (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition.
> Currently I use the FOP jars directly. I only use a handful of classes from
> the FOP project.
> My question now:
> Would it be OK to copy these classes over to the Lucene project (renaming the
> packages of course) or should I stick with the dependency on the FOP jars?
> The FOP code uses the ASF V2 license as well.
> What do you think?
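
For readers who want to see how a token stream like the one in the example above could come about, here is a minimal sketch in plain Java. It is not the code from the patch: the patch uses the FOP hyphenation code for the first decomposition step, whereas this sketch assumes a simple greedy lookup in a small word list instead. The names CompoundSketch, Tok and decompose are hypothetical, and no Lucene classes are used. The original token is kept, and each part found is added at the same position (posIncr=0) with its own offsets; a character that starts no known word (such as the linking "s" in "überwachungsgesetz") is skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CompoundSketch {

    /** Hypothetical token holder: text, start/end offset, position increment. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        final int posIncr;

        Tok(String text, int start, int end, int posIncr) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.posIncr = posIncr;
        }

        @Override
        public String toString() {
            String base = "(" + text + "," + start + "," + end;
            return posIncr == 0 ? base + ",posIncr=0)" : base + ")";
        }
    }

    /**
     * Returns the original token followed by its parts, if at least two
     * dictionary words are found inside it. Assumption: dict holds
     * lower-case words; matching is greedy, longest word first.
     */
    static List<Tok> decompose(String token, int offset, Set<String> dict) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(token, offset, offset + token.length(), 1));

        String lower = token.toLowerCase(Locale.ROOT);
        List<Tok> parts = new ArrayList<>();
        int i = 0;
        while (i < lower.length()) {
            int best = -1;
            // Try the longest candidate starting at i first.
            for (int j = lower.length(); j > i; j--) {
                if (dict.contains(lower.substring(i, j))) {
                    best = j;
                    break;
                }
            }
            if (best < 0) {
                i++; // no word starts here, e.g. a linking "s"
                continue;
            }
            parts.add(new Tok(token.substring(i, best), offset + i, offset + best, 0));
            i = best;
        }
        if (parts.size() > 1) {
            out.addAll(parts);
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00fc is "ü"; escaped so the source compiles under any file encoding.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "rind", "fleisch", "\u00fcberwachung", "gesetz", "draht", "schere"));
        String input = "Rindfleisch\u00fcberwachungsgesetz Drahtschere abba";

        int offset = 0;
        for (String word : input.split(" ")) {
            for (Tok t : decompose(word, offset, dict)) {
                System.out.print(t + " ");
            }
            offset += word.length() + 1; // +1 for the separating space
        }
        System.out.println();
    }
}
```

With the six-word list used in main, the sketch should reproduce the offsets shown in the example. A real word list would need many more entries, and a greedy lookup like this can pick the wrong split where words overlap.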