[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286474#comment-13286474 ]

Chris Male commented on LUCENE-4019:

Hi Luca,

Thanks for taking a shot at this.

I wonder whether we can improve the ParseException message? At the very least it should include the line that is causing the problem so people can find it. What would be even better is if we also included the line number. The latter is probably not so urgent, but it would be handy to have for other parsing errors too.

Also I think the changes to the Factory are wrong:

{code}
+ if(strictAffixParsing.equalsIgnoreCase(TRUE)) ignoreCase = true;
+ else if(strictAffixParsing.equalsIgnoreCase(FALSE)) ignoreCase = false;
{code}

Parsing Hunspell affix rules without regexp condition

Key: LUCENE-4019
URL: https://issues.apache.org/jira/browse/LUCENE-4019
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
Attachments: LUCENE-4019.patch, LUCENE-4019.patch

We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following:

{code}
SFX Na N 1
SFX Na 0 ste
{code}

The rule on the second line doesn't contain the 5th parameter, which should be the condition (usually a regexp). You usually see a '.' as the condition, meaning always (for every character). As explained in LUCENE-3976, the readAffix method throws an error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand, I haven't found any information about this within the spec. Any thoughts?

-- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
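As a rough illustration of the error-message suggestion in the comment above, here is a minimal sketch of a ParseException that carries both the offending line and its line number. The class AffixErrorMessages, the helper checkRuleLines and the constant MIN_RULE_ELEMENTS are hypothetical names made up for this example; this is not the actual readAffix code. Passing the line number as the errorOffset argument of ParseException is just one possible choice.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.text.ParseException;

final class AffixErrorMessages {
    // Assumption for this sketch: every rule line needs at least five elements.
    static final int MIN_RULE_ELEMENTS = 5;

    // Hypothetical helper (not the real readAffix): expects rule lines only
    // and fails on the first line that has too few elements.
    static void checkRuleLines(LineNumberReader reader) throws IOException, ParseException {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < MIN_RULE_ELEMENTS) {
                // getLineNumber() returns 1 after the first readLine() call.
                int lineNumber = reader.getLineNumber();
                throw new ParseException(
                    "Affix rule has fewer than " + MIN_RULE_ELEMENTS
                        + " elements at line " + lineNumber + ": '" + line + "'",
                    lineNumber);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The second rule is missing its condition, as in the Dutch dictionaries.
        String rules = "SFX Na 0 e .\nSFX Na 0 ste\n";
        try {
            checkRuleLines(new LineNumberReader(new StringReader(rules)));
        } catch (ParseException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the message built this way, someone hitting the error can search the .aff file for the quoted text or jump straight to the reported line.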
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284393#comment-13284393 ]

Chris Male commented on LUCENE-4019:

Hi Luca,

Sorry for taking so long to get to this. The patch looks good and seems to fix the problem. I think we do need some way to force 'strict' parsing of the files. Do you think you can add an option for that? When strict parsing is enabled, lines without the expected number of elements cause an error. We can even have this enabled by default, so users have to explicitly say that they know the dictionary doesn't conform to our standard and are okay with us silently ignoring bad rules.
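The strict/lenient behaviour described in this comment could look roughly like the sketch below. The names StrictAffixParsing, readRules and RULE_ELEMENTS are assumptions made for the example and do not come from the patch. In strict mode a short rule raises an error; in lenient mode it is silently skipped.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StrictAffixParsing {
    // Assumption for this sketch: a well-formed rule has at least five elements.
    static final int RULE_ELEMENTS = 5;

    // Hypothetical helper: returns the rule lines that were accepted.
    static List<String> readRules(List<String> ruleLines, boolean strict) throws ParseException {
        List<String> accepted = new ArrayList<>();
        int lineNumber = 0;
        for (String line : ruleLines) {
            lineNumber++;
            String[] parts = line.trim().split("\\s+");
            if (parts.length < RULE_ELEMENTS) {
                if (strict) {
                    throw new ParseException(
                        "Rule with fewer than " + RULE_ELEMENTS + " elements: '" + line + "'",
                        lineNumber);
                }
                continue; // lenient mode: ignore the malformed rule
            }
            accepted.add(line);
        }
        return accepted;
    }

    public static void main(String[] args) throws ParseException {
        List<String> rules = Arrays.asList("SFX Na 0 e .", "SFX Na 0 ste");
        // Lenient parsing keeps only the well-formed rule.
        System.out.println(readRules(rules, false));
    }
}
```

Making strict the default, as suggested, would mean callers pass false only when they knowingly accept a non-conforming dictionary.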
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269516#comment-13269516 ]

Luca Cavanna commented on LUCENE-4019:

Thank you Robert for the explanation! In this specific case it's hard to understand the differences between hunspell and Lucene, since Lucene doesn't even parse the affix file. I've been in contact with the authors of those Dutch dictionaries, as well as with the hunspell author. It turned out that those affix rules are wrong and hunspell actually ignores them. I think it's better to ignore them in Lucene too, rather than throwing an exception, which makes it impossible to use those dictionaries at all.
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260612#comment-13260612 ]

Luca Cavanna commented on LUCENE-4019:

Robert, by spec I meant exactly your links :) Actually it's clear that the affix header has 4 elements while each rule has at least 5 elements. I don't really know what hunspell does with that kind of malformed rule. Lucene just throws an error while loading the dictionary. Looking at the hunspell source code, I might be wrong, but I suspect it just skips that specific rule with some warning. But honestly it's hard to believe that at least 4 of the dictionaries I tried contain mistaken rules, isn't it? I'll investigate more, thanks!
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260638#comment-13260638 ]

Robert Muir commented on LUCENE-4019:

It's tough to know for sure. In general, a lot of hunspell dictionaries cannot be parsed. There are a ton of these, under many strange licenses, and they are very large.

A test scaffolding of sorts could probably be done to hunt out problems:
* download all dictionaries you can find
* for each one, use hunspell command-line tools like munch, unmunch (which applies all the rules), etc. to generate some sort of expected output in .txt format.
* for each one, do the same using the hunspell parsing here.
* compare results: when things differ, try to boil it down to a compact .aff/.dic, with a test case, then fix and commit.
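The final "compare results" step of such a scaffolding could be sketched as below, assuming both sides have already been reduced to plain sets of expanded word forms (for example one set read from the unmunch output and one produced by the Lucene parsing). DictionaryDiff, missing and unexpected are hypothetical names, and the sample words are purely illustrative rather than real dictionary output.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class DictionaryDiff {
    // Forms the reference tool produced but our parsing did not.
    static Set<String> missing(Set<String> expected, Set<String> actual) {
        Set<String> result = new TreeSet<>(expected);
        result.removeAll(actual);
        return result;
    }

    // Forms our parsing produced but the reference tool did not.
    static Set<String> unexpected(Set<String> expected, Set<String> actual) {
        Set<String> result = new TreeSet<>(actual);
        result.removeAll(expected);
        return result;
    }

    public static void main(String[] args) {
        // Illustrative data only; a real run would load both sets from .txt files.
        Set<String> expected = new TreeSet<>(Arrays.asList("groot", "groter", "grootste"));
        Set<String> actual = new TreeSet<>(Arrays.asList("groot", "groter"));
        System.out.println("missing: " + missing(expected, actual));
        System.out.println("unexpected: " + unexpected(expected, actual));
    }
}
```

Any non-empty difference points at a rule worth reducing to a compact .aff/.dic test case.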