[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-31 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286474#comment-13286474
 ] 

Chris Male commented on LUCENE-4019:


Hi Luca,

Thanks for taking a shot at this.

I wonder whether we can improve the ParseException message. At the very 
least it should include the line that is causing the problem so people can find 
it.  What would be even better is if we also included the line number.  The 
latter is probably not so urgent, but it would be handy to have for other 
parsing errors too.
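
A minimal sketch of such a message, assuming a hypothetical helper named 
splitRule (it is not taken from the patch). It reports both the offending line 
and its line number, and reuses the line number as the errorOffset argument of 
java.text.ParseException, which is an assumption of this sketch:

```java
import java.text.ParseException;

final class AffixParseErrors {

  /**
   * Splits one affix rule line into its whitespace-separated elements.
   * lineNumber is the 1-based position of the line in the affix file.
   */
  static String[] splitRule(String line, int lineNumber) throws ParseException {
    String[] parts = line.trim().split("\\s+");
    if (parts.length < 5) {
      // Assumption: the line number doubles as ParseException's errorOffset.
      throw new ParseException(
          "Affix rule has fewer than five elements at line " + lineNumber + ": " + line,
          lineNumber);
    }
    return parts;
  }
}
```

With a message like this, someone loading a large dictionary can go straight to 
the offending rule instead of searching the whole .aff file.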

Also I think the changes to the Factory are wrong:

{code}
+  if(strictAffixParsing.equalsIgnoreCase(TRUE)) ignoreCase = true;
+  else if(strictAffixParsing.equalsIgnoreCase(FALSE)) ignoreCase = false;
{code}
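
One possible reading of the objection, which the comment as quoted here does 
not spell out, is that the quoted lines assign ignoreCase although they are 
parsing the strictAffixParsing argument. A sketch of a helper that returns the 
strict flag instead; the class name, method name, and error message are 
hypothetical and not taken from the patch:

```java
final class StrictFlagSketch {

  /**
   * Parses the strictAffixParsing argument into its own boolean,
   * leaving ignoreCase untouched.
   */
  static boolean parseStrictAffixParsing(String value, boolean defaultValue) {
    if (value == null) {
      return defaultValue; // argument not given
    }
    if (value.equalsIgnoreCase("true")) {
      return true;
    }
    if (value.equalsIgnoreCase("false")) {
      return false;
    }
    throw new IllegalArgumentException(
        "Unknown value for strictAffixParsing: " + value + ". Must be true or false");
  }
}
```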



 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
 Attachments: LUCENE-4019.patch, LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976 the 
 readAffix method throws error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-28 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284393#comment-13284393
 ] 

Chris Male commented on LUCENE-4019:


Hi Luca,

Sorry for taking so long to get to this.  Patch looks good and seems to fix the 
problem.  I think we do need some way to force 'strict' parsing of the files.  
Do you think you can add an option for that? When strict parsing is enabled, 
lines without the expected number of elements cause an error.  

We can even have this enabled by default so users have to explicitly say that 
they know the dictionary doesn't conform to our standard and are okay with us 
silently ignoring bad rules.
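
A rough sketch of how such an option could behave, under stated assumptions: 
the class and method names are hypothetical, the reader is much simpler than 
the real readAffix method, and only the SFX/PFX header and rule layout 
described in this issue is handled. In strict mode a rule with fewer than five 
elements raises a ParseException; otherwise the rule is skipped:

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;

final class StrictAffixParsingSketch {

  /** Returns the well-formed rules; malformed ones fail (strict) or are skipped. */
  static List<String[]> readRules(String affixText, boolean strict)
      throws IOException, ParseException {
    List<String[]> rules = new ArrayList<>();
    LineNumberReader reader = new LineNumberReader(new StringReader(affixText));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] header = line.trim().split("\\s+");
      // A header has four elements: type, flag, cross-product marker, rule count.
      boolean isHeader = header.length == 4
          && (header[0].equals("SFX") || header[0].equals("PFX"));
      if (!isHeader) {
        continue;
      }
      int numRules = Integer.parseInt(header[3]);
      for (int i = 0; i < numRules; i++) {
        String ruleLine = reader.readLine();
        if (ruleLine == null) {
          break; // file ended before all announced rules were read
        }
        String[] rule = ruleLine.trim().split("\\s+");
        if (rule.length < 5) {
          if (strict) {
            throw new ParseException(
                "Affix rule has fewer than five elements at line "
                    + reader.getLineNumber() + ": " + ruleLine,
                reader.getLineNumber());
          }
          continue; // lenient mode: ignore the bad rule
        }
        rules.add(rule);
      }
    }
    return rules;
  }
}
```

Defaulting strict to true would mean a call such as readRules(text, false) is 
an explicit opt-in to skipping malformed rules.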




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269516#comment-13269516
 ] 

Luca Cavanna commented on LUCENE-4019:
--

Thank you Robert for the explanation!
In this specific case it's hard to understand the differences between hunspell 
and Lucene, since Lucene doesn't even parse the affix file.
I've been in contact with the authors of those Dutch dictionaries, as well as 
with the hunspell author. It turned out that those affix rules are wrong and 
hunspell actually ignores them. I think it's better to ignore them in Lucene 
too, rather than throwing an exception, which makes it impossible to use those 
dictionaries at all.




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-04-24 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260612#comment-13260612
 ] 

Luca Cavanna commented on LUCENE-4019:
--

Robert, by spec I meant exactly your links :)
Actually it's clear that the affix header has 4 elements while each rule has at 
least 5 elements. I don't really know what hunspell does with malformed rules 
of that kind. Lucene just throws an error while loading the dictionary. 
Looking at the hunspell source code, I might be wrong but I suspect it just 
skips that specific rule with some warning. But honestly it's hard to believe 
that at least 4 dictionaries I tried contain mistaken rules, isn't it? I'll 
investigate more, thanks!




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-04-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260638#comment-13260638
 ] 

Robert Muir commented on LUCENE-4019:
-

It's tough to know for sure. In general, a lot of hunspell dictionaries cannot be 
parsed.
There are a ton of these, under many strange licenses, and they are very large.

Some test scaffolding could probably be built to hunt out problems:
* download all dictionaries you can find
* for each one, use hunspell command-line tools like munch, unmunch (which 
applies all the rules), etc
  to generate some sort of expected output in .txt format.
* for each one, do the same using the hunspell parsing here.
* compare results: when things differ, try to boil it down to a compact 
.aff/.dic, with a test case and fix and commit.
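
The comparison step in the list above could be sketched roughly as follows. 
The class and method names are hypothetical, and the sketch assumes both 
outputs have already been loaded as collections of words, one per line of the 
.txt files:

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

final class DictionaryDiffSketch {

  /** Words in the reference output that the parser under test did not produce. */
  static SortedSet<String> missing(Collection<String> expected, Collection<String> actual) {
    SortedSet<String> result = new TreeSet<>(expected);
    result.removeAll(actual);
    return result;
  }

  /** Words the parser under test produced that the reference output lacks. */
  static SortedSet<String> unexpected(Collection<String> expected, Collection<String> actual) {
    return missing(actual, expected);
  }
}
```

A dictionary for which both sets are empty matches the reference output; any 
other result is a candidate for boiling down to a compact .aff/.dic test case.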
