wordnet parsing bug
-------------------
Key: LUCENE-2001
URL: https://issues.apache.org/jira/browse/LUCENE-2001
Project: Lucene - Java
Issue Type: Bug
Components: contrib/*
Affects Versions: 2.9
Reporter: Robert Muir
Priority: Minor
A user reported that wordnet parses the prolog file incorrectly.
Also need to check the wordnet parser in the memory contrib for this problem.
If this is a false alarm, i'm not worried, because the test will be the first
unit test wordnet package ever had.
{noformat}
For example, looking up the synsets for the
word "king", we get:
java SynLookup wnindex king
baron
magnate
mogul
power
queen
rex
scrofula
struma
tycoon
Here, "scrofula" and "struma" are extraneous. This happens because, the line
parser code in Syns2Index.java interpretes the two consecutive single quotes
in entry s(114144247,3,'king''s evil',n,1,1) in wn_s.pl file, as
termination
of the string and separates into "king". This entry concerns
synset of words "scrofula" and "struma", and thus they get inserted in the
synset of "king". *There 1382 such entries, in wn_s.pl* and more in other
WordNet
Prolog data-base files, where such use of two consecutive single quotes
appears.
We have resolved this by adding a statement in the line parsing portion of
Syns2Index.java, as follows:
// parse line
line = line.substring(2);
* line = line.replaceAll("\'\'", "`"); // added statement*
int comma = line.indexOf(',');
String num = line.substring(0, comma); ... ... etc.
In short we replace "''" by "`" (a back-quote). Then on recreating the
index, we get:
java SynLookup zwnindex king
baron
magnate
mogul
power
queen
rex
tycoon
{noformat}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]