Hi, thanks again for reporting this. I created an issue here: http://issues.apache.org/jira/browse/LUCENE-2001
On Wed, Oct 21, 2009 at 2:05 AM, parag dave <phdaveoffic...@gmail.com>wrote: > While using the Lucene WordNet package, we found that the Syns2Index > program > indexes the Synsets wrongly. For example, looking up the synsets for the > word "king", we get: > > java SynLookup wnindex king > baron > magnate > mogul > power > queen > rex > scrofula > struma > tycoon > > Here, "scrofula" and "struma" are extraneous. This happens because, the > line > parser code in Syns2Index.java interpretes the two consecutive single > quotes > in entry s(114144247,3,'king''s evil',n,1,1) in wn_s.pl file, as > termination > of the string and separates into "king". This entry concerns > synset of words "scrofula" and "struma", and thus they get inserted in the > synset of "king". *There 1382 such entries, in wn_s.pl* and more in other > WordNet > Prolog data-base files, where such use of two consecutive single quotes > appears. > > We have resolved this by adding a statement in the line parsing portion of > Syns2Index.java, as follows: > > // parse line > line = line.substring(2); > * line = line.replaceAll("\'\'", "`"); // added statement* > int comma = line.indexOf(','); > String num = line.substring(0, comma); ... ... etc. > In short we replace "''" by "`" (a back-quote). Then on recreating the > index, we get: > > java SynLookup zwnindex king > baron > magnate > mogul > power > queen > rex > tycoon > > *Recently lucene-2.9.0 has been released, but wordnet package included in > it > still has the same problem given above.* > > -- Parag H. Dave > -- Robert Muir rcm...@gmail.com