Hi folks,
I was looking into the bug https://issues.apache.org/jira/browse/CTAKES-63,
where the Lucene dictionary lookup breaks on a search string such as:
"mailto:[email protected]<mailto:[email protected]>]"
After some debugging, I found that this happens when the token contains a
dash (-) and also contains a special character such as a right bracket (]).
I believe all of the special characters in the str token passed to the
QueryParser should be escaped to avoid issues such as a token ending with ']'.
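To make the escaping idea concrete, here is a minimal standalone sketch of what backslash-escaping the query-syntax characters does to a token. It is only an approximation of QueryParser.escape(): the class name EscapeSketch and the sample token are made up for illustration, and the character set is my assumption for Lucene 3.0, so the real method should be used in the actual fix.

```java
// Standalone sketch; EscapeSketch is a hypothetical name, not cTAKES or Lucene code.
final class EscapeSketch {

    // Assumption: the query-syntax characters for Lucene 3.0. Other Lucene
    // versions may treat additional characters (for example '/') as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Prefix every query-syntax character with a backslash so the parser
    // reads it as a literal character instead of an operator.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A token with a dash and a trailing right bracket.
        System.out.println(escape("x-ray]")); // prints x\-ray\]
    }
}
```

With the trailing ']' escaped, the parser no longer sees an unbalanced range bracket, which is the failure mode described above.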
Before we add and test the proposed fix (an escape() call, as shown below), I
also noticed another potential issue: we replace all dashes with spaces. I
just wanted to make sure that this is intentional and works because the
dashes have already been removed in the index. Otherwise, we'll need to
replace the dash with a '?' instead of a space, or use a PhraseQuery instead
of a TermQuery. It would be great if someone familiar with this bit of code
could confirm...
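One detail worth keeping in mind if we go the '?' route: the escaping also covers '?' and '-', so the order of the two steps matters. The sketch below is standalone and only illustrative. DashHandlingSketch, dashToSpace and dashToWildcard are hypothetical names, and escape() is a stand-in for QueryParser.escape() with the Lucene 3.0 character set assumed.

```java
// Standalone sketch; all names here are hypothetical, not cTAKES or Lucene code.
final class DashHandlingSketch {

    // Assumption: the query-syntax characters for Lucene 3.0.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Stand-in for QueryParser.escape(): backslash-escape special characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Option A (the proposed fix): turn dashes into spaces, then escape.
    static String dashToSpace(String token) {
        return escape(token.replace('-', ' '));
    }

    // Option B: escape first, then turn each escaped dash into an unescaped
    // '?' so that it survives as a single-character wildcard.
    static String dashToWildcard(String token) {
        return escape(token).replace("\\-", "?");
    }

    public static void main(String[] args) {
        String token = "x-ray]";
        System.out.println(dashToSpace(token));              // prints x ray\]
        System.out.println(dashToWildcard(token));           // prints x?ray\]
        // Replacing first and escaping afterwards loses the wildcard:
        System.out.println(escape(token.replace('-', '?'))); // prints x\?ray\]
    }
}
```

Whether option A or option B is the right one still depends on how the dashes were handled when the index was built, which is the open question above.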
LuceneDictionaryImpl.java (dictionary-lookup) [~Line 106]
if (str.indexOf('-') == -1) {
    q = new TermQuery(new Term(iv_lookupFieldName, str));
    topDoc = iv_searcher.search(q, iv_maxHits);
}
else { // needed the KeywordAnalyzer for situations where the
       // hyphen was included in the f-word
    QueryParser query = new QueryParser(Version.LUCENE_30,
            iv_lookupFieldName, new KeywordAnalyzer());
    try {
        //topDoc = iv_searcher.search(query.parse(str.replace('-', ' ')), iv_maxHits);
        //proposed fix
        String escaped = QueryParser.escape(str.replace('-', ' '));
        topDoc = iv_searcher.search(query.parse(escaped), iv_maxHits);
    } catch (ParseException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}