CTAKES-63 - Lucene search breaks with a dash (-) and special tokens such as brackets ]

Chen, Pei Mon, 01 Oct 2012 14:34:33 -0700

Hi folks,
I was looking into the bug https://issues.apache.org/jira/browse/CTAKES-63,
where the Lucene dictionary lookup breaks on a search string such as:
"mailto:[email protected]<mailto:[email protected]>]"
After some debugging, I found that this happens when the token contains a dash
(-) and also contains a special character such as the right bracket (]).
I believe all of the characters in the str token passed to QueryParser should
be escaped to avoid issues such as a token ending with ']'.


Before we add and test the proposed fix (adding an escape() call, as shown
below), I also noticed another potential issue: we search and replace all
dashes with spaces.  I just wanted to make sure that this was done
intentionally and works because the dashes have already been removed in the
index.  Otherwise, we'll need to replace the dash with a '?' instead of a
space, or use a PhraseQuery instead of a TermQuery.  It would be great if
someone familiar with this bit of code could confirm...

LuceneDictionaryImpl.java (dictionary-lookup) [~Line 106]

              if (str.indexOf('-') == -1) {
                     q = new TermQuery(new Term(iv_lookupFieldName, str));
                     topDoc = iv_searcher.search(q, iv_maxHits);
              }
              else {  // needed the KeywordAnalyzer for situations where the hyphen was included in the f-word
                     QueryParser query = new QueryParser(Version.LUCENE_30, iv_lookupFieldName, new KeywordAnalyzer());
                     try {
                            // original call:
                            //topDoc = iv_searcher.search(query.parse(str.replace('-', ' ')), iv_maxHits);
                            // proposed fix: escape query-syntax characters before parsing
                            String escaped = QueryParser.escape(str.replace('-', ' '));
                            topDoc = iv_searcher.search(query.parse(escaped), iv_maxHits);
                     } catch (ParseException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                     }
              }
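
To make the effect of the proposed fix easier to see without a Lucene
dependency, here is a rough, self-contained sketch of the "replace dashes,
then escape" step. The class QueryEscapeSketch and its methods are
hypothetical stand-ins, not Lucene code, and the list of special characters
is my assumption about what QueryParser.escape() handles in the 3.x line, so
please check it against the Lucene version we actually ship.

```java
// Hypothetical stand-in for the escaping step; this is NOT Lucene's code.
final class QueryEscapeSketch {

    // Assumption: characters treated as query syntax by the classic QueryParser.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Prefix every assumed-special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Mirrors the order used in the proposed fix: dashes become spaces first,
    // then whatever remains is escaped.
    static String prepare(String token) {
        return escape(token.replace('-', ' '));
    }

    public static void main(String[] args) {
        // A token with a dash and a trailing right bracket.
        System.out.println(prepare("x-ray]")); // prints: x ray\]
    }
}
```

Note that because the dash is replaced before escaping, the dash itself never
reaches the escape step; only the remaining characters such as ']' get a
backslash. The space left behind is not escaped, which is why the question
above about how the index treats dashes still matters.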
