Re: Token implementation

Michael McCandless Sat, 12 Jul 2008 04:55:49 -0700

Or we could leave termText() deprecated, add term() which does the same thing sub-optimally (ie, always creates a new String from the char[]), and in the javadocs for termText() state that you can migrate either to term() (if you really want a String and you understand the performance cost of doing so) or to the re-use APIs?
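
A minimal sketch of the arrangement proposed above, for readers following along. SketchToken is a simplified, hypothetical stand-in and not Lucene's actual Token class; only the names termText(), term(), termBuffer() and termLength() come from this thread, and setTermBuffer() is assumed here just to fill the buffer.

```java
// Simplified stand-in for illustration only; this is NOT Lucene's Token class.
final class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength;

    /** Copies text into the reusable buffer, growing it only when needed. */
    void setTermBuffer(char[] source, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(source, offset, termBuffer, 0, length);
        termLength = length;
    }

    /** Re-use API: the caller reads the first termLength() chars directly. */
    char[] termBuffer() { return termBuffer; }

    int termLength() { return termLength; }

    /**
     * Returns the term as a String. This allocates a new String on every
     * call, so prefer termBuffer()/termLength() on hot paths.
     */
    String term() { return new String(termBuffer, 0, termLength); }

    /**
     * @deprecated Use term() if you really need a String and accept the
     * allocation cost, or termBuffer()/termLength() for re-use.
     */
    @Deprecated
    String termText() { return term(); }
}

class SketchTokenDemo {
    public static void main(String[] args) {
        SketchToken token = new SketchToken();
        char[] text = "lucene".toCharArray();
        token.setTermBuffer(text, 0, text.length);
        System.out.println(token.term());
    }
}
```

With this shape, callers flagged by a deprecation warning get both migration paths spelled out in the javadoc of the deprecated method.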


Mike

DM Smith wrote:

Michael McCandless wrote:
Maybe we should un-deprecate the termText() method but add javadocs explaining that for better performance you should use the char[] reuse methods instead?
I think so, too. Should we leave it as deprecated until 3.0? With the performance note and the encouragement to go for re-use, but also with a note that the current implementation is deprecated, not the interface.
That's not quite what deprecated means. My thought on this is that it will give everyone a heads up that the current implementation is going away and that the replacement is sub-optimal.
(I use Eclipse and have it set to flag all deprecated uses. This helps me look for places to change.)
I think that this will make migration to 3.0 much easier.

With this, changing Term to add Term(String, Token) won't be necessary.

-- DM
Mike

DM Smith wrote:
Michael McCandless wrote:
DM Smith wrote:
Shouldn't Term have constructors that take a Token?
I think that makes sense, though normally Token appears during analysis and Term during searching (I think?) -- how often would you need to make a Term from a Token?
The problem I'm addressing is that tokens are used in contexts that need String and not char[].
The call to the deprecated
String termText = token.termText();
needs to be replaced with:
String termText = new String(token.termBuffer(), 0, token.termLength());
There are over 170 calls to token.termText(), and each of these places has to be modified. In some, perhaps many, of these cases it may be possible to use char[] directly to get a performance gain.
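
As a hedged illustration of what "use char[] directly" could look like, the sketch below compares a term buffer against a known word without building a String. TermBufferUtil and bufferEquals are hypothetical names invented for this example; they are not part of Lucene.

```java
// Hypothetical helper for illustration; not a Lucene API.
final class TermBufferUtil {
    private TermBufferUtil() {}

    /**
     * Returns true if the first length chars of buffer spell expected.
     * No String is allocated, which is where the performance gain comes from.
     */
    static boolean bufferEquals(char[] buffer, int length, String expected) {
        if (expected.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buffer[i] != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}

class TermBufferUtilDemo {
    public static void main(String[] args) {
        // A reusable buffer that is larger than the term it currently holds.
        char[] buffer = new char[16];
        "stopword".getChars(0, 8, buffer, 0);
        System.out.println(TermBufferUtil.bufferEquals(buffer, 8, "stopword")); // prints true
        System.out.println(TermBufferUtil.bufferEquals(buffer, 4, "stopword")); // prints false
    }
}
```

Call sites that only test or copy the term text could take this route; call sites that must hand a String to another API (such as the Term constructor discussed here) still need the conversion.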
In the case of Term, changing it to work with char[] buffer, int start, int length, does not seem quite right. I think the ripple would keep getting bigger. But logically, the Term's text is the text of a Token.
To me it makes sense to have a method that returns the token as a String, but that method is deprecated and the suggested replacement is to directly use the buffer. So this leads to the above construct. Perhaps it would be good to add a new method and document that as one of two replacements.
public String term() {
  // As a method on Token, this reads the token's own buffer.
  return termText != null ? termText : new String(termBuffer(), 0, termLength());
}
Here is an example from QueryParser that has 5 instances, each calling the deprecated t.termText() method. In this example, there is the construction of a query from a token stream.
Each of the problem lines follows the pattern:
TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));
To remove the deprecated call to t.termText(), the Token's buffer needs to be marshalled with something like:
String termText = new String(t.termBuffer(), 0, t.termLength());
TermQuery currentQuery = new TermQuery(new Term(field, termText));

/**
* @exception ParseException throw in overridden method to disallow
*/
protected Query getFieldQuery(String field, String queryText) throws ParseException {
 // Use the analyzer to get all the tokens, and then build a TermQuery,
 // PhraseQuery, or nothing based on the term count
 TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));
 Vector v = new Vector();
 org.apache.lucene.analysis.Token t;
 int positionCount = 0;
 boolean severalTokensAtSamePosition = false;

 while (true) {
   try {
     t = source.next();
   }
   catch (IOException e) {
     t = null;
   }
   if (t == null)
     break;
   v.addElement(t);
   if (t.getPositionIncrement() != 0)
     positionCount += t.getPositionIncrement();
   else
     severalTokensAtSamePosition = true;
 }
 try {
   source.close();
 }
 catch (IOException e) {
   // ignore
 }

 if (v.size() == 0)
   return null;
 else if (v.size() == 1) {
   t = (org.apache.lucene.analysis.Token) v.elementAt(0);
   return new TermQuery(new Term(field, t.termText()));
 } else {
   if (severalTokensAtSamePosition) {
     if (positionCount == 1) {
       // no phrase query:
       BooleanQuery q = new BooleanQuery(true);
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         TermQuery currentQuery = new TermQuery(
             new Term(field, t.termText()));
         q.add(currentQuery, BooleanClause.Occur.SHOULD);
       }
       return q;
     }
     else {
       // phrase query:
       MultiPhraseQuery mpq = new MultiPhraseQuery();
       mpq.setSlop(phraseSlop);
       List multiTerms = new ArrayList();
       int position = -1;
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
           if (enablePositionIncrements) {
             mpq.add((Term[])multiTerms.toArray(new Term[0]), position);
           } else {
             mpq.add((Term[])multiTerms.toArray(new Term[0]));
           }
           multiTerms.clear();
         }
         position += t.getPositionIncrement();
         multiTerms.add(new Term(field, t.termText()));
       }
       if (enablePositionIncrements) {
         mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
       } else {
         mpq.add((Term[])multiTerms.toArray(new Term[0]));
       }
       return mpq;
     }
   }
   else {
     PhraseQuery pq = new PhraseQuery();
     pq.setSlop(phraseSlop);
     int position = -1;
     for (int i = 0; i < v.size(); i++) {
       t = (org.apache.lucene.analysis.Token) v.elementAt(i);
       if (enablePositionIncrements) {
         position += t.getPositionIncrement();
         pq.add(new Term(field, t.termText()),position);
       } else {
         pq.add(new Term(field, t.termText()));
       }
     }
     return pq;
   }
 }
}
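
The branching in the method above is easier to see with the Lucene types stripped away. The following is a sketch only: QueryShapeSketch, Shape and choose are hypothetical names, and the int[] of position increments stands in for the Vector of Tokens.

```java
// Hypothetical reduction of getFieldQuery's branching; not Lucene code.
final class QueryShapeSketch {
    enum Shape { NONE, TERM, BOOLEAN, MULTI_PHRASE, PHRASE }

    /** Picks the query shape from the tokens' position increments. */
    static Shape choose(int[] positionIncrements) {
        int positionCount = 0;
        boolean severalTokensAtSamePosition = false;
        for (int increment : positionIncrements) {
            if (increment != 0) {
                positionCount += increment;
            } else {
                // An increment of 0 stacks this token on the previous position.
                severalTokensAtSamePosition = true;
            }
        }
        if (positionIncrements.length == 0) {
            return Shape.NONE;   // the original returns null
        }
        if (positionIncrements.length == 1) {
            return Shape.TERM;   // single TermQuery
        }
        if (severalTokensAtSamePosition) {
            // All tokens at one position: alternatives, not a phrase.
            return positionCount == 1 ? Shape.BOOLEAN : Shape.MULTI_PHRASE;
        }
        return Shape.PHRASE;
    }
}

class QueryShapeSketchDemo {
    public static void main(String[] args) {
        System.out.println(QueryShapeSketch.choose(new int[] {1, 0}));    // prints BOOLEAN
        System.out.println(QueryShapeSketch.choose(new int[] {1, 0, 1})); // prints MULTI_PHRASE
        System.out.println(QueryShapeSketch.choose(new int[] {1, 1}));    // prints PHRASE
    }
}
```

Every branch except the empty one builds at least one Term from a token's text, which is why the single method accounts for five termText() calls.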


Here is an example that works around the deprecated code:
public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
 Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
 searcher = setUpSearcher(analyzer);

 PhraseQuery q = new PhraseQuery();

 TokenStream ts = analyzer.tokenStream("content",
                                       new StringReader("this sentence"));
 Token token;
 int j = -1;
 while ((token = ts.next()) != null) {
   j += token.getPositionIncrement();
   String termText = new String(token.termBuffer(), 0, token.termLength());
   q.add(new Term("content", termText), j);
 }

 Hits hits = searcher.search(q);
 int[] ranks = new int[] { 0 };
 compareRanks(hits, ranks);
}

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
