Or we could leave termText() deprecated, add term(), which does the same thing sub-optimally (i.e., always creates a new String from the char[]), and state in the javadocs for termText() that you can migrate either to term() (if you really want a String and you understand the performance cost of doing so) or to the re-use APIs?
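To make that option concrete, here is a rough sketch of how the three access paths could sit side by side. SketchToken and setTermBuffer are simplified stand-ins written for this illustration, not the real org.apache.lucene.analysis.Token, and the javadoc wording is only a suggestion. It also uses the @Deprecated annotation, which assumes a Java 5 or later compiler.

```java
// Simplified stand-in for Lucene's Token; NOT the real class.
class SketchToken {
    private char[] termBuffer = new char[16];
    private int termLength = 0;

    /** Copies text into the reusable buffer, growing the buffer if needed. */
    void setTermBuffer(char[] source, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(source, offset, termBuffer, 0, length);
        termLength = length;
    }

    /** Re-use API: only the first termLength() chars of the result are valid. */
    char[] termBuffer() {
        return termBuffer;
    }

    /** Re-use API: number of valid chars in termBuffer(). */
    int termLength() {
        return termLength;
    }

    /**
     * Returns the term as a String. Allocates a new String on every call,
     * so prefer termBuffer()/termLength() in performance-sensitive code.
     */
    String term() {
        return new String(termBuffer, 0, termLength);
    }

    /**
     * @deprecated Use term() if you really need a String and accept the
     *             allocation cost, or termBuffer()/termLength() to avoid it.
     */
    @Deprecated
    String termText() {
        return term();
    }
}
```

Callers that flag deprecated uses would still see every termText() call, and the javadoc would point them at both replacements.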

Mike

DM Smith wrote:

Michael McCandless wrote:

Maybe we should un-deprecate the termText() method but add javadocs explaining that for better performance you should use the char[] reuse methods instead?
I think so, too. Should we leave it as deprecated until 3.0? With the performance note and the encouragement to go for re-use, but also with a note that it is the current implementation that is deprecated, not the interface.

That's not quite what deprecated means. My thought on this is that it will give everyone a heads up that the current implementation is going away and that the replacement is sub-optimal.

(I use Eclipse and have it set to flag all deprecated uses. This helps me look for places to change.)

I think that this will make migration to 3.0 much easier.

With this, changing Term to add Term(String, Token) won't be necessary.

-- DM

Mike

DM Smith wrote:

Michael McCandless wrote:

DM Smith wrote:

Shouldn't Term have constructors that take a Token?

I think that makes sense, though normally Token appears during analysis and Term during searching (I think?) -- how often would you need to make a Term from a Token?

The problem I'm addressing is that tokens are used in contexts that need String and not char[].
The call to the deprecated
String termText = token.termText();
needs to be replaced with:
String termText = new String(token.termBuffer(), 0, token.termLength());

There are over 170 calls to token.termText(), and each of these places has to be modified. In some, perhaps many, of these cases it may be possible to use the char[] directly to get a performance gain.
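As a rough illustration of what "use the char[] directly" could look like at such a call site, the sketch below works on the buffer and length that token.termBuffer() and token.termLength() would supply. TermBufferSketch and its two helper methods are hypothetical names made up for this example; they are not part of Lucene.

```java
// Hypothetical helpers; the buffer and length are assumed to come from
// token.termBuffer() and token.termLength().
final class TermBufferSketch {

    private TermBufferSketch() {
    }

    /** Lower-cases the first length chars of buffer in place; no String is allocated. */
    static void lowerCaseInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
    }

    /** Compares the valid part of the buffer with word, char by char. */
    static boolean matches(char[] buffer, int length, String word) {
        if (word.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buffer[i] != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

A filter could then call TermBufferSketch.lowerCaseInPlace(token.termBuffer(), token.termLength()) without ever building a String. Call sites that hand the text to an API taking a String, such as the Term constructor, would still need the conversion.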

In the case of Term, changing it to work with (char[] buffer, int start, int length) does not seem quite right. I think the ripple would keep getting bigger. But logically, the Term's text is the text of a Token.
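For what it is worth, a Token-taking constructor could be as small as the sketch below. MiniToken and MiniTerm are cut-down stand-ins invented for this illustration; the real Term has no such constructor, and this only shows the shape the idea might take.

```java
// Stand-ins for Lucene's Token and Term, reduced to what the sketch needs.
final class MiniToken {
    private final char[] buffer;
    private final int length;

    MiniToken(char[] buffer, int length) {
        this.buffer = buffer;
        this.length = length;
    }

    char[] termBuffer() {
        return buffer;
    }

    int termLength() {
        return length;
    }
}

final class MiniTerm {
    private final String field;
    private final String text;

    MiniTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    /** Hypothetical convenience constructor: copies the token's text exactly once. */
    MiniTerm(String field, MiniToken token) {
        this(field, new String(token.termBuffer(), 0, token.termLength()));
    }

    String field() {
        return field;
    }

    String text() {
        return text;
    }
}
```

The copy is taken at construction time, so the term stays valid even if the token's buffer is re-used for the next token. The char[]-based signature never leaks into callers, which keeps the ripple contained.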

To me it makes sense to have a method that returns the token as a String, but that method is deprecated and the suggested replacement is to directly use the buffer. So this leads to the above construct. Perhaps it would be good to add a new method and document that as one of two replacements.
public String term() {
  // Inside Token itself: return the cached String if present, else build one from the buffer.
  return termText != null ? termText : new String(termBuffer(), 0, termLength());
}

Here is an example from QueryParser that has 5 instances, each calling the deprecated t.termText() method. In this example, there is the construction of a query from a token stream.
Each of the problem lines is of the pattern:
TermQuery currentQuery = new TermQuery(new Term(field, t.termText()));

To remove the deprecated call to t.termText(), the Token's buffer needs to be marshalled with something like:
String termText = new String(t.termBuffer(), 0, t.termLength());
TermQuery currentQuery = new TermQuery(new Term(field, termText));

/**
* @exception ParseException throw in overridden method to disallow
*/
protected Query getFieldQuery(String field, String queryText) throws ParseException {
 // Use the analyzer to get all the tokens, and then build a TermQuery,
 // PhraseQuery, or nothing based on the term count

 TokenStream source = analyzer.tokenStream(field, new StringReader(queryText));
 Vector v = new Vector();
 org.apache.lucene.analysis.Token t;
 int positionCount = 0;
 boolean severalTokensAtSamePosition = false;

 while (true) {
   try {
     t = source.next();
   }
   catch (IOException e) {
     t = null;
   }
   if (t == null)
     break;
   v.addElement(t);
   if (t.getPositionIncrement() != 0)
     positionCount += t.getPositionIncrement();
   else
     severalTokensAtSamePosition = true;
 }
 try {
   source.close();
 }
 catch (IOException e) {
   // ignore
 }

 if (v.size() == 0)
   return null;
 else if (v.size() == 1) {
   t = (org.apache.lucene.analysis.Token) v.elementAt(0);
   return new TermQuery(new Term(field, t.termText()));
 } else {
   if (severalTokensAtSamePosition) {
     if (positionCount == 1) {
       // no phrase query:
       BooleanQuery q = new BooleanQuery(true);
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         TermQuery currentQuery = new TermQuery(
             new Term(field, t.termText()));
         q.add(currentQuery, BooleanClause.Occur.SHOULD);
       }
       return q;
     }
     else {
       // phrase query:
       MultiPhraseQuery mpq = new MultiPhraseQuery();
       mpq.setSlop(phraseSlop);
       List multiTerms = new ArrayList();
       int position = -1;
       for (int i = 0; i < v.size(); i++) {
         t = (org.apache.lucene.analysis.Token) v.elementAt(i);
         if (t.getPositionIncrement() > 0 && multiTerms.size() > 0) {
           if (enablePositionIncrements) {
             mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
           } else {
             mpq.add((Term[])multiTerms.toArray(new Term[0]));
           }
           multiTerms.clear();
         }
         position += t.getPositionIncrement();
         multiTerms.add(new Term(field, t.termText()));
       }
       if (enablePositionIncrements) {
         mpq.add((Term[])multiTerms.toArray(new Term[0]),position);
       } else {
         mpq.add((Term[])multiTerms.toArray(new Term[0]));
       }
       return mpq;
     }
   }
   else {
     PhraseQuery pq = new PhraseQuery();
     pq.setSlop(phraseSlop);
     int position = -1;
     for (int i = 0; i < v.size(); i++) {
       t = (org.apache.lucene.analysis.Token) v.elementAt(i);
       if (enablePositionIncrements) {
         position += t.getPositionIncrement();
         pq.add(new Term(field, t.termText()),position);
       } else {
         pq.add(new Term(field, t.termText()));
       }
     }
     return pq;
   }
 }
}


Here is an example that works around the deprecated code:
public void testShingleAnalyzerWrapperPhraseQuery() throws Exception {
 Analyzer analyzer = new ShingleAnalyzerWrapper(new WhitespaceAnalyzer(), 2);
 searcher = setUpSearcher(analyzer);

 PhraseQuery q = new PhraseQuery();

 TokenStream ts = analyzer.tokenStream("content",
     new StringReader("this sentence"));
 Token token;
 int j = -1;
 while ((token = ts.next()) != null) {
   j += token.getPositionIncrement();
   String termText = new String(token.termBuffer(), 0, token.termLength());
   q.add(new Term("content", termText), j);
 }

 Hits hits = searcher.search(q);
 int[] ranks = new int[] { 0 };
 compareRanks(hits, ranks);
}

-- DM

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


