This was done under LUCENE-1333, in moving Token to the "reuse" API.
The general idea is to deprecate the String-based APIs (whose
performance got worse because of the switch to char[]), in favor of
the char[]-based APIs.
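To make the tradeoff concrete, the reuse pattern with the char[] APIs
looks roughly like this (a sketch against the 2.4 Token methods;
source, srcOffset, length, start and end are made-up variables for
illustration):

// Reuse one Token across calls; copy chars into its buffer
// instead of allocating a new String per token:
Token reusable = new Token();
reusable.clear();
char[] buf = reusable.resizeTermBuffer(length);
System.arraycopy(source, srcOffset, buf, 0, length);
reusable.setTermLength(length);
reusable.setStartOffset(start);
reusable.setEndOffset(end);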
That said, I do agree it's still useful to have String-based
variants for those cases that want convenience and don't care much
about performance.
However, we can't silently cause an existing API to have worse
performance (that's a break in back-compatibility). So I think we
should instead deprecate the old method and introduce a new one with
a clear warning about the performance penalty. E.g., we did this with
termText() (now deprecated in favor of term()). This way, on
upgrading, people have to confront the fact that the method's
performance is now worse than before.
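The shape of that change is roughly this (a sketch of the pattern,
not the actual 2.4 source):

/** @deprecated Use {@link #term()} instead; note that term()
 *  now pays the cost of converting the char[] to a String. */
public String termText() {
  return term();
}

/** Returns the term text, converting from the internal char[]. */
public String term() {
  return new String(termBuffer, 0, termLength);
}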
I'm not sure what ctor we could add that takes a String and doesn't
already exist.
Further... the changes for LUCENE-1422 (committed to 2.9-dev) make
this question moot since the whole Token class is now deprecated.
Mike
Shai Erera wrote:
Hi
I moved to Lucene 2.4 and noticed that some of the Token
constructors were marked deprecated. Specifically, I'm talking about
Token(String, int, int), where the String is the word to populate the
token with, and the two ints are the startOffset and endOffset,
respectively.
That was an amazingly convenient constructor. I understand that its
use is discouraged, and that the one that accepts a char[] is better,
but there are cases, during indexing, where all you have at hand is a
String, not a char[].
For example, suppose that you want to add a certain token to a
document. You can do this by adding a Field with a TokenStream,
where the TS will create a new Token(), populate it with the value
to add, and return it.
Before 2.4, the code was simply:
return new Token(word, start, end);
After Lucene 2.4, the code looks like this:
Token t = new Token();
t.setTermBuffer(word, 0, word.length());
t.setStartOffset(start);
t.setEndOffset(end);
return t;
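For completeness, here is roughly the enclosing stream (a
hypothetical minimal version; only the next() body above is my real
code):

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

class SingleTokenStream extends TokenStream {
  private final String word;
  private final int start, end;
  private boolean done = false;

  SingleTokenStream(String word, int start, int end) {
    this.word = word;
    this.start = start;
    this.end = end;
  }

  // Returns the single token on the first call, null afterwards.
  public Token next() {
    if (done)
      return null;
    done = true;
    Token t = new Token();
    t.setTermBuffer(word, 0, word.length());
    t.setStartOffset(start);
    t.setEndOffset(end);
    return t;
  }
}

It gets attached to the document with something like
new Field(name, new SingleTokenStream(word, start, end)).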
Instead of a one-liner, I now have to write 5 (!) lines of code
whenever I want to do something like this. And ... the fact that I
can call setTermBuffer(String, int, int) (as I do in the 2nd line of
the code) means the deprecation does not prevent me from using
Strings at all. The only thing it achieves is complicating the code
developers need to write.
IMO, there is a huge difference between removing a convenient, yet
inefficient, method and simply documenting that its use is
discouraged. After all, if I choose to use it at my own expense, I
can do so and face whatever consequences there are.
In fact, when I moved to setTermBuffer, I actually introduced a bug
in my code. The reason is that setTermBuffer accepts two integers
which specify an offset and length into the given buffer, rather than
the start and end offsets I used to pass to the Token(String, int,
int) constructor. That's confusing.
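To illustrate the trap with hypothetical values (start and end are
offsets into the original text, not into word):

t.setTermBuffer(word, start, end);        // WRONG: treats (start, end)
                                          // as (offset, length) into word
t.setTermBuffer(word, 0, word.length());  // right: copy the whole word
t.setStartOffset(start);                  // token offsets go here instead
t.setEndOffset(end);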
And while I'm raising the deprecation issue: the method termText(),
which returns a String, was also a convenient (even though
inefficient) method for all kinds of purposes, among them debugging
and printing. But not only that - Java makes use of String in so many
places that it's really hard to stick with a char[] for long, as soon
as you start mixing Lucene code with other Java data structures. So
instead of calling termText() (knowing it's inefficient, and even
documenting that), I now have to write new String(t.termBuffer(), 0,
t.termLength()).
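If you need this in many places, a trivial helper keeps the call
sites readable (a hypothetical utility method, not a Lucene API):

static String tokenText(Token t) {
  return new String(t.termBuffer(), 0, t.termLength());
}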
I would like to ask the developer community: what is the strategy
for deprecating methods? Again, we should document when certain
methods are inefficient, rather than deprecate them and thus force
people to write more cumbersome code.
A good example of a justified deprecation is the
IndexWriter.docCount() method, which recommends using maxDoc() or
numDocs() (if you want to take deletions into account). There are two
reasons it's justified:
1. Replacing docCount() with maxDoc() does not complicate my code.
2. docCount() is confusing (is it maxDoc() or numDocs()?).
On that note, the deprecation of TokenStream.next() simply forces
people who want to store the Tokens a TokenStream outputs to call
TokenStream.next(new Token()). Not a huge inconvenience, but IMO an
unnecessary deprecation.
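Concretely, the storing pattern looks like this (a sketch; stream
stands for any TokenStream):

import java.util.ArrayList;
import java.util.List;

// Pass a fresh Token on each call, so a stored instance is never
// overwritten when the producer reuses the one it was handed:
List tokens = new ArrayList();
for (Token t = stream.next(new Token()); t != null;
     t = stream.next(new Token())) {
  tokens.add(t);
}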
I'm not sure this is new stuff to you, and perhaps it was even
raised on this list before. I'm also aware of the efforts to
completely change the tokenization process in Lucene. However, I
think that if we could undeprecate some of the deprecated methods,
we'd do a great service to the developer community.
Shai