This was done under LUCENE-1333, in moving Token to the "reuse" API.
The general idea is to deprecate the String-based APIs (whose
performance got worse because of the switch to char[]), in favor of
the char[]-based APIs.
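To make the tradeoff concrete, the reuse pattern with the char[] APIs
looks roughly like this (a sketch against the 2.4 Token methods;
source, srcOffset, length, start and end are made-up variables for
illustration):

// Reuse one Token across calls; copy chars into its buffer
// instead of allocating a new String per token:
Token reusable = new Token();
reusable.clear();
char[] buf = reusable.resizeTermBuffer(length);
System.arraycopy(source, srcOffset, buf, 0, length);
reusable.setTermLength(length);
reusable.setStartOffset(start);
reusable.setEndOffset(end);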
That said, I do agree it's still useful to have String-based
variants for those cases that want convenience and don't care much
about performance.
However, we can't silently cause an existing API to have worse
performance (that's a break in back-compatibility). So I think we
should instead deprecate the old method and introduce a new one with
a clear warning about the performance penalty. E.g., we did this with
termText() (now deprecated in favor of term()). This way, on
upgrading, people have to confront the fact that the method's
performance is now worse than before.
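The shape of that change is roughly this (a sketch of the pattern,
not the actual 2.4 source):

/** @deprecated Use {@link #term()} instead; note that term()
 *  now pays the cost of converting the char[] to a String. */
public String termText() {
  return term();
}

/** Returns the term text, converting from the internal char[]. */
public String term() {
  return new String(termBuffer, 0, termLength);
}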
I'm not sure what ctor we could add that takes a String and doesn't
already exist.
Further... the changes for LUCENE-1422 (committed to 2.9-dev) make
this question moot since the whole Token class is now deprecated.
Mike
Shai Erera wrote:
Hi
I moved to Lucene 2.4 and noticed that some of the Token
constructors were marked deprecated. Specifically, I'm talking about
Token(String, int, int), where the String is the word to populate the
token with, and the two ints are the startOffset and endOffset,
respectively.
That was an amazingly convenient constructor. I understand that its
use is discouraged, and that the one that accepts a char[] is better,
but there are cases, during indexing, where all you have at hand is a
String, not a char[].
For example, suppose that you want to add a certain token to a
document. You can do this by adding a Field with a TokenStream,
where the TS will create a new Token(), populate it with the value
to add, and return it.
Before 2.4, the code was simply:
return new Token(word, start, end);
After Lucene 2.4, the code looks like this:
Token t = new Token();
t.setTermBuffer(word, 0, word.length());
t.setStartOffset(start);
t.setEndOffset(end);
return t;
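For completeness, here is roughly the enclosing stream (a
hypothetical minimal version; only the next() body above is my real
code):

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

class SingleTokenStream extends TokenStream {
  private final String word;
  private final int start, end;
  private boolean done = false;

  SingleTokenStream(String word, int start, int end) {
    this.word = word;
    this.start = start;
    this.end = end;
  }

  // Returns the single token on the first call, null afterwards.
  public Token next() {
    if (done)
      return null;
    done = true;
    Token t = new Token();
    t.setTermBuffer(word, 0, word.length());
    t.setStartOffset(start);
    t.setEndOffset(end);
    return t;
  }
}

It gets attached to the document with something like
new Field(name, new SingleTokenStream(word, start, end)).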
Instead of a one-liner, I now have to write 5 (!) lines of code
whenever I want to do something like this. And ... the fact that I
can call setTermBuffer(String, int, int) (as I do in the 2nd line of
the code) means the deprecation does not prevent me from using
Strings at all. The only thing it achieves is complicating the code
developers need to write.
IMO, there is a huge difference between removing a convenient, yet
inefficient, method and simply documenting that its use is
discouraged. After all, if I choose to use it at my own expense, I
can do so and face whatever consequences there are.
In fact, when I moved to setTermBuffer, I actually introduced a bug
in my code. The reason is that setTermBuffer accepts two integers
which specify an offset and length into the given buffer, rather than
the start and end offsets I used to pass to the Token(String, int,
int) constructor. That's confusing.
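To illustrate the trap with hypothetical values (start and end are
offsets into the original text, not into word):

t.setTermBuffer(word, start, end);        // WRONG: treats (start, end)
                                          // as (offset, length) into word
t.setTermBuffer(word, 0, word.length());  // right: copy the whole word
t.setStartOffset(start);                  // token offsets go here instead
t.setEndOffset(end);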
And while I'm raising the deprecation issue: the method termText(),
which returns a String, was also a convenient (even though
inefficient) method for all kinds of purposes, among them debugging
and printing. But not only that - Java makes use of String in so many
places that it's really hard to stick with a char[] for long, as soon
as you start mixing Lucene code with other Java data structures. So
instead of calling termText() (knowing it's inefficient, and even
documenting that), I now have to write new String(t.termBuffer(), 0,
t.termLength()).
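If you need this in many places, a trivial helper keeps the call
sites readable (a hypothetical utility method, not a Lucene API):

static String tokenText(Token t) {
  return new String(t.termBuffer(), 0, t.termLength());
}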
I would like to ask the developer community: what is the strategy
for deprecating methods? Again, we should document when certain
methods are inefficient, rather than deprecate them and thus force
people to write more cumbersome code.
A good example of a justified deprecation is the
IndexWriter.docCount() method, which recommends using maxDoc() or
numDocs() (if you want to take deletions into account). There are two
reasons it's justified:
1. Replacing docCount() with maxDoc() does not complicate my code.
2. docCount() is confusing (is it maxDoc() or numDocs()?).
On that note, the deprecation of TokenStream.next() simply forces
people who want to store the Tokens a TokenStream outputs to call
TokenStream.next(new Token()). Not a huge inconvenience, but IMO an
unnecessary deprecation.
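Concretely, the storing pattern looks like this (a sketch; stream
stands for any TokenStream):

import java.util.ArrayList;
import java.util.List;

// Pass a fresh Token on each call, so a stored instance is never
// overwritten when the producer reuses the one it was handed:
List tokens = new ArrayList();
for (Token t = stream.next(new Token()); t != null;
     t = stream.next(new Token())) {
  tokens.add(t);
}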
I'm not sure this is new stuff to you, and perhaps it was even
raised on this list before. I'm also aware of the efforts to
completely change the tokenization process in Lucene. However, I
think that if we could undeprecate some of the deprecated methods,
we'd do a great service to the developer community.
Shai