I found that there's an open issue LUCENE-2191 about renaming reset(Reader) to setReader(), due to confusing APIs. I didn't see any buggy behavior mentioned there, but I sense that reset() and reset(Reader) confused other people too :). So perhaps this fix (if indeed others will agree it needs to be fixed) can happen under that issue.
Shai On Sun, May 13, 2012 at 11:32 AM, Shai Erera <[email protected]> wrote: > Hi > > Someone asked why the following tiny test does not work: > > String text = "Hello world1. Hello world2"; > Tokenizer tokenizer = WhitespaceTokenizer(Version.LUCENE_36, new > StringReader(text)); > int count = 0; > while (tokenizer.incrementToken()) { > count++; > } > assertEquals(4,count); > > // expecting reset() to do what it states: > tokenizer.reset(); > > count = 0; > while (tokenizer.incrementToken()) { > count++; > } > assertEquals(4,count); // HERE IT FAILS > > The answer was easy -- WhitespaceTokenizer (or Tokenizer) do not implement > reset() in any special way. But then when we reviewed the javadocs of > reset() in 3.6 and trunk, I became confused myself: > > In 3.6, it's mentioned that the method resets the stream to the beginning, > however it is optional and sub-classes may or may not implement this. > > In trunk it doesn't say 'optionally' but rather "Resets this stream to the > beginning". > > I'm not getting into text semantics, but rather want to ask why wouldn't > Tokenizer override reset() to call input.reset()? Reader has a reset() > method and some Readers, like StringReader, even implement it properly. We > can still say in the jdocs that reset() depends on the Reader.reset() > implementation and it may or may not reset the stream. > > I don't know if it's a bug that Tokenizer.reset() doesn't do that, or not > -- I'm sure someone like Robert or Uwe know the answer :). > > Alternatively, wouldn't it be better if reset() threw > UnsupportedOperationException in TokenStream and any other Tokenizer that > cannot support it? Otherwise, you have no way telling whether reset() is > supported or not, besides reading the Tokenizer's code. Maybe in trunk we > should just make TokenStream.reset() abstract? > > BTW, even reading the code does not always help -- WikipediaTokenizer does > implement reset() by calling super.reset() + scanner.reset(), but neither > call input.reset() ... so I'm not sure if WikiTokenizer is buggy or not (so > maybe it does help to read the code :)). > > Shai >
