Steve Rowe created LUCENE-6987:
----------------------------------
Summary: Clarify TokenStream workflow documentation
Key: LUCENE-6987
URL: https://issues.apache.org/jira/browse/LUCENE-6987
Project: Lucene - Core
Issue Type: Task
Reporter: Steve Rowe
On SOLR-4619, [~rcmuir] noted:
{quote}
According to TokenStream's class javadocs:
{quote}
The workflow of the new TokenStream API is as follows:
1. Instantiation of TokenStream/TokenFilters which add/get attributes to/from
the AttributeSource.
2. The consumer calls reset().
3. The consumer retrieves attributes from the stream and stores local
references to all attributes it wants to access.
{quote}
So we have consumers (such as QueryBuilder) doing things out of order: they
do step 3 before they do step 2.
My question is, can we detect this in tests? If MockAnalyzer can enforce it, it
would be easier to fix it consistently everywhere. One idea: what if
MockTokenizer deferred initializing its attributes until reset()? It's not
going to be the best (we would need to tie it into its state machine logic
somehow for that), but it might be an easy first step.
Also, the majority of TokenFilters (which basically serve as consumers too)
are doing step 3 before step 2 today. Most of them are just assigning to final
variables in their constructor.
So something is off: we gotta go one of two ways. Either we fix the
documentation to put step 3 before step 2 \[...], or we make a massive change
to tons of tokenizers (making them more complex and less efficient).
But I think we have to do something. At the very least we should fix the docs
to be clear: they need to reflect reality.
{quote}
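
To make the ordering problem concrete, here is a rough, self-contained sketch of the "defer attribute initialization until reset()" idea. It does not use Lucene at all: ToyTermAttribute, StrictToyTokenStream and WorkflowSketch are hypothetical names invented for this illustration, and the real MockTokenizer may need a different approach (for example, tying the check into its existing state machine). The only point is that a stream which withholds its attributes until reset() makes a "step 3 before step 2" consumer fail fast in tests.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a term attribute; NOT Lucene's CharTermAttribute.
final class ToyTermAttribute {
    private String term = "";
    void set(String t) { term = t; }
    String get() { return term; }
}

// Hypothetical stream whose attribute does not exist until reset() is called,
// so a consumer that asks for it too early fails immediately.
final class StrictToyTokenStream {
    private final String[] tokens;
    private ToyTermAttribute termAtt; // created lazily by reset()
    private int pos;

    StrictToyTokenStream(String text) {
        String trimmed = text.trim();
        // Splitting an empty string would yield one empty token, so special-case it.
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Step 2 of the documented workflow.
    void reset() {
        if (termAtt == null) {
            termAtt = new ToyTermAttribute();
        }
        pos = 0;
    }

    // Step 3 of the documented workflow; rejected if reset() has not run yet.
    ToyTermAttribute getTermAttribute() {
        if (termAtt == null) {
            throw new IllegalStateException("attribute requested before reset()");
        }
        return termAtt;
    }

    boolean incrementToken() {
        if (termAtt == null) {
            throw new IllegalStateException("incrementToken() called before reset()");
        }
        if (pos >= tokens.length) {
            return false;
        }
        termAtt.set(tokens[pos++]);
        return true;
    }
}

final class WorkflowSketch {
    // Documented order: reset() first (step 2), then fetch attributes (step 3).
    static List<String> consumeInDocumentedOrder(StrictToyTokenStream ts) {
        ts.reset();
        ToyTermAttribute term = ts.getTermAttribute();
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(term.get());
        }
        return out;
    }

    // Order described in the quote above: attributes first, reset() second.
    static List<String> consumeAttributesFirst(StrictToyTokenStream ts) {
        ToyTermAttribute term = ts.getTermAttribute(); // throws in this toy model
        ts.reset();
        List<String> out = new ArrayList<>();
        while (ts.incrementToken()) {
            out.add(term.get());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(consumeInDocumentedOrder(new StrictToyTokenStream("quick brown fox")));
        try {
            consumeAttributesFirst(new StrictToyTokenStream("quick brown fox"));
        } catch (IllegalStateException e) {
            System.out.println("detected out-of-order consumer: " + e.getMessage());
        }
    }
}
```

In this toy model a strict stream would also reject every filter that grabs its attributes in the constructor, which is the trade-off described above: either the docs change to allow attribute lookup before reset(), or a lot of existing code has to change.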