[ https://issues.apache.org/jira/browse/LUCENE-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608354#comment-14608354 ]
Adrien Grand commented on LUCENE-6212:
--------------------------------------
bq. There are perfectly valid use cases to use a different Analyzer at query
time rather than indexing time
This change doesn't force you to use the same analyzer at index time and search
time; it only requires that every document be indexed with the same analyzer.
bq. it's also possible to have text of different sources which has been
pre-processed in different ways, so needs to be tokenized differently to get a
consistent output
One way that this feature was misused was to handle multilingual content, but
this breaks term statistics: different words can be filtered to the same stem,
and a single word can be filtered to two different stems depending on the
language. In general, if different analysis chains are required, it's better to
use different fields or even different indices.
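
To make the term statistics point concrete, here is a toy sketch in plain Java
with no Lucene dependency. Everything in it is hypothetical: the two "stemmers"
are crude suffix strippers standing in for real per-language analysis chains,
and the field names body, body_en and body_fr are made up for illustration. It
only shows how document frequencies get mixed up when two chains write into one
field, and how one field per language keeps them apart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

class PerLanguageFieldsSketch {

    // Toy stemmers (assumptions, not real analyzers): strip simple plural suffixes.
    static String stemEnglish(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    static String stemFrench(String token) {
        if (token.endsWith("es")) {
            return token.substring(0, token.length() - 2);
        }
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    // field -> term -> ids of the documents that contain the term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Tokenize on whitespace, stem each token, and record it under the given field.
    void add(String field, int docId, String text, UnaryOperator<String> stemmer) {
        Map<String, Set<Integer>> terms =
            postings.computeIfAbsent(field, f -> new HashMap<>());
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            terms.computeIfAbsent(stemmer.apply(token), t -> new HashSet<>()).add(docId);
        }
    }

    // Number of documents containing the term in the field (0 if unknown).
    int docFreq(String field, String term) {
        Map<String, Set<Integer>> terms = postings.get(field);
        if (terms == null) {
            return 0;
        }
        Set<Integer> docs = terms.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Both languages in one field, each document analyzed with its own chain.
    static PerLanguageFieldsSketch sharedField() {
        PerLanguageFieldsSketch index = new PerLanguageFieldsSketch();
        index.add("body", 0, "back pains", PerLanguageFieldsSketch::stemEnglish);
        index.add("body", 1, "pain frais", PerLanguageFieldsSketch::stemFrench);
        index.add("body", 2, "coal mines", PerLanguageFieldsSketch::stemEnglish);
        index.add("body", 3, "mines anciennes", PerLanguageFieldsSketch::stemFrench);
        return index;
    }

    // Same documents, but one field per language, so each field has one chain.
    static PerLanguageFieldsSketch separateFields() {
        PerLanguageFieldsSketch index = new PerLanguageFieldsSketch();
        index.add("body_en", 0, "back pains", PerLanguageFieldsSketch::stemEnglish);
        index.add("body_fr", 1, "pain frais", PerLanguageFieldsSketch::stemFrench);
        index.add("body_en", 2, "coal mines", PerLanguageFieldsSketch::stemEnglish);
        index.add("body_fr", 3, "mines anciennes", PerLanguageFieldsSketch::stemFrench);
        return index;
    }
}
```

In the shared field, the English "pains" and the French "pain" (bread) end up
under the same term, so its document frequency counts unrelated documents,
while the word "mines" is split across two terms depending on which chain saw
it. With one field per language, each field's statistics come from a single
analysis chain.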
> Remove IndexWriter's per-document analyzer add/updateDocument APIs
> ------------------------------------------------------------------
>
> Key: LUCENE-6212
> URL: https://issues.apache.org/jira/browse/LUCENE-6212
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: 5.0, 5.1, Trunk
>
> Attachments: LUCENE-6212.patch
>
>
> IndexWriter already takes an analyzer up-front (via
> IndexWriterConfig), but it also allows you to specify a different one
> for each add/updateDocument.
> I think this is quite dangerous/trappy since it means you can easily
> index tokens for that document that don't match at search-time based
> on the search-time analyzer.
> I think we should remove this trap in 5.0.