[
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491
]
Uwe Schindler commented on SOLR-2400:
-------------------------------------
Stefan, this is a general issue with TokenStreams that add tokens. TokenStreams
that remove tokens *should* automatically preserve positions, but not even all
of those do that correctly (we have been fixing some of them lately). The way
Lucene analysis works makes it impossible to guarantee any correspondence
between the position numbers of different stages. For the indexer, only what
comes out at the end is important; the steps in between are impossible to
track. AnalysisReqHandler, on the other hand, does some bad "hacks" to look
"inside" the analysis (by using temporary TokenStreams that buffer tokens),
which is not the general use case of TokenStreams.
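
To illustrate the position problem, here is a minimal sketch in plain Python.
It is only a model of position increments, not the Lucene API: the names
Token, positions, stop_filter and split_filter are made up for this example,
and split_filter is a much simplified stand-in for something like
WordDelimiterFilter.

```python
# Pure-Python model of position increments; this is NOT the Lucene API.
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos_inc: int = 1  # distance to the previous token (position increment)


def positions(tokens):
    """Turn position increments into absolute positions."""
    result, pos = [], 0
    for tok in tokens:
        pos += tok.pos_inc
        result.append((tok.text, pos))
    return result


def stop_filter(tokens, stopwords):
    """Removing filter: carries the skipped increments over to the next token."""
    out, skipped = [], 0
    for tok in tokens:
        if tok.text in stopwords:
            skipped += tok.pos_inc
        else:
            out.append(Token(tok.text, tok.pos_inc + skipped))
            skipped = 0
    return out


def split_filter(tokens, sep="-"):
    """Adding filter: every part of a split token takes a position of its own."""
    out = []
    for tok in tokens:
        parts = tok.text.split(sep)
        out.append(Token(parts[0], tok.pos_inc))
        out.extend(Token(part, 1) for part in parts[1:])
    return out


tokens = [Token(t) for t in "the wi-fi router".split()]
after_stop = stop_filter(tokens, {"the"})
after_split = split_filter(after_stop)

print(positions(tokens))       # [('the', 1), ('wi-fi', 2), ('router', 3)]
print(positions(after_stop))   # [('wi-fi', 2), ('router', 3)]
print(positions(after_split))  # [('wi', 2), ('fi', 3), ('router', 4)]
```

The removing filter keeps "router" at position 3, but after the adding filter
position 3 belongs to "fi" and "router" has moved to 4. Position numbers alone
therefore do not say which token of an earlier stage a later token came from.
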
I wonder a little bit about your XML file: it only contains text and position,
but it should also contain rawTerm, startOffset, endOffset. When I call
analysis, I get all of those attributes, not only two of them. Is this a
hand-made file, or what is the problem? Which Solr version?
One possibility to handle this might be to use the character offsets in the
original text instead of the token position, because they should point to the
beginning and end of the token in the original stream. But this is likely to
break for lots of TokenFilters (WordDelimiterFilter would work as long as you
don't do stemming before...). The problem is incorrect handling of offset
calculation (also leading to bugs in highlighting) when the inserted terms are
longer than their originals.
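
A rough sketch of the offset idea, again in plain Python and not against the
Lucene or Solr API: the helper names tok and related_tokens are made up, and
the dictionaries only mirror the text, startOffset and endOffset values of the
analysis output. It assumes every filter reports correct offsets, which is
exactly the part that cannot be relied on.

```python
# Pure-Python model; tok() and related_tokens() are hypothetical helpers.
def tok(text, start, end):
    """One token with character offsets into the original text."""
    return {"text": text, "start": start, "end": end}


def related_tokens(before, after):
    """Pair each output token with the input tokens whose offsets overlap it."""
    relation = []
    for out in after:
        sources = [
            src["text"]
            for src in before
            if src["start"] < out["end"] and out["start"] < src["end"]
        ]
        relation.append((out["text"], sources))
    return relation


# Offsets into the original text "the wi-fi router"
stage1 = [tok("the", 0, 3), tok("wi-fi", 4, 9), tok("router", 10, 16)]
stage2 = [tok("wi", 4, 6), tok("fi", 7, 9), tok("router", 10, 16)]

print(related_tokens(stage1, stage2))
# [('wi', ['wi-fi']), ('fi', ['wi-fi']), ('router', ['router'])]
```

An input token that no output token points back to ("the" here) was dropped.
As soon as a filter reports offsets that no longer match the original text,
the overlap test pairs the wrong tokens or none at all.
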
Altogether: it's unlikely that you can implement this so that it works for all
combinations of TokenStream components.
> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
> Key: SOLR-2400
> URL: https://issues.apache.org/jira/browse/SOLR-2400
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Reporter: Stefan Matheis (steffkes)
> Priority: Minor
> Attachments: 110303_FieldAnalysisRequestHandler_output.xml,
> 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML-Output (simplified example attached) is missing one small piece of
> information which could be very useful to build a nice Analysis-Output, and
> that's "Token-Relation" (if there is a special/correct word for this, please
> correct me).
> Meaning that it is actually not possible to "follow" the Analysis-Process
> (completely) while the Tokenizers/Filters drop Tokens (e.g. StopWord)
> or split them into multiple Tokens (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible
> to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) -
> short scribble attached