[
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016580#comment-13016580
]
Uwe Schindler commented on SOLR-2400:
-------------------------------------
Hi Stefan,
sorry for missing your last response.
About the raw term: Solr currently shows the raw term only if the term is
binary (as with numeric fields) or is otherwise transformed by the FieldType
(as with the deprecated Sortable* fields). I only mentioned it as an example
of the attributes I was missing in your example output. It is of no use for
solving your problem.
I already mentioned:
{quote}One possibility to handle the thing might be the char offset in the
original text, because that the req handler may use the character offset of
begin and end of the token in the original stream instead of the token
position, but this is likely to break for lots of TokenFilters
(WordDelimiterFilter would work as long as you don't do stemming before...).
The problem is incorrect handling of offset calculation (also leading to bugs
in highlighting) when the inserted terms are longer than their originals.{quote}
This might be your only chance (using the OffsetAttribute), but it is likely to
break. What you want is not possible with the analysis API of Lucene, because
some information is missing: the absolute positions are not important for the
indexer and not needed during analysis, so TokenStreams don't preserve them.
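To illustrate the offset idea, here is a minimal plain-Java sketch. It does not
use the Lucene API: Tok, parents and the sample offsets are hypothetical and
only stand in for tokens carrying start/end character offsets. A token seen
after a filter is related to every token before the filter whose offset span
encloses it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these types are Lucene classes.
final class OffsetMatchSketch {

    // Hypothetical token: text plus start/end character offsets in the original input.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the indices of all tokens in 'before' whose offset span encloses 'after'.
    static List<Integer> parents(List<Tok> before, Tok after) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < before.size(); i++) {
            Tok b = before.get(i);
            if (b.start <= after.start && after.end <= b.end) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed tokenizer output for the input "Wi-Fi router".
        List<Tok> before = Arrays.asList(new Tok("Wi-Fi", 0, 5), new Tok("router", 6, 12));

        // A token produced by a splitting filter, with offsets inside "Wi-Fi".
        System.out.println(parents(before, new Tok("Fi", 3, 5)));

        // A token with offsets that fit no original span: no relation can be found.
        System.out.println(parents(before, new Tok("wireless", 0, 8)));
    }
}
```

With correct offsets the lookup recovers the relation. As soon as a filter
reports offsets outside the original span, the lookup returns nothing, which is
the kind of breakage described above.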
One way to preserve the original positions would be a trick in the analysis
RequestHandler: it could insert a fake TokenFilter directly after the
Tokenizer that adds an additional Attribute holding the absolute position
(incremented on each call to input.incrementToken()). This could be a hack to
achieve what you want.
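The trick can be simulated in plain Java as follows. This is only a sketch of
the idea, not Lucene code: Tok, tokenize, tagPositions, dropStopWords and
splitOnHyphen are hypothetical stand-ins for the Tokenizer output, the inserted
position-tagging filter, a stop filter and a WordDelimiter-like filter. The
point is that a tag assigned once, right after tokenization, survives later
filters that drop or split tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java simulation only; none of these types are Lucene classes.
final class AbsolutePositionSketch {

    // Hypothetical token: its text plus the absolute position it was tagged with.
    static final class Tok {
        final String text;
        final int absPos; // 0 means "not tagged yet"

        Tok(String text, int absPos) {
            this.text = text;
            this.absPos = absPos;
        }
    }

    // Stand-in for the Tokenizer: split on whitespace, tokens are untagged.
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        for (String part : input.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                out.add(new Tok(part, 0));
            }
        }
        return out;
    }

    // Stand-in for the fake filter placed directly after the Tokenizer:
    // a counter is incremented once per token and stored on that token.
    static List<Tok> tagPositions(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int counter = 0;
        for (Tok t : in) {
            counter++;
            out.add(new Tok(t.text, counter));
        }
        return out;
    }

    // Stand-in for a stop filter: dropped tokens simply disappear.
    static List<Tok> dropStopWords(List<Tok> in, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!stopWords.contains(t.text)) {
                out.add(t);
            }
        }
        return out;
    }

    // Stand-in for a splitting filter: every part inherits the tag of its source token.
    static List<Tok> splitOnHyphen(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (String part : t.text.split("-")) {
                if (!part.isEmpty()) {
                    out.add(new Tok(part, t.absPos));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = tagPositions(tokenize("the Wi-Fi router"));
        tokens = dropStopWords(tokens, Set.of("the"));
        tokens = splitOnHyphen(tokens);
        for (Tok t : tokens) {
            System.out.println(t.text + " <- original position " + t.absPos);
        }
    }
}
```

For the sample input, "Wi" and "Fi" both report original position 2 and
"router" reports 3, even though "the" was dropped and "Wi-Fi" was split. A
request handler could use such a tag to relate the output of each stage back to
the tokenizer output.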
Maybe I can help you. That needs some refactoring in the
AnalysisRequestHandlers, but it might be a good idea.
> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
> Key: SOLR-2400
> URL: https://issues.apache.org/jira/browse/SOLR-2400
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Reporter: Stefan Matheis (steffkes)
> Priority: Minor
> Attachments: 110303_FieldAnalysisRequestHandler_output.xml,
> 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML output (simplified example attached) is missing one small piece of
> information which could be very useful for building a nice analysis output,
> and that's the "token relation" (if there is a special/correct word for
> this, please correct me).
> Meaning that it is currently not possible to "follow" the analysis process
> completely when the Tokenizers/Filters drop tokens (e.g. StopWord) or split
> them into multiple tokens (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible
> to create an improved analysis page for the new Solr Admin (SOLR-2399) -
> short scribble attached