[jira] [Commented] (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

Uwe Schindler (JIRA) Wed, 06 Apr 2011 15:18:45 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016580#comment-13016580
 ]


Uwe Schindler commented on SOLR-2400:
-------------------------------------

Hi Stefan,

Sorry for missing your last response.

About the raw term: Solr currently shows the raw term only if the term is 
binary (like numerics) or if the FieldType does some transformation (as with 
the deprecated Sortable* fields). I only mentioned it as an example of the 
attributes I was missing in your example output. It is of no use for solving 
your problem.

I already mentioned:
{quote}One possibility to handle the thing might be the char offset in the 
original text, because that the req handler may use the character offset of 
begin and end of the token in the original stream instead of the token 
position, but this is likely to break for lots of TokenFilters 
(WordDelimiterFilter would work as long as you don't do stemming before...). 
The problem is incorrect handling of offset calculation (also leading to bugs 
in highlighting) when the inserted terms are longer than their originals.{quote}

This (using the OffsetAttribute) might be your only chance, but it is likely 
to break. What you want is not possible with the analysis API of Lucene, 
because some information is missing: absolute positions are not needed during 
analysis and are not important for the indexer, so TokenStreams don't preserve 
them.
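
A rough sketch of the offset idea, in plain Java without any Lucene 
dependency. The names OffsetRelation, Tok and relate are hypothetical and only 
for illustration; Tok stands in for a term plus the start/end values an 
OffsetAttribute would carry (end is exclusive). It relates each token of a 
later stage to the earlier-stage tokens whose offset range overlaps it, and it 
only gives sensible answers as long as the filters report correct offsets:

```java
import java.util.ArrayList;
import java.util.List;

class OffsetRelation {
    /** Hypothetical stand-in for a term plus its start/end character offsets. */
    static final class Tok {
        final String term;
        final int start; // offset of the first character
        final int end;   // one past the last character

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * For each token of the later stage, collect the indexes of the
     * earlier-stage tokens whose [start, end) range overlaps it.
     */
    static List<List<Integer>> relate(List<Tok> before, List<Tok> after) {
        List<List<Integer>> result = new ArrayList<>();
        for (Tok a : after) {
            List<Integer> parents = new ArrayList<>();
            for (int i = 0; i < before.size(); i++) {
                Tok b = before.get(i);
                if (a.start < b.end && b.start < a.end) {
                    parents.add(i);
                }
            }
            result.add(parents);
        }
        return result;
    }
}
```

A token that was dropped (e.g. by a stop filter) simply never shows up as a 
parent, and a token that was split shows up as the parent of several outputs.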

A possibility to preserve the original positions would be a trick in the 
analysis RequestHandler: it could insert a fake TokenFilter directly after the 
Tokenizer that adds an additional Attribute holding the absolute position 
(incremented on each call to input.incrementToken()). This could be a hack to 
achieve what you want.
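
To illustrate the hack, here is a minimal model in plain Java. It does not use 
the Lucene API: PositionStampSketch, Tok and all methods are hypothetical 
stand-ins (lists instead of TokenStreams, a field instead of a real 
Attribute). It only shows the principle that a counter stamped directly after 
the tokenizer survives later filters that drop or split tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PositionStampSketch {
    /** Hypothetical token: a term plus the extra "absolute position" value (-1 = not stamped). */
    static final class Tok {
        final String term;
        final int absPos;

        Tok(String term, int absPos) {
            this.term = term;
            this.absPos = absPos;
        }

        @Override
        public String toString() {
            return term + "@" + absPos;
        }
    }

    /** Stand-in for the Tokenizer: splits on whitespace and knows nothing about positions. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(new Tok(t, -1));
        }
        return out;
    }

    /** The "fake filter": stamps every token leaving the tokenizer with a running counter. */
    static List<Tok> stamp(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        int counter = 0;
        for (Tok t : in) out.add(new Tok(t.term, counter++));
        return out;
    }

    /** Stand-in for a stop filter: drops tokens; the survivors keep their stamp. */
    static List<Tok> dropStopWords(List<Tok> in, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            if (!stopWords.contains(t.term)) out.add(t);
        }
        return out;
    }

    /** Stand-in for a word-delimiter-like filter: every part inherits the stamp of its source. */
    static List<Tok> splitOnHyphen(List<Tok> in) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            for (String part : t.term.split("-")) {
                if (!part.isEmpty()) out.add(new Tok(part, t.absPos));
            }
        }
        return out;
    }

    /** Whole chain: tokenizer, stamping filter, then the filters that drop and split. */
    static List<Tok> analyze(String text, Set<String> stopWords) {
        return splitOnHyphen(dropStopWords(stamp(tokenize(text)), stopWords));
    }
}
```

For the input "the Wi-Fi router" with "the" as a stop word, the surviving 
tokens carry the stamps 1, 1 and 2, so a client can see that position 0 was 
dropped and that position 1 was split into two tokens.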

Maybe I can help you with this. It needs some refactoring in the 
AnalysisRequestHandlers, but it might be a good idea.

> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
>                 Key: SOLR-2400
>                 URL: https://issues.apache.org/jira/browse/SOLR-2400
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Stefan Matheis (steffkes)
>            Priority: Minor
>         Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
> 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML output (simplified example attached) is missing one small piece of 
> information which could be very useful for building a nice analysis output, 
> and that's "token relation" (if there is a special/correct word for this, 
> please correct me).
> Meaning that it is currently not possible to "follow" the analysis process 
> completely, because the Tokenizers/Filters may drop tokens (e.g. StopWord) 
> or split one into multiple tokens (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible 
> to create an improved analysis page for the new Solr Admin (SOLR-2399); a 
> short scribble is attached.

