[jira] Commented: (SOLR-2400) FieldAnalysisRequestHandler; add information about token-relation

Uwe Schindler (JIRA) Thu, 03 Mar 2011 23:42:04 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002491#comment-13002491
 ]


Uwe Schindler commented on SOLR-2400:
-------------------------------------

Stefan, this is a general issue with TokenStreams that add tokens. TokenStreams 
that remove tokens *should* automatically preserve positions, but not even all 
of those do that correctly (we fixed some of them lately). The way Lucene 
analysis works makes it impossible to guarantee any correspondence of the 
position numbers between steps. Because the indexer only cares about what comes 
out at the end, the steps in between are impossible to track reliably. 
AnalysisReqHandler, on the other hand, uses some bad "hacks" to look "inside" 
the analysis (temporary TokenStreams that buffer tokens), which is not the 
general use case of TokenStreams.
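
To illustrate what "preserving position" means for a filter that removes 
tokens, here is a minimal sketch in plain Python, not Lucene's Java API. The 
names Token, stop_filter and absolute_positions are hypothetical and exist only 
for this illustration; the only idea taken from Lucene is that each token 
carries a position increment relative to the previous token.

```python
from typing import Iterable, Iterator, List, NamedTuple, Set, Tuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # distance to the previous token's position (1 = adjacent)


def stop_filter(tokens: Iterable[Token], stop_words: Set[str]) -> Iterator[Token]:
    """Drop stop words and carry their increments over to the next kept token."""
    skipped = 0
    for tok in tokens:
        if tok.term in stop_words:
            # Remember the gap instead of silently discarding it.
            skipped += tok.pos_inc
        else:
            yield Token(tok.term, tok.pos_inc + skipped)
            skipped = 0


def absolute_positions(tokens: Iterable[Token]) -> List[Tuple[str, int]]:
    """Turn relative increments into absolute positions (first token = 1)."""
    pos = 0
    out = []
    for tok in tokens:
        pos += tok.pos_inc
        out.append((tok.term, pos))
    return out


stream = [Token("the", 1), Token("quick", 1), Token("fox", 1)]
filtered = list(stop_filter(stream, {"the"}))
print(absolute_positions(filtered))
```

In this sketch "quick" keeps position 2 after "the" is removed, so the gap 
stays visible. A filter that forgot to carry the increment would report 
position 1, and the numbers would no longer line up between steps.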

I wonder a little bit about your XML file: it only contains text and position, 
but it should also contain rawTerm, startOffset and endOffset. When I call the 
analysis, I get all of those attributes, not only two of them. Is this a 
hand-made file, or what is the problem? Which Solr version?

One way to handle this might be to use the character offsets instead of the 
token position, because they should point to the begin and end of the token in 
the original text. But this is likely to break for lots of TokenFilters 
(WordDelimiterFilter would work as long as you don't do stemming before...). 
The problem is incorrect offset calculation (which also leads to bugs in 
highlighting) when the inserted terms are longer than their originals.
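
As a rough sketch of the offset idea, again in plain Python and not Lucene 
code: tokens of one step are related to tokens of the previous step when their 
character ranges in the original text overlap. Token and parents_by_offset are 
hypothetical names, and the offsets of the split tokens are an assumption about 
what a well-behaved word-delimiter step would report.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start: int  # start offset in the original text (inclusive)
    end: int    # end offset in the original text (exclusive)


def parents_by_offset(child: Token, previous_stage: List[Token]) -> List[Token]:
    """Return the previous-stage tokens whose offset range overlaps the child's."""
    return [p for p in previous_stage if p.start < child.end and child.start < p.end]


# Original text: "Wi-Fi router"
tokenized = [Token("Wi-Fi", 0, 5), Token("router", 6, 12)]
# Assumed output of a word-delimiter-like step that sets sub-token offsets.
split = [Token("Wi", 0, 2), Token("Fi", 3, 5), Token("router", 6, 12)]

relation = {child: parents_by_offset(child, tokenized) for child in split}
for child, parents in relation.items():
    print(child.term, "<-", [p.term for p in parents])
```

This only holds while every step reports offsets that still point into the 
original text. As soon as a filter emits wrong or approximated offsets, the 
lookup returns wrong or missing parents.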

Altogether: it's unlikely that you can implement this so that it works for all 
combinations of TokenStream components.

> FieldAnalysisRequestHandler; add information about token-relation
> -----------------------------------------------------------------
>
>                 Key: SOLR-2400
>                 URL: https://issues.apache.org/jira/browse/SOLR-2400
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Stefan Matheis (steffkes)
>            Priority: Minor
>         Attachments: 110303_FieldAnalysisRequestHandler_output.xml, 
> 110303_FieldAnalysisRequestHandler_view.png
>
>
> The XML output (simplified example attached) is missing one small piece of 
> information which could be very useful to build a nice analysis output, and 
> that's "Token-Relation" (if there is a special/correct word for this, please 
> correct me).
> Meaning that it is actually not possible to "follow" the analysis process 
> (completely) while the Tokenizers/Filters drop tokens (e.g. StopWord) 
> or split them into multiple tokens (e.g. WordDelimiter).
> Would it be possible to include this information? If so, it would be possible 
> to create an improved Analysis-Page for the new Solr Admin (SOLR-2399) - 
> short scribble attached

