[jira] Commented: (SOLR-1954) Highlighter component should expose snippet character offsets and the score.

Hoss Man (JIRA) Fri, 18 Jun 2010 14:49:46 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880338#action_12880338
 ]


Hoss Man commented on SOLR-1954:
--------------------------------

bq. That way the highlighting section remains untouched, with extra stuff in a 
'highlighting-extended-info'

that really seems painful -- i think it would be a lot better to just come up 
with what the "new" structure should look like that's more flexible, populate 
it with more/less data based on what param the user asks for (ie: 
hl.positions=true) and then make this new structure the default for all future 
versions of solr.  Folks who don't want the new types of metadata, and don't 
want to change their clients to understand the new structure can add some param 
to their defaults to revert the format.  this is how we've dealt with several 
other changes in the past where we want the "default" behavior to be different 
for new users, but still support the old behavior for legacy users.

(spellcheck.extendedResults may seem painful because it changes results -- but 
that's because it was never intended for you to toggle it on different requests 
-- it's expected that you'll set it once and forget it -- the real problem is 
that it probably should have been made the default)
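
A minimal sketch of the format-switching idea above, in plain Java with no Solr 
classes.  Everything here is an assumption made for illustration: the Snippet 
type, the map keys, and the legacyFormat flag (standing in for the unnamed 
"some param" that reverts the format) are hypothetical, and hl.positions is 
only the example name floated above, not an existing parameter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Solr code: one formatter that emits either the
// legacy shape (bare strings) or a richer per-snippet structure.
final class HighlightResponseSketch {

    // Illustrative snippet holder; offsets are assumed to be char offsets.
    static final class Snippet {
        final String text;
        final int start;
        final int end;

        Snippet(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // legacyFormat: stands in for a "revert the format" default param.
    // positions:    stands in for something like hl.positions=true.
    static List<Object> format(List<Snippet> snippets, boolean legacyFormat, boolean positions) {
        List<Object> out = new ArrayList<>();
        for (Snippet s : snippets) {
            if (legacyFormat) {
                out.add(s.text); // old clients keep seeing plain strings
                continue;
            }
            Map<String, Object> entry = new LinkedHashMap<>();
            entry.put("text", s.text);
            if (positions) {
                // extra metadata is only populated when asked for
                entry.put("startOffset", s.start);
                entry.put("endOffset", s.end);
            }
            out.add(entry);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Snippet> snippets = new ArrayList<>();
        snippets.add(new Snippet("DDR <em>SDRAM</em> Unbuffered", 36, 41));
        System.out.println(format(snippets, true, false));
        System.out.println(format(snippets, false, true));
    }
}
```

The point of the sketch is only that a single structure can grow or shrink 
with the request params, while one set-and-forget switch keeps legacy clients 
working.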

bq. The problem with offsets is.... what are the units? utf8 bytes, utf16 
units, real characters? 

1) isn't highlighting fairly fundamentally character based?  would you ever 
want/expect a highlight position to be based on bytes that break up a logical 
character?
2) being largely ignorant of highlighting, i would say the units should be in 
whatever the Highlighter currently uses when indexing into string values -- my 
understanding is that it's the same as the start/end offsets in tokens, so if 
they are char then it's char, if they are bytes, then it's bytes.
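
To make the units question concrete, here is a small standalone Java sketch 
(the OffsetUnits class is hypothetical, not part of Solr or Lucene) that 
measures the same string in the three candidate units.  It assumes nothing 
about which unit the Highlighter actually reports; it only shows that the 
three disagree as soon as the text leaves ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only: counts one string in
// UTF-8 bytes, UTF-16 code units (Java char), and Unicode code points.
final class OffsetUnits {

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Units(String s) {
        return s.length(); // String.length() counts char values, not characters
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "caf" + U+00E9 + space + U+1D54A (a character outside the BMP,
        // stored in Java as a surrogate pair)
        String s = "caf\u00e9 \uD835\uDD4A";
        System.out.println("utf8 bytes:  " + utf8Bytes(s));   // 10
        System.out.println("utf16 units: " + utf16Units(s));  // 7
        System.out.println("code points: " + codePoints(s));  // 6
    }
}
```

Whatever unit is chosen, a client in another language has to convert if its 
own string indexing uses a different one.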

bq. Walter Underwood proposed a good idea of just alternating segments of text 
for highlighting.

I like that idea, and if structured properly it can still include the "score" 
for each matching chunk as metadata,  but some clients are still going to 
prefer offset metadata -- in particular the situation where i've got a 20MB 
text file in external storage and i want to display the entire document with 
matches highlighted.  returning alternating strings isn't really going to 
help me unless they aren't truncated - at which point you are returning the 
entire 20MB doc (broken up in a bunch of distinct strings) instead of just 
returning a bunch of numbers i can use to find the corresponding points in my 
local copy of the file.
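
A rough sketch of that client-side use case: given only start/end numbers, 
wrap the matching ranges of a locally held copy of the text.  The 
OffsetHighlighter class and its method are hypothetical, and the sketch 
assumes the offsets are Java char indices (start inclusive, end exclusive), 
sorted and non-overlapping; none of that is promised by the patch under 
discussion.

```java
// Hypothetical client-side helper, not Solr code: applies highlight
// offsets to a local copy of the document text.
final class OffsetHighlighter {

    // spans[i] = {start, end}; assumed sorted, non-overlapping, end exclusive.
    static String highlight(String text, int[][] spans, String pre, String post) {
        StringBuilder out = new StringBuilder(text.length());
        int cursor = 0;
        for (int[] span : spans) {
            int start = span[0];
            int end = span[1];
            if (start < cursor || end < start || end > text.length()) {
                throw new IllegalArgumentException(
                        "spans must be sorted, non-overlapping, and within the text");
            }
            out.append(text, cursor, start);                       // unmatched text
            out.append(pre).append(text, start, end).append(post); // matched text
            cursor = end;
        }
        out.append(text, cursor, text.length());                   // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String local = "CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered";
        int start = local.indexOf("SDRAM");
        int[][] spans = { { start, start + "SDRAM".length() } };
        System.out.println(highlight(local, spans, "<em>", "</em>"));
    }
}
```

The server only has to ship a handful of ints per match; the 20MB of text 
never leaves the client.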

> Highlighter component should expose snippet character offsets and the score.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1954
>                 URL: https://issues.apache.org/jira/browse/SOLR-1954
>             Project: Solr
>          Issue Type: New Feature
>          Components: highlighter
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: SOLR-1954_start_and_end_offsets.patch
>
>
> The Highlighter Component does not currently expose the snippet character 
> offsets nor the score.  There is a TODO in DefaultSolrHighlighter indicating 
> the intention to add this eventually.  This information is needed when doing 
> highlighting on external content.  The data is there so it's pretty easy to 
> output it in some way.  The challenge is deciding on the output and its 
> ramifications on backwards compatibility.  The current highlighter component 
> response structure doesn't lend itself to adding any new data, unfortunately. 
>  I wish the original implementer had some foresight.  Unfortunately all the 
> highlighting tests assume this structure.  Here is a snippet of the current 
> response structure in Solr's sample data searching for "sdram" for reference:
> {code:xml}
> <lst name="highlighting">
>  <lst name="VS1GB400C3">
>   <arr name="text">
>       <str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; 
> Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
>   </arr>
>  </lst>
> </lst>
> {code}
> Perhaps as a little hack, we introduce a pseudo field called 
> text_startCharOffset which is the concatenation of the matching field and 
> "_startCharOffset".  This would be an array of ints.  Likewise, there would 
> be another array for endCharOffset and score.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


