[jira] [Commented] (LUCENE-4312) Index format to store position length per position

Michael Gibney (JIRA) Tue, 09 Jul 2019 12:03:11 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881477#comment-16881477
 ]


Michael Gibney commented on LUCENE-4312:
----------------------------------------

This sounds potentially like a good way to proceed. I appreciate the need for a 
high bar for getting things into the index – I suppose I was invoking 
"chicken/egg" not directly as an argument for inclusion in the index, but 
rather to highlight the interdependence of these features.

Essentially all of the proof-of-concept work that we're discussing here is 
already implemented as part of the LUCENE-7398 work, and has been running in 
production (and iteratively improved) for over a year. Before proceeding, I'd 
like to get some consensus on what the best way is to move forward, and also 
perhaps have some discussion of what bar we have in mind for "once these prove 
useful".

Regarding usefulness, and the question of to what extent this represents a 
corner-case: anybody interested in index-time synonyms and precise positional 
queries needs this feature. So in some sense this boils down to a question of 
the usefulness of index-time synonyms (or other index-time TokenStream graphs) 
... and since the standing recommendation has for some time been to _avoid_ 
using index-time synonyms, we have another chicken/egg :). I can say that this 
has been very helpful in my use case, and the problem it addresses is 
at the root of a number of recurring issues reported by users of both 
[Elasticsearch|https://discuss.elastic.co/t/not-getting-results-from-a-phrase-query-using-query-string-of-the-form-x-a1-abc-in-6-6-0/179191]
 and 
[Solr|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201905.mbox/%3CCAF%3DheHETPCqxUcqyu13tFfKFcALzD__-QrToRBP-VVWh1S3-Wg%40mail.gmail.com%3E].
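
To make the failure mode concrete, here is a minimal standalone sketch. It is not Lucene code and not the LUCENE-7398 implementation; the class name, the helper methods, and the "dns" -> "domain name system" synonym are all hypothetical and chosen only for illustration. It models each token as an arc with a start position and a position length, and runs a naive exact-phrase check twice: once assuming every token has length 1 (roughly what happens when position length is not stored), and once honoring the stored length.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these names exist in Lucene.
class PositionLengthSketch {

    /** One token as an arc in the token graph: it spans [position, position + length). */
    static final class Token {
        final String term;
        final int position;
        final int length;

        Token(String term, int position, int length) {
            this.term = term;
            this.position = position;
            this.length = length;
        }
    }

    /** Tokens for "dns server" with an assumed index-time synonym dns -> "domain name system". */
    static List<Token> sampleTokens() {
        return Arrays.asList(
                new Token("dns", 0, 3),      // spans three positions, like its expansion
                new Token("domain", 0, 1),
                new Token("name", 1, 1),
                new Token("system", 2, 1),
                new Token("server", 3, 1));
    }

    /** Position at which the token ends; length is treated as 1 when it is not honored. */
    private static int end(Token t, boolean honorLength) {
        return t.position + (honorLength ? t.length : 1);
    }

    /** Naive exact-phrase match: each term must start where the previous one ended. */
    static boolean matchesPhrase(List<Token> tokens, List<String> phrase, boolean honorLength) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Token t : tokens) {
            if (t.term.equals(phrase.get(0))
                    && matchesFrom(tokens, phrase, 1, end(t, honorLength), honorLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(List<Token> tokens, List<String> phrase,
                                       int index, int position, boolean honorLength) {
        if (index == phrase.size()) {
            return true;
        }
        for (Token t : tokens) {
            if (t.position == position && t.term.equals(phrase.get(index))
                    && matchesFrom(tokens, phrase, index + 1, end(t, honorLength), honorLength)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        List<String> dnsServer = Arrays.asList("dns", "server");
        List<String> dnsName = Arrays.asList("dns", "name");

        // Length ignored: the valid phrase is missed and a spurious one matches.
        System.out.println(matchesPhrase(tokens, dnsServer, false)); // false
        System.out.println(matchesPhrase(tokens, dnsName, false));   // true
        // Length honored: results agree with the original text "dns server".
        System.out.println(matchesPhrase(tokens, dnsServer, true));  // true
        System.out.println(matchesPhrase(tokens, dnsName, true));    // false
    }
}
```

With the length dropped, "dns" appears to end at position 1, so the phrase "dns server" is missed while "dns name" matches spuriously. Keeping the length restores both results, which is the kind of precision that depends on the index actually recording it.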

Practically speaking, I'm wondering what's the best way to get the most eyes on 
this feature set, with the goal of evaluating its usefulness and performance. 
The fix as currently implemented is basically a wholesale rewrite of some of 
the Spans classes, but it seeks to correctly support existing Spans contracts; 
because it was implemented as a branch, I was able to rely on existing tests 
against various Spans. For performance reasons, changes were also introduced in indexing code 
(e.g., DefaultIndexingChain). For these reasons, my sense is that it would be 
quite challenging to extract these features into the sandbox module. Even if 
such "sandbox" extraction were possible, it would render the task of evaluation 
more difficult for all but the most dedicated users (currently it suffices to 
run a forked build to swap out the backend Spans implementation in all of the 
parsers and components that rely on Spans). Could these potentially be reasons 
to opt for a "branch" approach (as opposed to "sandbox")?

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format (and 
> Codec APIs) to store an additional int position length per position.


