[
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881477#comment-16881477
]
Michael Gibney commented on LUCENE-4312:
----------------------------------------
This sounds like a potentially good way to proceed. I appreciate the need for a
high bar for getting things into the index – I suppose I was invoking
"chicken/egg" not directly as an argument for inclusion in the index, but
rather to highlight the interdependence of these features.
Essentially all of the proof-of-concept work that we're discussing here is
already implemented as part of the LUCENE-7398 work, and has been running in
production (and iteratively improved) for over a year. Before proceeding, I'd
like to get some consensus on the best way to move forward, and perhaps also
discuss what bar we have in mind for "once these prove useful".
Regarding usefulness, and the question of to what extent this represents a
corner-case: anybody interested in index-time synonyms and precise positional
queries needs this feature. So in some sense this boils down to a question of
the usefulness of index-time synonyms (or other index-time TokenStream graphs)
... and since the standing recommendation has for some time been to _avoid_
using index-time synonyms, we have another chicken/egg :). I can say that this
has helped considerably in my use case, and the problem it addresses is at the
root of a number of recurring issues reported by users of both
[Elasticsearch|https://discuss.elastic.co/t/not-getting-results-from-a-phrase-query-using-query-string-of-the-form-x-a1-abc-in-6-6-0/179191]
and
[Solr|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201905.mbox/%3CCAF%3DheHETPCqxUcqyu13tFfKFcALzD__-QrToRBP-VVWh1S3-Wg%40mail.gmail.com%3E].
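To make the failure mode concrete, here is a minimal sketch in plain Java with no
Lucene dependency. Everything in it is an assumption for illustration: the
Token class, sampleGraph, endPosition and matchesPhrase are hypothetical names
(not Lucene APIs), and the token graph assumes an index-time synonym that
expands "wifi" to "wi fi" in the text "fast wifi network". It is not the
LUCENE-7398 implementation; it only shows what is lost when position length is
dropped at index time.

```java
import java.util.Arrays;
import java.util.List;

public class PositionLengthSketch {

    // Hypothetical stand-in for one indexed token (not a Lucene class).
    public static final class Token {
        final String term;
        final int position;        // start position in the token graph
        final int positionLength;  // number of positions the token spans

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    // Assumed graph for "fast wifi network" with the synonym wifi -> "wi fi".
    public static List<Token> sampleGraph() {
        return Arrays.asList(
            new Token("fast", 0, 1),
            new Token("wifi", 1, 2),   // spans the two positions of "wi fi"
            new Token("wi", 1, 1),
            new Token("fi", 2, 1),
            new Token("network", 3, 1));
    }

    // Position at which the next phrase term has to start. Without position
    // length, every token is treated as spanning exactly one position.
    public static int endPosition(Token t, boolean usePositionLength) {
        return t.position + (usePositionLength ? t.positionLength : 1);
    }

    // Exact (slop 0) phrase match: each term must start where the previous ended.
    public static boolean matchesPhrase(List<Token> tokens, List<String> phrase,
                                        boolean usePositionLength) {
        for (Token first : tokens) {
            if (first.term.equals(phrase.get(0))
                    && matchesFrom(tokens, phrase, 1,
                                   endPosition(first, usePositionLength),
                                   usePositionLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(List<Token> tokens, List<String> phrase,
                                       int index, int expectedPosition,
                                       boolean usePositionLength) {
        if (index == phrase.size()) {
            return true;  // every phrase term has been chained
        }
        for (Token t : tokens) {
            if (t.position == expectedPosition
                    && t.term.equals(phrase.get(index))
                    && matchesFrom(tokens, phrase, index + 1,
                                   endPosition(t, usePositionLength),
                                   usePositionLength)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> graph = sampleGraph();
        List<String> wanted = Arrays.asList("wifi", "network");
        List<String> bogus = Arrays.asList("wifi", "fi");
        System.out.println(matchesPhrase(graph, wanted, true));   // true
        System.out.println(matchesPhrase(graph, wanted, false));  // false: missed match
        System.out.println(matchesPhrase(graph, bogus, true));    // false
        System.out.println(matchesPhrase(graph, bogus, false));   // true: false positive
    }
}
```

In this toy model, dropping position length both misses the legitimate phrase
"wifi network" and accepts the spurious phrase "wifi fi", which is the kind of
behavior described in the reports linked above.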
Practically speaking, I'm wondering what's the best way to get the most eyes on
this feature set, with the goal of evaluating its usefulness and performance.
The fix as currently implemented is basically a wholesale rewrite of some of
the Spans classes, but it seeks to correctly support existing Spans contracts;
implemented as a branch, I was able to rely on existing tests against various
Spans. For performance reasons, changes were also introduced in indexing code
(e.g., DefaultIndexingChain). For these reasons, my sense is that it would be
quite challenging to extract these features into the sandbox module. Even if
such "sandbox" extraction were possible, it would render the task of evaluation
more difficult for all but the most dedicated users (currently it suffices to
run a forked build to swap out the backend Spans implementation in all of the
parsers and components that rely on Spans). Could these potentially be reasons
to opt for a "branch" approach (as opposed to "sandbox")?
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> Indexer ignores PositionLengthAttribute. Need to change the index format (and
> Codec APIs) to store an additional int position length per position.