Re: Multi-valued fields and TokenStream

Robert Muir Thu, 06 Nov 2014 12:21:06 -0800

Do the concatenation yourself with your own TokenStream. You can index
a field with a TokenStream for expert cases (the individual stored
values can be added separately).
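
A minimal sketch of what "doing the concatenation yourself" could look
like, written as a library-free model rather than real Lucene code. The
class MultiValueConcatenator, its Token type, and the stride parameter
are all hypothetical names for illustration; the whitespace split stands
in for whatever analysis the field actually needs. It only shows the
position arithmetic for the use case quoted below, where the first token
of each value starts at a multiple of 1000.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free model of concatenating a multi-valued field into one token
// stream. Nothing here is Lucene API; all names are hypothetical.
final class MultiValueConcatenator {

    /** One emitted token: term, gap from the previous token, absolute position. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int position;

        Token(String term, int positionIncrement, int position) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.position = position;
        }
    }

    /**
     * Whitespace-tokenizes each value and emits all tokens as one stream.
     * The first token of every value lands on the next multiple of stride.
     */
    static List<Token> concatenate(List<String> values, int stride) {
        if (stride <= 0) {
            throw new IllegalArgumentException("stride must be positive");
        }
        List<Token> out = new ArrayList<>();
        int lastPosition = -1; // nothing emitted yet, so an increment of 1 yields position 0
        for (String value : values) {
            boolean firstTokenOfValue = true;
            for (String term : value.trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // a blank value produces no tokens
                }
                // First token of a value: jump to the next multiple of stride.
                // Later tokens: advance by one position.
                int position = firstTokenOfValue
                        ? (Math.floorDiv(lastPosition, stride) + 1) * stride
                        : lastPosition + 1;
                out.add(new Token(term, position - lastPosition, position));
                lastPosition = position;
                firstTokenOfValue = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("quick brown fox", "lazy dog");
        for (Token t : concatenate(values, 1000)) {
            System.out.println(t.term + " +" + t.positionIncrement + " @" + t.position);
        }
    }
}
```

Because one stream sees every value, it always knows whether it is on
the first value or a later one, so no change to reset() is needed. In
Lucene the same arithmetic would go into the position increments emitted
by the custom TokenStream, with the stored values added as separate
fields.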


No need to make the TokenStream API more complicated: it's already very
complicated.

On Thu, Nov 6, 2014 at 3:13 PM, [email protected]
<[email protected]> wrote:
> Are you suggesting that DefaultIndexingChain.PerField.invert(boolean
> firstValue) would, prior to calling reset(), call
> setPositionIncrement(Integer.MAX_VALUE), but only when ‘firstValue’ is
> false?  Hmmmm.  I guess that would work, although it seems a bit hacky and
> it’s tying this to a specific attribute when ideally we notify the chain as
> a whole what’s going on.  But it doesn’t require any new API, save for some
> javadocs.  And it’s extremely unlikely there would be a
> backwards-incompatible problem, so that’s good.  And I find this use is
> related to positions so it’s not so bad to abuse the position increment for
> this.  Nice idea Steve; this works for me.
>
> Does anyone else have an opinion before I create an issue?
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Thu, Nov 6, 2014 at 2:13 PM, Steve Rowe <[email protected]> wrote:
>>
>> Maybe the position increment gap would be useful?  If set to a value
>> larger than likely max position for any individual value, it could be used
>> to infer (non-)first-value-ness.
>>
>> > On Nov 5, 2014, at 1:03 PM, [email protected] wrote:
>> >
>> > Several times now, I’ve had to come up with work-arounds for a
>> > TokenStream not knowing it’s processing the first value or a
>> > subsequent-value of a multi-valued field.  Two of these times, the use-case
>> > was ensuring the first position of each value started at a multiple of 1000
>> > (or some other configurable value), and the third was encoding sentence
>> > paragraph counters (similar to a do-it-yourself position increment).
>> >
>> > The work-arounds are awkward and hacky.  For example if you’re in
>> > control of your Tokenizer, you can prefix subsequent values with a special
>> > flag, and then do the right thing in reset().  But then the highlighter or
>> > value retrieval in general is impacted.  It’s also possible to create the
>> > fields with the constructor that accepts a TokenStream that you’ve told
>> > it’s the first or subsequent value but it’s awkward going that route, and
>> > sometimes (e.g. Solr) it’s hard to know all the values you have up-front to
>> > even do that.
>> >
>> > It would be nice if TokenStream.reset() took a boolean ‘first’ argument.
>> > Such a change would obviously be backwards incompatible.  Simply
>> > overloading the method to call the no-arg version is problematic because
>> > TokenStreams are a chain, and it would likely result in the chain getting
>> > doubly-reset.
>> >
>> > Any ideas?
>> >
>> > ~ David Smiley
>> > Freelance Apache Lucene/Solr Search Consultant/Developer
>> > http://www.linkedin.com/in/davidwsmiley
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

