This comes up occasionally; it’d be a neat thing to add to Solr if you’re 
motivated. It gets tricky, though.

- Part of the config would have to be the name of the length field to put the 
result into; that part’s easy.

- The trickier part is “when should the count be incremented?” For instance, 
say you add 15 synonyms for a particular word. Would that add 1 or 16 to the 
count? What about WordDelimiterGraphFilterFactory, which can output N tokens in 
place of one? Do stopwords count? What about shingles? CJK languages? The list 
goes on.

If you tackle this, I suggest opening a JIRA for discussion, probably a Lucene 
JIRA, because the folks who work on Lucene would have the best feedback. I’d 
also ignore most of the possible interactions with other filters and simply 
document that most users should put it immediately after the tokenizer and 
leave it at that ;)
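
To make “immediately after the tokenizer” concrete, a fieldType might look 
something like the sketch below. The factory name and the lengthField 
attribute are made up purely for illustration; nothing like this ships with 
Solr today, and how the count actually lands in a separate numeric field would 
be part of the design discussion:

    <fieldType name="text_counted" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- hypothetical filter: counts tokens and records the total in a separate field -->
        <filter class="solr.TokenCountFilterFactory" lengthField="title_token_count"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>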

I can think of a few other options, but about the only thing that I think makes 
sense is something like “countTokensInTheSamePosition=true|false” (there’s 
_GOT_ to be a better name for that!), defaulting to false, so you could control 
whether synonym expansion and WDGFF insertions increment the count or not. 
And I suspect that if you put such a filter after WDGFF, you’d also want to 
document that it should go after FlattenGraphFilterFactory, but trust any 
feedback on a Lucene JIRA over my suspicion...
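
For what it’s worth, a bare-bones sketch of the counting part might look 
roughly like this. The class name, the flag, and the position-based counting 
are just illustrations of the idea above, not existing Lucene code, and 
getting the count into a numeric field on the document is a separate question:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    // Hypothetical sketch: counts tokens as they stream by, optionally
    // skipping tokens stacked at the same position (posIncrement == 0).
    public final class TokenCountingFilter extends TokenFilter {
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private final boolean countTokensInTheSamePosition;
      private int count;

      public TokenCountingFilter(TokenStream input, boolean countTokensInTheSamePosition) {
        super(input);
        this.countTokensInTheSamePosition = countTokensInTheSamePosition;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        // With the flag off, a stacked synonym or WDGFF sub-token
        // (position increment of 0) does not bump the count.
        if (countTokensInTheSamePosition || posIncAtt.getPositionIncrement() > 0) {
          count++;
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        count = 0;
      }

      public int getCount() {
        return count;
      }
    }

Note that removed stopwords never reach such a filter at all, so with this 
approach they simply wouldn’t count; that’s exactly the kind of behavior that 
would need to be nailed down on the JIRA.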

Best,
Erick

> On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics...@gmail.com> wrote:
> 
> That is a clever idea.  I would still prefer something cleaner, but this
> could work.  Thanks!
> 
> On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msoko...@gmail.com> wrote:
> 
>> I don't know of any pre-existing thing that does exactly this, but how
>> about a token filter that counts tokens (or positions maybe), and then
>> appends some special token encoding the length?
>> 
>> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics...@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> I was wondering if it is possible to search for the number of tokens in a
>>> text field.  For example, find book titles with 3 or more words.  I don't
>>> mind adding a field that is the number of tokens to the search index, but
>>> I would like to avoid analyzing the text two times.  Can Lucene search for
>>> the number of tokens in a text field?  Or can I get the number of tokens
>>> after analysis and add it to the Lucene document before/during indexing?
>>> Or do I need to analyze the text myself and add the field to the document
>>> (analyzing the text twice, once myself, once in the IndexWriter)?
>>> 
>>> Thanks,
>>> Matt Davis
>>> 
>> 


