Norms encode the number of tokens in the field, but in a lossy manner (1 byte by default), so you could probably create a custom query that filters based on that, if you can tolerate the loss in precision. Or maybe change your norms storage to use more precision?
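To see what "lossy in 1 byte" means in practice, here is a minimal, self-contained sketch of a one-byte length encoding with 4 exponent bits and 4 mantissa bits. This is an illustrative scheme in the same spirit as Lucene's SmallFloat, not Lucene's actual code: small lengths round-trip exactly, larger lengths lose their low bits.

```java
public class NormLengthEncoding {

    // Illustrative 1-byte encoding: 4 exponent bits, 4 mantissa bits.
    // Lengths below 16 round-trip exactly; larger lengths are rounded down.
    // NOT Lucene's actual SmallFloat implementation, just the same idea.
    static byte encode(int length) {
        if (length < 16) {
            return (byte) length;               // exact for small lengths
        }
        if (length >= (1 << 19)) {
            length = (1 << 19) - 1;             // saturate so the shift fits in 4 bits
        }
        int highBit = 31 - Integer.numberOfLeadingZeros(length);
        int shift = highBit - 3;                // keep only the top 4 bits
        return (byte) ((shift << 4) | ((length >>> shift) & 0x0F));
    }

    static int decode(byte b) {
        int ub = b & 0xFF;
        int shift = ub >>> 4;                   // 0 for small (exact) lengths
        int mantissa = ub & 0x0F;
        return mantissa << shift;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 15, 17, 1000}) {
            System.out.println(n + " -> " + decode(encode(n)));
        }
    }
}
```

A custom query comparing decoded norm values would inherit this rounding, so a filter like "3 or more words" is safe in the range where the encoding is exact, and only approximate beyond it.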
You could use NormsFieldExistsQuery as a starting point for the sources of your custom query. Or maybe there's already a similar Query based on norms?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson <erickerick...@gmail.com> wrote:
> This comes up occasionally; it'd be a neat thing to add to Solr if you're motivated. It gets tricky though.
>
> - Part of the config would have to be the name of the length field to put the result into; that part's easy.
>
> - The trickier part is "when should the count be incremented?" For instance, say you add 15 synonyms for a particular word. Would that add 1 or 16 to the count? What about WordDelimiterGraphFilterFactory, which can output N tokens in place of one? Do stopwords count? What about shingles? CJK languages? The list goes on.
>
> If you tackle this, I suggest you open a JIRA for discussion, probably a Lucene JIRA, because the folks who deal with Lucene would have the best feedback. And probably ignore most of the possible interactions with other filters and document that most users should just put it immediately after the tokenizer and leave it at that ;)
>
> I can think of a few other options, but about the only thing that I think makes sense is something like "countTokensInTheSamePosition=true|false" (there's _GOT_ to be a better name for that!), defaulting to false, so you could control whether synonym expansion and WDGFF insertions incremented the count or not. And I suspect that if you put such a filter after WDGFF, you'd also want to document that it should go after FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my suspicion...
>
> Best,
> Erick
>
> > On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics...@gmail.com> wrote:
> >
> > That is a clever idea. I would still prefer something cleaner, but this could work. Thanks!
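Erick's "would that add 1 or 16?" question is really the distinction between counting tokens and counting positions. A Lucene-free sketch of that distinction (plain Java; the Tok class below is a made-up stand-in for a token plus its PositionIncrementAttribute): a synonym stacked on the previous token has a position increment of 0, so it adds a token but not a position.

```java
import java.util.List;

public class PositionCounter {

    // Hypothetical stand-in for a Lucene token and its position increment.
    static final class Tok {
        final String term;
        final int posInc; // 0 = stacked on the previous position (e.g. a synonym)
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // The "count everything" policy: every emitted token counts.
    static int countTokens(List<Tok> stream) {
        return stream.size();
    }

    // The "count positions" policy: tokens stacked at an existing
    // position (posInc == 0) do not increment the count.
    static int countPositions(List<Tok> stream) {
        int positions = 0;
        for (Tok t : stream) {
            if (t.posInc > 0) positions++;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "quick fox" with "fast" injected as a synonym of "quick"
        List<Tok> stream = List.of(
                new Tok("quick", 1), new Tok("fast", 0), new Tok("fox", 1));
        System.out.println(countTokens(stream) + " tokens, "
                + countPositions(stream) + " positions");
    }
}
```

For the "find titles with 3 or more words" use case, the position count (2 in the example above, not 3) is almost certainly the number users mean, which is why a toggle like the one Erick proposes would matter.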
> > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> >> I don't know of any pre-existing thing that does exactly this, but how about a token filter that counts tokens (or positions maybe), and then appends some special token encoding the length?
> >>
> >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I was wondering if it is possible to search for the number of tokens in a text field. For example, find book titles with 3 or more words. I don't mind adding a field that is the number of tokens to the search index, but I would like to avoid analyzing the text two times. Can Lucene search for the number of tokens in a text field? Or can I get the number of tokens after analysis and add it to the Lucene document before/during indexing? Or do I need to analyze the text myself and add the field to the document (analyze the text twice, once myself, once in the IndexWriter)?
> >>>
> >>> Thanks,
> >>> Matt Davis
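Sokolov's suggestion, sketched without Lucene: a real version would be a TokenFilter that overrides incrementToken() and emits one extra token after the wrapped stream is exhausted; the plain-Java sketch below just works on a list of terms, and the "__len_" marker prefix is a made-up convention, not anything Lucene defines.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthTokenAppender {

    // Pass the tokens through unchanged, then append one synthetic token
    // that encodes the token count. Searching for titles with exactly
    // 3 words then becomes a simple term query on "__len_3"; "3 or more"
    // can be a disjunction over the (bounded) set of __len_* terms.
    static List<String> appendLengthToken(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        out.add("__len_" + tokens.size());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(appendLengthToken(List.of("the", "great", "gatsby")));
    }
}
```

The appeal of this approach is that it answers Matt's original constraint: the count is produced during the one and only analysis pass, so nothing is analyzed twice.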