Norms encode the number of tokens in the field, but in a lossy manner (1 byte by default), so you could probably create a custom query that filters based on that, if you can tolerate the loss in precision. Or maybe change your norms storage to use more precision?
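To see what "lossy in 1 byte" means in practice, here is a minimal, self-contained sketch of a one-byte length encoding with 4 exponent bits and 4 mantissa bits. This is an illustrative scheme in the same spirit as Lucene's SmallFloat, not Lucene's actual code: small lengths round-trip exactly, larger lengths lose their low bits.

```java
public class NormLengthEncoding {

    // Illustrative 1-byte encoding: 4 exponent bits, 4 mantissa bits.
    // Lengths below 16 round-trip exactly; larger lengths are rounded down.
    // NOT Lucene's actual SmallFloat implementation, just the same idea.
    static byte encode(int length) {
        if (length < 16) {
            return (byte) length;               // exact for small lengths
        }
        if (length >= (1 << 19)) {
            length = (1 << 19) - 1;             // saturate so the shift fits in 4 bits
        }
        int highBit = 31 - Integer.numberOfLeadingZeros(length);
        int shift = highBit - 3;                // keep only the top 4 bits
        return (byte) ((shift << 4) | ((length >>> shift) & 0x0F));
    }

    static int decode(byte b) {
        int ub = b & 0xFF;
        int shift = ub >>> 4;                   // 0 for small (exact) lengths
        int mantissa = ub & 0x0F;
        return mantissa << shift;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 15, 17, 1000}) {
            System.out.println(n + " -> " + decode(encode(n)));
        }
    }
}
```

A custom query comparing decoded norm values would inherit this rounding, so a filter like "3 or more words" is safe in the range where the encoding is exact, and only approximate beyond it.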
You could use NormsFieldExistsQuery as a starting point for the sources of your custom query. Or maybe there's already a similar Query based on norms?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson <erickerick...@gmail.com> wrote:
> This comes up occasionally; it'd be a neat thing to add to Solr if you're motivated. It gets tricky though.
>
> - Part of the config would have to be the name of the length field to put the result into; that part's easy.
>
> - The trickier part is "when should the count be incremented?" For instance, say you add 15 synonyms for a particular word. Would that add 1 or 16 to the count? What about WordDelimiterGraphFilterFactory, which can output N tokens in place of one? Do stopwords count? What about shingles? CJK languages? The list goes on.
>
> If you tackle this, I suggest you open a JIRA for discussion, probably a Lucene JIRA, because the folks who deal with Lucene would have the best feedback. And probably ignore most of the possible interactions with other filters and document that most users should just put it immediately after the tokenizer and leave it at that ;)
>
> I can think of a few other options, but about the only thing that I think makes sense is something like "countTokensInTheSamePosition=true|false" (there's _GOT_ to be a better name for that!), defaulting to false, so you could control whether synonym expansion and WDGFF insertions incremented the count or not. And I suspect that if you put such a filter after WDGFF, you'd also want to document that it should go after FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my suspicion...
>
> Best,
> Erick
>
> > On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics...@gmail.com> wrote:
> >
> > That is a clever idea. I would still prefer something cleaner, but this could work. Thanks!
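Erick's "would that add 1 or 16?" question is really the distinction between counting tokens and counting positions. A Lucene-free sketch of that distinction (plain Java; the Tok class below is a made-up stand-in for a token plus its PositionIncrementAttribute): a synonym stacked on the previous token has a position increment of 0, so it adds a token but not a position.

```java
import java.util.List;

public class PositionCounter {

    // Hypothetical stand-in for a Lucene token and its position increment.
    static final class Tok {
        final String term;
        final int posInc; // 0 = stacked on the previous position (e.g. a synonym)
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // The "count everything" policy: every emitted token counts.
    static int countTokens(List<Tok> stream) {
        return stream.size();
    }

    // The "count positions" policy: tokens stacked at an existing
    // position (posInc == 0) do not increment the count.
    static int countPositions(List<Tok> stream) {
        int positions = 0;
        for (Tok t : stream) {
            if (t.posInc > 0) positions++;
        }
        return positions;
    }

    public static void main(String[] args) {
        // "quick fox" with "fast" injected as a synonym of "quick"
        List<Tok> stream = List.of(
                new Tok("quick", 1), new Tok("fast", 0), new Tok("fox", 1));
        System.out.println(countTokens(stream) + " tokens, "
                + countPositions(stream) + " positions");
    }
}
```

For the "find titles with 3 or more words" use case, the position count (2 in the example above, not 3) is almost certainly the number users mean, which is why a toggle like the one Erick proposes would matter.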
> > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> >> I don't know of any pre-existing thing that does exactly this, but how about a token filter that counts tokens (or positions maybe), and then appends some special token encoding the length?
> >>
> >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I was wondering if it is possible to search for the number of tokens in a text field. For example, find book titles with 3 or more words. I don't mind adding a field that is the number of tokens to the search index, but I would like to avoid analyzing the text two times. Can Lucene search for the number of tokens in a text field? Or can I get the number of tokens after analysis and add it to the Lucene document before/during indexing? Or do I need to analyze the text myself and add the field to the document (analyze the text twice, once myself, once in the IndexWriter)?
> >>>
> >>> Thanks,
> >>> Matt Davis
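Sokolov's suggestion, sketched without Lucene: a real version would be a TokenFilter that overrides incrementToken() and emits one extra token after the wrapped stream is exhausted; the plain-Java sketch below just works on a list of terms, and the "__len_" marker prefix is a made-up convention, not anything Lucene defines.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthTokenAppender {

    // Pass the tokens through unchanged, then append one synthetic token
    // that encodes the token count. Searching for titles with exactly
    // 3 words then becomes a simple term query on "__len_3"; "3 or more"
    // can be a disjunction over the (bounded) set of __len_* terms.
    static List<String> appendLengthToken(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        out.add("__len_" + tokens.size());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(appendLengthToken(List.of("the", "great", "gatsby")));
    }
}
```

The appeal of this approach is that it answers Matt's original constraint: the count is produced during the one and only analysis pass, so nothing is analyzed twice.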