Re: Query with exact number of tokens

Steve Rowe Fri, 21 Sep 2018 09:05:17 -0700

Hi Sergio,

Chris “Hoss” Hostetter has a solution to this kind of problem here:
https://lists.apache.org/thread.html/6b0f0cb864aa55f0a9eadfd92d27d374ab8deb16e8131ed2b7234463@%3Csolr-user.lucene.apache.org%3E
See also the suggestions in the comments on SOLR-12673[1], which include a
version of Hoss’s solution.


Hoss’s solution assumes a multivalued StrField with values counted using
CountFieldValuesUpdateProcessorFactory, which doesn’t apply to you.  You could
instead count unique tokens in an analyzed field using the
StatelessScriptUpdateProcessorFactory[2][3], e.g. see slides 10 and 11 of Erik
Hatcher’s Lucene/Solr Revolution 2013 talk[4].

Your script could look something like this (untested; replace "<field type>"
with your field type and "<field>" with the name of the field whose tokens
you want to count):

=====
// Count the distinct terms the analyzer produces for fieldValue.
function getUniqueTokenCount(analyzer, fieldName, fieldValue) {
  var uniqueTokens = {};
  var stream = analyzer.tokenStream(fieldName, fieldValue);
  var termAttr = stream.getAttribute(Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
  stream.reset();
  while (stream.incrementToken()) { uniqueTokens[termAttr.toString()] = 1; }
  stream.end();
  stream.close();
  return Object.keys(uniqueTokens).length;
}
function processAdd(cmd) {
  var doc = cmd.solrDoc;  // the SolrInputDocument being added
  var content = doc.getFieldValue("<field>");
  if (content == null) { return; }  // nothing to count
  var analyzer = req.getCore().getLatestSchema().getFieldTypeByName("<field type>").getIndexAnalyzer();
  doc.setField("unique_token_count_i", getUniqueTokenCount(analyzer, null, String(content)));
}
function processDelete(cmd) { }
function processMergeIndexes(cmd) { }
function processCommit(cmd) { }
function processRollback(cmd) { }
function finish() { }
=====

And your query could then look something like this (replace "<field>" with your
field name)[5][6][7]:

=====
fq={!frange l=0 h=0}sub(unique_token_count_i,sum(termfreq(<field>,'CENTURY'),termfreq(<field>,'BANCORP'),termfreq(<field>,'INC')))
=====

Note that to construct the query ^^ you’ll need to tokenize and uniquify terms 
on the client side - if tokenization is non-trivial, you could use Solr's Field 
Analysis API[8] to perform tokenization for you.
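
As a rough client-side sketch of that step (also untested against a live Solr):
the tokenizer below is only a hypothetical stand-in for your real analysis
chain (strip non-alphanumerics, split on whitespace, lowercase, along the lines
Erick suggests below), and the function names are made up for illustration.
"company_name" is borrowed from Erick's example and "unique_token_count_i" from
the script above. The terms you pass to termfreq() have to match the indexed
form of the tokens.

```javascript
// Hypothetical stand-in for the field's index analyzer: drop punctuation,
// split on whitespace, lowercase. Replace with whatever your chain really does.
function simpleTokenize(text) {
  return text
    .replace(/[^A-Za-z0-9\s]/g, "")
    .split(/\s+/)
    .filter(function (t) { return t.length > 0; })
    .map(function (t) { return t.toLowerCase(); });
}

// Keep the first occurrence of each token, preserving order.
function uniquify(tokens) {
  var seen = {};
  return tokens.filter(function (t) {
    if (Object.prototype.hasOwnProperty.call(seen, t)) { return false; }
    seen[t] = true;
    return true;
  });
}

// Build the value of the fq parameter: one termfreq() per unique query term,
// summed and subtracted from the stored unique-token count.
function buildExactTokenFilter(fieldName, countField, text) {
  var terms = uniquify(simpleTokenize(text));
  if (terms.length === 0) {
    throw new Error("no tokens left after tokenization");
  }
  var freqs = terms.map(function (t) {
    return "termfreq(" + fieldName + ",'" + t + "')";
  });
  return "{!frange l=0 h=0}sub(" + countField + ",sum(" + freqs.join(",") + "))";
}

console.log(buildExactTokenFilter(
  "company_name", "unique_token_count_i", "CENTURY BANCORP, INC."));
```

The stand-in tokenizer only emits alphanumeric tokens, so no quoting of the
terms is attempted; a richer analysis chain would need proper escaping.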

[1] https://issues.apache.org/jira/browse/SOLR-12673
[2] https://wiki.apache.org/solr/ScriptUpdateProcessor
[3] https://lucene.apache.org/solr/7_4_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
[4] https://www.slideshare.net/erikhatcher/solr-indexing-and-analysis-tricks
[5] https://lucene.apache.org/solr/guide/7_4/other-parsers.html#OtherParsers-FunctionRangeQueryParser
[6] https://lucene.apache.org/solr/guide/7_4/function-queries.html#termfreq-function
[7] https://lucene.apache.org/solr/guide/7_4/function-queries.html#sub-function
[8] https://lucene.apache.org/solr/guide/7_4/implicit-requesthandlers.html#analysis-handlers

--
Steve
www.lucidworks.com

> On Sep 21, 2018, at 10:45 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> A variant on Alexandre's approach is:
> at index time, count the tokens that will be produced yourself (this
> may be a little tricky, you shouldn't have WordDelimiterFilterFactory
> in your analysis for instance).
> Put the number of tokens in a separate field
> At query time, you'd search q=+company_name:(+century +bancorp +inc)
> +tokens_in_company_name_field:3
> 
> You don't need phrase queries with this approach, order doesn't matter.
> 
> It can get tricky though, should "CENTURY BANCORP, INC." and "CENTURY
> BANCORP, INCORPORATED." match?
> 
> Again, though, this means your indexing code has to do the same thing
> as your analysis chain. Which isn't very hard if the analysis chain is
> simple. I might use a char _filter_ factory to remove all
> non-alphanumeric characters, then a whitespace tokenizer and
> (probably) a lowercasefilter. That's pretty easy to replicate in order
> to count tokens.
> 
> Best,
> Erick
> On Fri, Sep 21, 2018 at 7:18 AM Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>> 
>> I think you can match everything in the query to the field using either
>> 1) disMax/eDisMax with mm=100%
>> https://lucene.apache.org/solr/guide/7_4/the-dismax-query-parser.html#mm-minimum-should-match-parameter
>> 2) Complex Phrase Query Parser with inOrder=false:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>> 
>> The number of tokens though is hard. You only know what your tokens
>> are at the end of the indexing pipeline. And during search, the tokens
>> are looked up from their indexes and only then the documents are
>> looked up.
>> 
>> You may be able to do this with a custom PostFilter that would run after
>> everything else to just reject records with extra tokens. That would
>> not be too expensive.
>> 
>> Or (possibly simpler way) you could try to precalculate things, by
>> writing a custom TokenFilter that takes a stream and returns token
>> count to be used as a copyField target. Then you send your query to
>> the same field with any full-query preserving syntax, either as a
>> phrase or as a field query parser:
>> https://lucene.apache.org/solr/guide/7_4/other-parsers.html#complex-phrase-query-parser
>> 
>> I would love to know if any/all of this works for you.
>> 
>> Regards,
>>   Alex.
>> 
>> On 21 September 2018 at 09:00, marotosg <marot...@gmail.com> wrote:
>>> Hi,
>>> 
>>> I have to search for company names where my first requirement is to find
>>> only exact matches on the company name.
>>> 
>>> For instance if I search for "CENTURY BANCORP, INC." I shouldn't find "NEW
>>> CENTURY BANCORP, INC."
>>> because the result company has the extra keyword "NEW".
>>> 
>>> I can't use exact match because the sequence of tokens may differ. Basically
>>> I need to find results where the  tokens are the same in any order and the
>>> number of tokens match.
>>> 
>>> I have no idea whether it's possible to include the number of tokens in
>>> the query and have a Solr field hold that count to match against.
>>> 
>>> Thanks for your help
>>> Sergio
>>> 
>>> 
>>> 
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
