[
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663483#comment-13663483
]
Hoss Man commented on SOLR-4824:
--------------------------------
I'm not very familiar with the FuzzyQuery code in question, but I believe what
Jack is referring to is a limit on the number of _terms_ that a fuzzy query will
consider when it scans the indexed terms (via an automaton, I think?) looking for
terms within a given edit distance of the input.
So it's not an increase in the number of documents that causes the results to
change; it's an increase in the number of terms that are "close" to the
term used in the fuzzy query.
{panel:title=Hoss's Uninformed example/speculation}
Assume for a moment that you have a small index where there are fewer than 50
terms in the "text" field, and you ask for a fuzzy query matching "abcdefg".
The list of "close" terms might be...
* abcdeff
* abcdegg
* abcdegf
* zbcdefg
...and there may be a total of 100 documents matching those 4 terms -- half of
those matches may be due to the last term ("zbcdefg").
If you index a handful of additional documents, but those documents contain
1000+ new terms in the "text" field which are very "close" to the input term,
then the next time you do the same fuzzy query, the expanded query might
become...
* abcdeff
* abcdegg
* ...48 more terms that start with "abcd..."
And "zbcdefg" will be excluded from the expanded query, because the expansion
code will stop looking for additional terms as soon as it finds 50 that are
"close".
So now you will get results based on this new expansion, which may match fewer
documents than were previously found.
{panel}
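The scenario in the panel can be sketched in a few lines of Python. This is only a toy model of the speculation above, not Lucene's actual FuzzyQuery code: the function names (`edit_distance`, `expand_fuzzy`), the sorted-order scan, and the sample terms are all assumptions made for illustration. It uses plain Levenshtein distance with a maximum of 2 edits so that "abcdegf" still counts as "close", and it generates a few hundred near-duplicate terms rather than 1000+.

```python
from string import ascii_lowercase
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def expand_fuzzy(query: str, indexed_terms: Iterable[str],
                 max_edits: int = 2, max_expansions: int = 50) -> List[str]:
    """Hypothetical expansion: scan terms in sorted order and stop
    as soon as max_expansions "close" terms have been collected."""
    expanded: List[str] = []
    for term in sorted(indexed_terms):
        if edit_distance(query, term) <= max_edits:
            expanded.append(term)
            if len(expanded) == max_expansions:
                break
    return expanded


# Small index: fewer than 50 terms, 4 of them close to the query.
small_index = {"abcdeff", "abcdegg", "abcdegf", "zbcdefg", "solr", "lucene"}
before = expand_fuzzy("abcdefg", small_index)

# A handful of new documents adds hundreds of terms that are also close
# to the query and sort ahead of "zbcdefg".
new_terms = {"abcde" + x + y for x in ascii_lowercase for y in ascii_lowercase}
after = expand_fuzzy("abcdefg", small_index | new_terms)

print(before)               # ['abcdeff', 'abcdegf', 'abcdegg', 'zbcdefg']
print(len(after))           # 50
print("zbcdefg" in after)   # False
```

In this model the term "zbcdefg", and every document that matched only through it, silently drops out of the result set once the index holds more than 50 close terms, even though no document was removed.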
> Fuzzy / Faceting results are changed after ingestion of documents past a
> certain number
> ----------------------------------------------------------------------------------------
>
> Key: SOLR-4824
> URL: https://issues.apache.org/jira/browse/SOLR-4824
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.2, 4.3
> Environment: Ubuntu 12.04 LTS 12.04.2
> jre1.7.0_17
> jboss-as-7.1.1.Final
> Reporter: Lakshmi Venkataswamy
>
> In upgrading from SOLR 3.6 to 4.2/4.3 and comparing results on fuzzy queries,
> I found that after a certain number of documents were ingested the fuzzy
> query returned a drastically lower number of results. We have approximately
> 18,000 documents per day and after ingesting approximately 40 days of
> documents, the next incremental day of documents results in a lower number of
> results for a fuzzy search.
> The query :
> http://10.100.1.xx:8080/solr/corex/select?q=cc:worde~1&facet=on&facet.field=date&fl=date&facet.sort
> produces the following result before the threshold is crossed
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2349</int><lst name="params"><str
> name="facet">on</str><str name="fl">date</str><str name="facet.sort"/>
> <str name="q">cc:worde~1</str><str
> name="facet.field">date</str></lst></lst><result name="response"
> numFound="362803" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst
> name="facet_fields"><lst name="date">
> <int name="2012-12-31">2866</int>
> <int name="2013-01-01">11372</int>
> <int name="2013-01-02">11514</int>
> <int name="2013-01-03">12015</int>
> <int name="2013-01-04">11746</int>
> <int name="2013-01-05">10853</int>
> <int name="2013-01-06">11053</int>
> <int name="2013-01-07">11815</int>
> <int name="2013-01-08">11427</int>
> <int name="2013-01-09">11475</int>
> <int name="2013-01-10">11461</int>
> <int name="2013-01-11">12058</int>
> <int name="2013-01-12">11335</int>
> <int name="2013-01-13">12039</int>
> <int name="2013-01-14">12064</int>
> <int name="2013-01-15">12234</int>
> <int name="2013-01-16">12545</int>
> <int name="2013-01-17">11766</int>
> <int name="2013-01-18">12197</int>
> <int name="2013-01-19">11414</int>
> <int name="2013-01-20">11633</int>
> <int name="2013-01-21">12863</int>
> <int name="2013-01-22">12378</int>
> <int name="2013-01-23">11947</int>
> <int name="2013-01-24">11822</int>
> <int name="2013-01-25">11882</int>
> <int name="2013-01-26">10474</int>
> <int name="2013-01-27">11051</int>
> <int name="2013-01-28">11776</int>
> <int name="2013-01-29">11957</int>
> <int name="2013-01-30">11260</int>
> <int name="2013-01-31">8511</int>
> </lst></lst><lst name="facet_dates"/><lst
> name="facet_ranges"/></lst></response>
> Once the threshold of 40 days of ingested documents is crossed, the results
> drop as shown below for the same query
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2</int><lst name="params"><str
> name="facet">on</str><str name="fl">date</str><str name="facet.sort"/><str
> name="q">cc:worde~1</str><str name="facet.field">date</str></lst></lst>
> <result name="response" numFound="1338" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst
> name="facet_fields"><lst name="date">
> <int name="2012-12-31">0</int>
> <int name="2013-01-01">41</int>
> <int name="2013-01-02">21</int>
> <int name="2013-01-03">24</int>
> <int name="2013-01-04">19</int>
> <int name="2013-01-05">9</int>
> <int name="2013-01-06">11</int>
> <int name="2013-01-07">17</int>
> <int name="2013-01-08">14</int>
> <int name="2013-01-09">24</int>
> <int name="2013-01-10">43</int>
> <int name="2013-01-11">14</int>
> <int name="2013-01-12">52</int>
> <int name="2013-01-13">57</int>
> <int name="2013-01-14">25</int>
> <int name="2013-01-15">17</int>
> <int name="2013-01-16">34</int>
> <int name="2013-01-17">11</int>
> <int name="2013-01-18">16</int>
> <int name="2013-01-19">121</int>
> <int name="2013-01-20">33</int>
> <int name="2013-01-21">26</int>
> <int name="2013-01-22">59</int>
> <int name="2013-01-23">27</int>
> <int name="2013-01-24">10</int>
> <int name="2013-01-25">9</int>
> <int name="2013-01-26">6</int>
> <int name="2013-01-27">16</int>
> <int name="2013-01-28">11</int>
> <int name="2013-01-29">15</int>
> <int name="2013-01-30">21</int>
> <int name="2013-01-31">109</int>
> <int name="2013-02-01">11</int>
> <int name="2013-02-02">7</int>
> <int name="2013-02-03">10</int>
> <int name="2013-02-04">8</int>
> <int name="2013-02-05">13</int>
> <int name="2013-02-06">75</int>
> <int name="2013-02-07">77</int>
> <int name="2013-02-08">31</int>
> <int name="2013-02-09">35</int>
> <int name="2013-02-10">22</int>
> <int name="2013-02-11">18</int>
> <int name="2013-02-12">11</int>
> <int name="2013-02-13">68</int>
> <int name="2013-02-14">40</int>
> </lst></lst><lst name="facet_dates"/><lst
> name="facet_ranges"/></lst></response>
> I have also tested this with different months of data and have seen the same
> issue around the number of documents.