[ 
https://issues.apache.org/jira/browse/SOLR-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663483#comment-13663483
 ] 

Hoss Man commented on SOLR-4824:
--------------------------------

I'm not very familiar with the FuzzyQuery code in question, but I believe what 
Jack is referring to is a limit on the number of _terms_ that a fuzzy query will 
consider when it scans the indexed terms (via an automaton, I think?) looking 
for terms within a given edit distance of the input.

So it's not an increase in the number of documents that causes the results to 
change; it's an increase in the number of terms that are "close" to the term 
used in the fuzzy query.

{panel:title=Hoss's Uninformed example/speculation}
Assume for a moment that you have a small index where there are fewer than 50 
terms in the "text" field, and you ask for a fuzzy query matching "abcdefg". 
The list of "close" terms might be...

* abcdeff
* abcdegg
* abcdegf
* zbcdefg

...and there may be a total of 100 documents matching those 4 terms -- 1/2 of 
those matches may be because of the last term ("zbcdefg")

If you index a handful of additional documents, but those documents contain 
1000+ new terms in the "text" field which are very "close" to the input term, 
then the next time you do the same fuzzy query, the expanded query might 
become...

* abcdeff
* abcdegg
* ...48 more terms that start with "abcd..."

And "zbcdefg" will be excluded from the expanded query, because the expansion 
code will stop looking for additional terms as soon as it finds 50 that are 
"close".

So now you will get results based on this new expansion, which may be fewer 
documents than were previously found.
{panel}
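
To make the speculation above concrete, here is a toy simulation in Python. It is not the Lucene code: the helper names (within_one_edit, expand, matching_docs), the made-up postings, and the "scan in sorted order and stop at 50" behaviour are all assumptions that model only what the panel describes. The real implementation may select its terms differently.

```python
from string import ascii_lowercase


def within_one_edit(a, b):
    """True if a and b differ by at most one substitution, insertion,
    deletion, or swap of two adjacent characters."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # single substitution
        # Otherwise only an adjacent transposition can still qualify.
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    return a[i:] == b[i + 1:]  # one extra character in b


def expand(query, terms, max_expansions=50):
    """Hypothetical expansion: walk the terms in sorted order and stop
    as soon as max_expansions close terms have been collected."""
    expanded = []
    for term in sorted(terms):
        if within_one_edit(query, term):
            expanded.append(term)
            if len(expanded) == max_expansions:
                break
    return expanded


def matching_docs(query, postings, max_expansions=50):
    """Union of the (made-up) document ids of every expanded term."""
    docs = set()
    for term in expand(query, postings, max_expansions):
        docs |= postings[term]
    return docs


# Small index: 4 close terms, 100 matching docs, half from "zbcdefg".
small = {
    "abcdeff": set(range(0, 20)),
    "abcdegg": set(range(20, 40)),
    "abcdegf": set(range(40, 50)),
    "zbcdefg": set(range(50, 100)),
    "unrelated": {0, 1},
}

# Grown index: five new documents that add dozens of close terms,
# all of which sort before "zbcdefg".
grown = {term: set(docs) for term, docs in small.items()}
new_docs = {100, 101, 102, 103, 104}
for c in ascii_lowercase:
    for term in ("abcde" + c + "g", "abcdef" + c, "abcdefg" + c):
        if term != "abcdefg":
            grown.setdefault(term, set()).update(new_docs)

print(expand("abcdefg", small))
print(len(matching_docs("abcdefg", small)))
print("zbcdefg" in expand("abcdefg", grown))
print(len(matching_docs("abcdefg", grown)))
```

In this toy, the grown index contains more documents, yet the same query matches fewer of them, because the cap is reached before "zbcdefg" is ever considered.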



                
> Fuzzy / Faceting results are changed after ingestion of documents past a 
> certain number 
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-4824
>                 URL: https://issues.apache.org/jira/browse/SOLR-4824
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.2, 4.3
>         Environment: Ubuntu 12.04 LTS 12.04.2 
> jre1.7.0_17
> jboss-as-7.1.1.Final
>            Reporter: Lakshmi Venkataswamy
>
> In upgrading from SOLR 3.6 to 4.2/4.3 and comparing results on fuzzy queries, 
> I found that after a certain number of documents were ingested, the fuzzy 
> query returned a drastically lower number of results.  We have approximately 
> 18,000 documents per day, and after ingesting approximately 40 days of 
> documents, the next incremental day of documents results in a lower number of 
> results for a fuzzy search.
> The query :  
> http://10.100.1.xx:8080/solr/corex/select?q=cc:worde~1&facet=on&facet.field=date&fl=date&facet.sort
> produces the following result before the threshold is crossed
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2349</int><lst name="params"><str 
> name="facet">on</str><str name="fl">date</str><str name="facet.sort"/>
> <str name="q">cc:worde~1</str><str 
> name="facet.field">date</str></lst></lst><result name="response" 
> numFound="362803" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst 
> name="facet_fields"><lst name="date">
> <int name="2012-12-31">2866</int>
> <int name="2013-01-01">11372</int>
> <int name="2013-01-02">11514</int>
> <int name="2013-01-03">12015</int>
> <int name="2013-01-04">11746</int>
> <int name="2013-01-05">10853</int>
> <int name="2013-01-06">11053</int>
> <int name="2013-01-07">11815</int>
> <int name="2013-01-08">11427</int>
> <int name="2013-01-09">11475</int>
> <int name="2013-01-10">11461</int>
> <int name="2013-01-11">12058</int>
> <int name="2013-01-12">11335</int>
> <int name="2013-01-13">12039</int>
> <int name="2013-01-14">12064</int>
> <int name="2013-01-15">12234</int>
> <int name="2013-01-16">12545</int>
> <int name="2013-01-17">11766</int>
> <int name="2013-01-18">12197</int>
> <int name="2013-01-19">11414</int>
> <int name="2013-01-20">11633</int>
> <int name="2013-01-21">12863</int>
> <int name="2013-01-22">12378</int>
> <int name="2013-01-23">11947</int>
> <int name="2013-01-24">11822</int>
> <int name="2013-01-25">11882</int>
> <int name="2013-01-26">10474</int>
> <int name="2013-01-27">11051</int>
> <int name="2013-01-28">11776</int>
> <int name="2013-01-29">11957</int>
> <int name="2013-01-30">11260</int>
> <int name="2013-01-31">8511</int>
> </lst></lst><lst name="facet_dates"/><lst 
> name="facet_ranges"/></lst></response>
> Once the threshold of 40 days of ingested documents is crossed, the results 
> drop as shown below for the same query
> <response><lst name="responseHeader">
> <int name="status">0</int><int name="QTime">2</int><lst name="params"><str 
> name="facet">on</str><str name="fl">date</str><str name="facet.sort"/><str 
> name="q">cc:worde~1</str><str name="facet.field">date</str></lst></lst>
> <result name="response" numFound="1338" start="0"></result>
> <lst name="facet_counts"><lst name="facet_queries"/><lst 
> name="facet_fields"><lst name="date">
> <int name="2012-12-31">0</int>
> <int name="2013-01-01">41</int>
> <int name="2013-01-02">21</int>
> <int name="2013-01-03">24</int>
> <int name="2013-01-04">19</int>
> <int name="2013-01-05">9</int>
> <int name="2013-01-06">11</int>
> <int name="2013-01-07">17</int>
> <int name="2013-01-08">14</int>
> <int name="2013-01-09">24</int>
> <int name="2013-01-10">43</int>
> <int name="2013-01-11">14</int>
> <int name="2013-01-12">52</int>
> <int name="2013-01-13">57</int>
> <int name="2013-01-14">25</int>
> <int name="2013-01-15">17</int>
> <int name="2013-01-16">34</int>
> <int name="2013-01-17">11</int>
> <int name="2013-01-18">16</int>
> <int name="2013-01-19">121</int>
> <int name="2013-01-20">33</int>
> <int name="2013-01-21">26</int>
> <int name="2013-01-22">59</int>
> <int name="2013-01-23">27</int>
> <int name="2013-01-24">10</int>
> <int name="2013-01-25">9</int>
> <int name="2013-01-26">6</int>
> <int name="2013-01-27">16</int>
> <int name="2013-01-28">11</int>
> <int name="2013-01-29">15</int>
> <int name="2013-01-30">21</int>
> <int name="2013-01-31">109</int>
> <int name="2013-02-01">11</int>
> <int name="2013-02-02">7</int>
> <int name="2013-02-03">10</int>
> <int name="2013-02-04">8</int>
> <int name="2013-02-05">13</int>
> <int name="2013-02-06">75</int>
> <int name="2013-02-07">77</int>
> <int name="2013-02-08">31</int>
> <int name="2013-02-09">35</int>
> <int name="2013-02-10">22</int>
> <int name="2013-02-11">18</int>
> <int name="2013-02-12">11</int>
> <int name="2013-02-13">68</int>
> <int name="2013-02-14">40</int>
> </lst></lst><lst name="facet_dates"/><lst 
> name="facet_ranges"/></lst></response>
> I have also tested this with different months of data and have seen the same 
> issue at around the same number of documents.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
