[
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887130#action_12887130
]
Stephen Weiss commented on SOLR-236:
------------------------------------
Oh Martijn, I hope you're reading. After a few months of calm we had some
OOMs again on our production servers. So I tried your latest patch with the
Solr 1.4.1 release, since it bundles various fixes for memory leaks.
The performance difference is great - far less CPU and RAM usage all around.
But there's a catch! Something was introduced to change the "numFound" that is
reported. After we noticed this, I found your comment and removed these lines
from NonAdjacentDocumentCollapser.java:
+ if (collapsedGroupPriority.size() > maxNumberOfGroups) {
+   NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
+   collapsedDocs.remove(inferiorGroup.fieldValue);
+   collapsedGroupPriority.remove(inferiorGroup);
+ }
We did *NOT* remove line 99 as suggested because this caused compiler problems:
[javac] /home/sweiss/apache-solr-1.4.1/src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java:99: cannot find symbol
[javac] symbol : variable collapseDoc
[javac] location: class org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser
[javac] if (collapseDoc == null) {
After doing this, I noticed a *huge* performance drop - far worse than what we
had even with 1.4 and your patch from December. Searches were taking >10s to
complete (previously, the worst searches took just over 1s). So, I went back
and tried to find a way to get the "numFound" through other means - and I
figured I could just facet on the same field we're collapsing on, and then
count the number of facets. Looks good - the count of the facets is the right
count, and it would appear to be working.
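To make the facet workaround concrete, here is a rough sketch of the idea rather than anything taken from the patch. It assumes the collapse field is single-valued and that every matching document has a value for it, so that the number of facet values with a non-zero count equals the number of collapsed groups. The request parameters (facet, facet.field, facet.limit, facet.mincount, rows) are standard Solr parameters; the class name, method names and the field name "site" are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetGroupCount {

    /** Request parameters that ask Solr for every value of the collapse field, with no documents. */
    static Map<String, String> facetParams(String query, String collapseField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("rows", "0");               // only the facet counts are needed
        params.put("facet", "true");
        params.put("facet.field", collapseField);
        params.put("facet.limit", "-1");       // no cap on the number of facet values
        params.put("facet.mincount", "1");     // skip values with no matching documents
        return params;
    }

    /** Counts the facet values that matched at least one document. */
    static int countGroups(Map<String, Integer> facetCounts) {
        int groups = 0;
        for (int count : facetCounts.values()) {
            if (count > 0) {
                groups++;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical facet counts, as they might come back for the field "site".
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a.example", 12);
        counts.put("b.example", 3);
        counts.put("c.example", 1);
        System.out.println(facetParams("orange", "site"));
        System.out.println("collapsed numFound = " + countGroups(counts));
    }
}
```

One caveat: with facet.limit=-1 Solr returns every distinct value, so on a high-cardinality field this request can itself be expensive.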
But, there's a snag. It seems that the results being returned by your patch,
unaltered, are incorrect. For example - my search for "orange" returns 7200
collapsed results, either using the real numFound from the altered patch, or
using the facet method with the new patch. This equates to 160 pages of
results. However, with the unaltered patch, if we actually try to retrieve
page 158, or really any result over 130 or so, we get the exact same results.
With the altered patch (removing those few lines), page 158 actually is page
158. Basically, it seems like your patch throws away good results - and I get
the feeling that it throws away those good results somewhere in those 5 lines.
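To illustrate that suspicion, here is a simplified, stand-alone sketch of what a bounded group set of this shape does. It is not the patch's code: the Group class, the score-based ordering and the paging helper are assumptions for illustration, and the message does not show how maxNumberOfGroups is computed. The only point is that once the lowest-priority group is evicted on overflow, at most maxGroups groups survive, so pages past that limit have nothing distinct left to show.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class BoundedGroupSketch {

    /** Hypothetical stand-in for a collapse group: one field value and its best score. */
    static final class Group {
        final String fieldValue;
        final double score;

        Group(String fieldValue, double score) {
            this.fieldValue = fieldValue;
            this.score = score;
        }
    }

    /** Keeps at most maxGroups groups, evicting the lowest-scoring one on overflow. */
    static List<String> retain(List<Group> groups, int maxGroups) {
        TreeSet<Group> priority = new TreeSet<>(
                Comparator.comparingDouble((Group g) -> g.score)
                        .thenComparing((Group g) -> g.fieldValue));
        for (Group g : groups) {
            priority.add(g);
            if (priority.size() > maxGroups) {
                // Same shape as the removed lines: drop the "inferior" group.
                priority.remove(priority.first());
            }
        }
        List<String> best = new ArrayList<>();
        for (Group g : priority.descendingSet()) {
            best.add(g.fieldValue);
        }
        return best;
    }

    /** Returns one 1-based page of the retained groups, or an empty list past the end. */
    static List<String> page(List<String> retained, int pageNumber, int pageSize) {
        int from = Math.min((pageNumber - 1) * pageSize, retained.size());
        int to = Math.min(from + pageSize, retained.size());
        return new ArrayList<>(retained.subList(from, to));
    }

    public static void main(String[] args) {
        List<Group> groups = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            groups.add(new Group("g" + i, i));
        }
        List<String> kept = retain(groups, 4);
        System.out.println("retained: " + kept);
        System.out.println("page 1:   " + page(kept, 1, 2));
        System.out.println("page 3:   " + page(kept, 3, 2));
    }
}
```

In this toy version a page past the limit comes back empty rather than repeating earlier results, so it does not reproduce the exact symptom described above, only the loss of groups beyond the cap.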
Now, I'm stuck. I really don't know what to do... I don't want the OOMs to
continue, but it looks like they will regardless because both the old version
(1.4 + December patch) and the new, altered patched version are using too many
resources. But if I used the latest patch without changing it, I'm not getting
the right results all the way through.
Is there anything we can do? I appreciate your help... :-)
> Field collapsing
> ----------------
>
> Key: SOLR-236
> URL: https://issues.apache.org/jira/browse/SOLR-236
> Project: Solr
> Issue Type: New Feature
> Components: search
> Affects Versions: 1.3
> Reporter: Emmanuel Keller
> Assignee: Shalin Shekhar Mangar
> Fix For: Next
>
> Attachments: collapsing-patch-to-1.3.0-dieter.patch,
> collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch,
> collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java,
> field-collapse-3.patch, field-collapse-4-with-solrj.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
> field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
> field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch,
> field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
> field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
> NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java,
> quasidistributed.additional.patch, SOLR-236-1_4_1.patch,
> SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
> SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch,
> SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch,
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch,
> SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch,
> SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given
> field to a single entry in the result set. Site collapsing is a special case
> of this, where all results for a given web site is collapsed into one or two
> entries in the result set, typically with an associated "more documents from
> this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before
> collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
--
This message is automatically generated by JIRA.