The OpenSearchServlet hardcodes 'site' as the field used to dedup search results. I'd like to be able to dedup search results on fields other than 'site'. For example, we have collections that may have multiple instances of a URL in the index. For such collections, sometimes we want queries to turn up all instances of the URL in the search results. For other query types, we only want one instance of a particular URL to show in the search results. I can prevent the duplicates from showing by running a query with search results deduped on the 'url' field.
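
To make the intent concrete, here is a minimal sketch of what I mean by deduping on an arbitrary field with a per-value cap. It is not the NutchBean code: the class name DedupSketch and the use of plain maps as hits are assumptions made only for illustration. It treats a cap of zero as "no deduping", which is how I read the "or zero" in the NutchBean javadoc.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Nutch or of the attached patch. */
final class DedupSketch {

    /**
     * Keeps at most maxHitsPerDup hits for each distinct value of dedupField,
     * preserving the original order. A cap of zero (or less) keeps everything.
     */
    static List<Map<String, String>> dedup(List<Map<String, String>> hits,
                                           String dedupField,
                                           int maxHitsPerDup) {
        if (maxHitsPerDup <= 0) {
            return new ArrayList<>(hits);          // deduping disabled
        }
        Map<String, Integer> seen = new HashMap<>(); // field value -> hits kept so far
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(dedupField);
            int count = seen.getOrDefault(value, 0);
            if (count < maxHitsPerDup) {
                kept.add(hit);
                seen.put(value, count + 1);
            }
        }
        return kept;
    }
}
```

With dedupField set to 'url' and a cap of 1, an index holding the same URL twice would show that URL once; with a cap of 0 both copies come back.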

Attached is a suggested patch. It introduces two new query parameters, 'dedupField' and 'hitsPerDup'. 'dedupField' specifies the field to dedup on and defaults to 'site'. 'hitsPerDup' subsumes 'hitsPerSite': if 'hitsPerSite' is present and there is no 'hitsPerDup' in the query, the 'hitsPerSite' value is used as the 'hitsPerDup' value.
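
The fallback rules above can be sketched in isolation as follows. This is a simplified stand-in for the servlet code in the patch, not a copy of it: the DedupParams class and the plain Map of parameters are assumptions made so the example runs without a servlet container. The defaults ('site' and 2) are the ones the patch uses.

```java
import java.util.Map;

/** Hypothetical helper; the patch does this inline in OpenSearchServlet. */
final class DedupParams {
    final String dedupField;
    final int hitsPerDup;

    DedupParams(String dedupField, int hitsPerDup) {
        this.dedupField = dedupField;
        this.hitsPerDup = hitsPerDup;
    }

    /** Resolves the dedup settings from raw query parameters (name -> value). */
    static DedupParams fromQuery(Map<String, String> params) {
        // A missing or empty 'dedupField' falls back to 'site'.
        String field = params.get("dedupField");
        if (field == null || field.isEmpty()) {
            field = "site";
        }

        // 'hitsPerDup' wins; the older 'hitsPerSite' is used only when
        // 'hitsPerDup' is missing or empty. The default is 2.
        String raw = params.get("hitsPerDup");
        if (raw == null || raw.isEmpty()) {
            raw = params.get("hitsPerSite");
        }
        int hits = 2;
        if (raw != null && !raw.isEmpty()) {
            hits = Integer.parseInt(raw); // throws NumberFormatException on bad input
        }
        return new DedupParams(field, hits);
    }
}
```

So a query carrying only hitsPerSite=5 resolves to field 'site' with 5 hits per dup, while one carrying both hitsPerDup=1 and hitsPerSite=5 resolves to 1.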

If the patch is acceptable, should I work up a matching patch for search.jsp?

Good stuff,
St.Ack

P.S. The query parameter names are taken from the names of the parameters on the NutchBean search method. Perhaps 'hitsPerDup' should be 'hitsPerDedup'?
Index: src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- src/java/org/apache/nutch/searcher/NutchBean.java	(revision 179304)
+++ src/java/org/apache/nutch/searcher/NutchBean.java	(working copy)
@@ -154,7 +154,8 @@
    * 
    * @param query query
    * @param numHits number of requested hits
-   * @param maxHitsPerDup the maximum hits returned with matching values, or zero
+   * @param maxHitsPerDup the maximum hits returned with matching values, or
+   * zero
    * @return Hits the matching hits
    * @throws IOException
    */
@@ -171,7 +172,8 @@
    * 
    * @param query query
    * @param numHits number of requested hits
-   * @param maxHitsPerDup the maximum hits returned with matching values, or zero
+   * @param maxHitsPerDup the maximum hits returned with matching values, or
+   * zero
    * @param dedupField field name to check for duplicates
    * @return Hits the matching hits
    * @throws IOException
@@ -189,8 +191,11 @@
    * 
    * @param query query
    * @param numHits number of requested hits
-   * @param maxHitsPerDup the maximum hits returned with matching values, or zero
+   * @param maxHitsPerDup the maximum hits returned with matching values, or
+   * zero
    * @param dedupField field name to check for duplicates
+   * @param sortField Field to sort on (or null if no sorting).
+   * @param reverse True if we are to reverse sort by <code>sortField</code>.
    * @return Hits the matching hits
    * @throws IOException
    */
Index: src/java/org/apache/nutch/searcher/OpenSearchServlet.java
===================================================================
--- src/java/org/apache/nutch/searcher/OpenSearchServlet.java	(revision 179304)
+++ src/java/org/apache/nutch/searcher/OpenSearchServlet.java	(working copy)
@@ -93,23 +93,42 @@
     if (hitsString != null)
       hitsPerPage = Integer.parseInt(hitsString);
 
-    int hitsPerSite = 2;                          // max hits per site
-    String hitsPerSiteString = request.getParameter("hitsPerSite");
-    if (hitsPerSiteString != null)
-      hitsPerSite = Integer.parseInt(hitsPerSiteString);
-
     String sort = request.getParameter("sort");
     boolean reverse =
       sort!=null && "true".equals(request.getParameter("reverse"));
 
+    // De-Duplicate handling.  Look for duplicates field and for how many
+    // duplicates per results to return. Default duplicates field is 'site'
+    // and duplicates per results default is '2'.
+    String dedupField = request.getParameter("dedupField");
+    if (dedupField == null || dedupField.length() == 0) {
+        dedupField = "site";
+    }
+    int hitsPerDup = 2;
+    String hitsPerDupString = request.getParameter("hitsPerDup");
+    if (hitsPerDupString != null && hitsPerDupString.length() > 0) {
+        hitsPerDup = Integer.parseInt(hitsPerDupString);
+    } else {
+        // If 'hitsPerSite' present, use that value.
+        String hitsPerSiteString = request.getParameter("hitsPerSite");
+        if (hitsPerSiteString != null && hitsPerSiteString.length() > 0) {
+            hitsPerDup = Integer.parseInt(hitsPerSiteString);
+        }
+    }
+     
+    // Make up query string for use later drawing the 'rss' logo.
+    String params = "&hitsPerPage=" + hitsPerPage +
+        (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "")) +
+        (dedupField == null ? "" : "&dedupField=" + dedupField);
+
     Query query = Query.parse(queryString);
     bean.LOG.info("query: " + queryString);
 
     // execute the query
     Hits hits;
     try {
-      hits = bean.search(query, start + hitsPerPage, hitsPerSite, "site",
-                         sort, reverse);
+      hits = bean.search(query, start + hitsPerPage, hitsPerDup, dedupField,
+          sort, reverse);
     } catch (IOException e) {
       bean.LOG.log(Level.WARNING, "Search Error", e);
       hits = new Hits(0,new Hit[0]);	
@@ -127,8 +146,6 @@
 
     String requestUrl = request.getRequestURL().toString();
     String base = requestUrl.substring(0, requestUrl.lastIndexOf('/'));
-    String params = "&hitsPerPage="+hitsPerPage
-      +(sort==null ? "" : "&sort="+sort+(reverse?"&reverse=true":""));
       
 
     try {
@@ -151,7 +168,7 @@
               base+"/search.jsp"
               +"?query="+urlQuery
               +"&start="+start
-              +"&hitsPerSite="+hitsPerSite
+              +"&hitsPerDup="+hitsPerDup
               +params);
 
       addNode(doc, channel, "opensearch", "totalResults", ""+hits.getTotal());
@@ -166,14 +183,14 @@
         addNode(doc, channel, "nutch", "nextPage", requestUrl
                 +"?query="+urlQuery
                 +"&start="+end
-                +"&hitsPerSite="+hitsPerSite
+                +"&hitsPerDup="+hitsPerDup
                 +params);
       }
 
       if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) {
         addNode(doc, channel, "nutch", "showAllHits", requestUrl
                 +"?query="+urlQuery
-                +"&hitsPerSite="+0
+                +"&hitsPerDup="+0
                 +params);
       }
 
