[ http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12414441 ]

Stefan Groschupf commented on NUTCH-288:
----------------------------------------

Hi Stefan, 
>Also it does go back page by page until you get to the last result-page
Isn't it possible to calculate the last page directly instead of using a while 
loop to find it?
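
As an illustration only, the last page could be derived in one step with a 
ceiling division over the total hit count. The class and parameter names below 
(LastPage, totalHits, hitsPerPage) are made up for this sketch and are not 
taken from the attached patch:

// Illustrative sketch only: compute the last page directly from the total
// hit count via ceiling division, instead of walking forward page by page.
public final class LastPage {
  static int lastPage(long totalHits, int hitsPerPage) {
    if (totalHits <= 0 || hitsPerPage <= 0) {
      return 1;                                   // at least one (empty) page
    }
    return (int) ((totalHits + hitsPerPage - 1) / hitsPerPage);
  }

  public static void main(String[] args) {
    System.out.println(lastPage(7763, 10));       // 777, the figure from the report below
  }
}

Of course, with hitsPerSite dedup applied after the search, the real last page 
can be smaller than this estimate, which is exactly the problem the quoted 
report describes.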


> hitsPerSite-functionality "flawed": problems writing a page-navigation
> ----------------------------------------------------------------------
>
>          Key: NUTCH-288
>          URL: http://issues.apache.org/jira/browse/NUTCH-288
>      Project: Nutch
>         Type: Bug
>   Components: web gui
>     Versions: 0.8-dev
>     Reporter: Stefan Neufeind
>  Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The per-site deduplication functionality (hitsPerSite = 3) leads to problems 
> when trying to offer page navigation (e.g. allowing the user to jump to 
> page 10), because dedup is done after fetching.
> RSS reports a maximum of 7763 documents (that is without dedup!), and I set 
> it to display 10 items per page. My "naive" approach was to estimate that I 
> have 7763/10 = 777 pages. But already when moving to page 3 I got no more 
> search results (I guess because of dedup), and when moving to page 10 I got 
> an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
>         at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
>         at org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>         at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
>         at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
>         at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
>         at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
>         at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
>         at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
>         at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
>         at java.lang.Thread.run(Thread.java:595)
> The only workaround I see for the moment: fetch the RSS without 
> deduplication, dedup myself, and cache the RSS result to improve performance. 
> But a cleaner solution would IMHO be nice. Is there a performant way of doing 
> deduplication while knowing for sure how many documents are available to 
> view? Admittedly, that would mean deduplicating all search results first ...
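
To make the described workaround concrete, here is a rough, hypothetical 
sketch of doing the per-site dedup on the client side and building the page 
navigation from the deduplicated total, with the requested slice clamped so 
that a too-large page number yields an empty page instead of the 
NegativeArraySizeException shown above. All names (DedupPager, SiteHit, 
dedupBySite, page) are invented for this sketch; this is neither the attached 
patch nor the Nutch API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class DedupPager {

  // Minimal stand-in for one undeduplicated RSS/OpenSearch result;
  // only the site field matters for the per-site cap.
  static final class SiteHit {
    final String site;
    final String url;
    SiteHit(String site, String url) { this.site = site; this.url = url; }
  }

  // Keep at most hitsPerSite results per site, preserving ranking order
  // (the client-side equivalent of the hitsPerSite dedup described above).
  static List<SiteHit> dedupBySite(List<SiteHit> hits, int hitsPerSite) {
    Map<String, Integer> perSiteCount = new HashMap<String, Integer>();
    List<SiteHit> kept = new ArrayList<SiteHit>();
    for (SiteHit hit : hits) {
      Integer count = perSiteCount.get(hit.site);
      int c = (count == null) ? 0 : count.intValue();
      if (c < hitsPerSite) {
        kept.add(hit);
        perSiteCount.put(hit.site, c + 1);
      }
    }
    return kept;
  }

  // Return one page of the deduplicated hits. Clamping start and end keeps an
  // out-of-range page request from turning into a negative-sized slice, which
  // is the kind of failure the stack trace above shows inside Hits.getHits.
  static List<SiteHit> page(List<SiteHit> deduped, int pageNumber, int hitsPerPage) {
    int start = Math.max(0, (pageNumber - 1) * hitsPerPage);
    int end = Math.min(deduped.size(), start + hitsPerPage);
    if (start >= end) {
      return new ArrayList<SiteHit>();  // past the last page: empty, not an exception
    }
    return deduped.subList(start, end);
  }
}

Building the navigation from dedupBySite(...).size() rather than the 
undeduplicated RSS total means a page is only offered if it actually exists; 
the cost, as noted above, is deduplicating all results up front, which is why 
caching the deduplicated list is part of the workaround.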
