[
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reopened NUTCH-2011:
------------------------------------
Sorry, but this needs some rework:
- after 35,000+ fetched pages with the default max. heap size of 1000M, the
fetcher becomes slow and throws mostly parser timeouts and caught OOM
exceptions. Only small HTML pages with few outlinks per page were crawled; the
limit is reached even sooner with many overlong outlinks or large PDF documents.
- why an in-memory "database" of page-related information (URL, title, outlinks
+ anchor texts)?
-- all of this information is already available in CrawlDb, LinkDb, and the
segments
-- MapReduce job counters provide instant progress information (e.g., the
number of fetched pages)
-- if required, a queue of limited total size should be used instead
- in any case, this feature should be optional and off by default if
NutchServer is not used
- "reporting" to FetchNodeDb is off if fetcher.parse is false (the default)? Is
this intended? Construction of FetchNodes is then useless work.
- no traces to System.out: "FetchNodeDb : putting node ..."
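To illustrate the "queue of limited total size" suggested above, here is a
minimal sketch of a bounded, oldest-evicting buffer that would keep heap usage
constant regardless of crawl size. The names BoundedFetchBuffer and the use of
plain String records are hypothetical, not part of Nutch:

```java
import java.util.ArrayDeque;
import java.util.Collection;

/**
 * Sketch of a bounded buffer for fetch-node records. When the buffer is
 * full, the oldest record is evicted, so memory stays bounded no matter
 * how many pages are fetched.
 */
public class BoundedFetchBuffer {
  private final int capacity;
  private final ArrayDeque<String> records = new ArrayDeque<>();

  public BoundedFetchBuffer(int capacity) {
    this.capacity = capacity;
  }

  /** Add a record, dropping the oldest one if the buffer is full. */
  public synchronized void add(String record) {
    if (records.size() >= capacity) {
      records.pollFirst(); // evict oldest to keep heap usage bounded
    }
    records.addLast(record);
  }

  /** Copy of the current contents, oldest first. */
  public synchronized Collection<String> snapshot() {
    return new ArrayDeque<>(records);
  }

  public synchronized int size() {
    return records.size();
  }
}
```

A REST endpoint could then page over snapshot() while the fetcher keeps
writing, without the unbounded growth seen with the current in-memory map.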
> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bw in large crawls.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)