[
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reopened NUTCH-2011:
------------------------------------
Sorry, but this needs some rework:
- after 35,000+ fetched pages with the default max. heap size of 1000M, the
fetcher becomes slow and throws mostly parser timeouts and caught OOM
exceptions. Only small HTML pages with few outlinks per page were crawled; the
limit is reached even sooner with many overlong outlinks or large PDF documents.
- why an in-memory "database" of page-related information (URL, title, outlinks
+ anchor texts)?
-- all of this information is already available in CrawlDb, LinkDb, and the
segments
-- MapReduce job counters provide instant progress information (e.g., the
number of fetched pages)
-- if required, a queue of limited total size should be used instead
- in any case, this feature should be optional and off by default if
NutchServer is not used
- "reporting" to FetchNodeDb is off if fetcher.parse is false (the default)? Is
this intended? Construction of FetchNodes is then useless work.
- no traces to System.out: "FetchNodeDb : putting node ..."
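To illustrate the "queue of limited total size" suggested above, here is a
minimal sketch of a bounded, oldest-evicting buffer that would keep heap usage
constant regardless of crawl size. The names BoundedFetchBuffer and the use of
plain String records are hypothetical, not part of Nutch:

```java
import java.util.ArrayDeque;
import java.util.Collection;

/**
 * Sketch of a bounded buffer for fetch-node records. When the buffer is
 * full, the oldest record is evicted, so memory stays bounded no matter
 * how many pages are fetched.
 */
public class BoundedFetchBuffer {
  private final int capacity;
  private final ArrayDeque<String> records = new ArrayDeque<>();

  public BoundedFetchBuffer(int capacity) {
    this.capacity = capacity;
  }

  /** Add a record, dropping the oldest one if the buffer is full. */
  public synchronized void add(String record) {
    if (records.size() >= capacity) {
      records.pollFirst(); // evict oldest to keep heap usage bounded
    }
    records.addLast(record);
  }

  /** Copy of the current contents, oldest first. */
  public synchronized Collection<String> snapshot() {
    return new ArrayDeque<>(records);
  }

  public synchronized int size() {
    return records.size();
  }
}
```

A REST endpoint could then page over snapshot() while the fetcher keeps
writing, without the unbounded growth seen with the current in-memory map.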
> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bw in large crawls.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)