[
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547812#comment-14547812
]
Sujen Shah commented on NUTCH-2011:
-----------------------------------
Hi [~wastl-nagel],
Just to add a little to Asitang's reply,
- "fetch round" means one fetch job; this corresponds to "bin/nutch
fetch ...." in the crawl script.
- "greater depth fetch rounds" means longer fetch lists, which correspond to
higher iteration numbers, as set by the noOfRounds parameter when running
the bin/crawl script.
As Asitang mentioned, the FetchNodeDb is used to make a D3 graph (currently a
BFS tree) to show the progress of the crawl. To build the graph, we need the
"round" (or iteration) in which each node was fetched.
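
To illustrate why the round is needed, here is a minimal sketch of turning
fetched nodes into the nested structure a D3 tree layout typically consumes.
It is not the FetchNodeDb code: the record keys "url", "parent" and "round",
and the function name build_tree, are assumptions made up for this example.

```python
from collections import defaultdict


def build_tree(nodes, root_url):
    """Nest fetched nodes under their parents, ordered by fetch round.

    `nodes` is a list of dicts with the hypothetical keys
    'url', 'parent' (None for a seed) and 'round'.
    """
    rounds = {n["url"]: n["round"] for n in nodes}
    children = defaultdict(list)
    # Visit nodes in round order so siblings appear in the order fetched.
    for node in sorted(nodes, key=lambda n: n["round"]):
        parent = node["parent"]
        # Only hang a node below a parent fetched in an earlier round;
        # this keeps the structure a tree even if links point backwards.
        if parent in rounds and rounds[parent] < node["round"]:
            children[parent].append(node["url"])

    def subtree(url):
        return {
            "name": url,
            "round": rounds[url],
            "children": [subtree(child) for child in children[url]],
        }

    return subtree(root_url)
```

Without the round stored per node, the comparison that decides which node
sits at which level of the tree would have nothing to work from.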
There was some initial discussion about modifying the CrawlDb to hold one more
parameter, the round number. But since the FetchNodeDb was created to store
real-time information, the idea of modifying the CrawlDb was dropped.
One point regarding your comment on NUTCH-2015: the FetchNodes are stored
in an enumerated manner so that the client can paginate its requests and
reduce the amount of bandwidth used. This was done to take care of client-side
failures in large crawls. This option is not currently supported by any of
the persistent databases used (CrawlDb/LinkDb, etc.).
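
As a rough sketch of how enumeration enables pagination (not the actual
endpoint code): assume the nodes are held in a map keyed by a sequential
integer id starting at 1. The names get_page and iter_pages, and the start
and count parameters, are hypothetical.

```python
def get_page(fetch_nodes, start, count):
    """Return the nodes with ids in [start, start + count).

    `fetch_nodes` is assumed to be a dict keyed by sequential integer ids.
    """
    return {
        i: fetch_nodes[i]
        for i in range(start, start + count)
        if i in fetch_nodes
    }


def iter_pages(fetch_nodes, page_size):
    """Client-side loop: request one page at a time until a page is empty."""
    start = 1  # assumption: ids start at 1 and are contiguous
    while True:
        page = get_page(fetch_nodes, start, page_size)
        if not page:
            break
        yield page
        start += page_size
```

A client that fails partway through only has to re-request from the last
id it received, instead of transferring the whole node list again.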
> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bandwidth in large crawls.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)