[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

Sujen Shah (JIRA) Mon, 18 May 2015 03:05:45 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547812#comment-14547812
 ]


Sujen Shah commented on NUTCH-2011:
-----------------------------------

Hi [~wastl-nagel], 
Just to add a little to Asitang's reply, 

- "fetch round" means one fetch job; this corresponds to "bin/nutch 
fetch ...." in the crawl script. 

- "greater depth fetch rounds" means longer fetch lists, corresponding to 
higher iteration numbers specified in the noOfRounds parameter while running 
the bin/crawl script. 

As Asitang mentioned, the FetchNodeDb is used to make a D3 graph (currently a 
BFS tree) that shows the progress of the crawl. To build the graph, we need 
the "round" (or iteration) in which each node was fetched. 
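
As a rough illustration of why the round number matters for the graph, the 
sketch below nests hypothetical fetch records into the kind of tree a D3 
hierarchy layout accepts. The field names (url, parent, round) and the 
function build_tree are assumptions made for this example, not the actual 
FetchNodeDb schema. It also assumes each URL appears once and that its parent 
was fetched in an earlier round. 

```python
from collections import defaultdict


def build_tree(records, root_url):
    """Nest fetch records into {"name", "round", "children"} dicts.

    `records` is a list of dicts with the hypothetical keys
    "url", "parent" and "round" (not the real FetchNodeDb fields).
    """
    # Group children under their parent, ordered by the round they were fetched in.
    by_parent = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["round"]):
        by_parent[rec["parent"]].append(rec)

    # Look-up table from URL to the round in which it was fetched.
    rounds = {rec["url"]: rec["round"] for rec in records}

    def node(url):
        return {
            "name": url,
            "round": rounds.get(url, 0),
            "children": [node(child["url"]) for child in by_parent.get(url, [])],
        }

    return node(root_url)
```

Without the round stored on each record, the tree could still be nested by 
parent, but the graph could not show which iteration produced which level. 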

There was some initial discussion about modifying the CrawlDb to hold one more 
parameter, the round number. But since the FetchNodeDb was created to store 
real-time information, the idea of modifying the CrawlDb was dropped. 

One point from your comment on NUTCH-2015: the reason to store the FetchNodes 
in an enumerated manner was so that the client could paginate their requests 
and reduce the amount of bandwidth used. This was done to guard against 
client-side failures in large crawls. This option is not currently supported 
by any of the persistent databases used (CrawlDb, LinkDb, etc.). 
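
To make the pagination idea concrete, here is a minimal sketch of how an 
enumerated store lets a client request bounded ranges. The names get_page and 
fetch_all, and the plain dict standing in for the FetchNodeDb, are assumptions 
for illustration only and do not reflect the actual REST endpoint. The sketch 
also assumes the ids are contiguous and start at 1. 

```python
def get_page(fetch_nodes, start, end):
    """Return (id, node) pairs whose id lies in [start, end], in id order.

    `fetch_nodes` is a dict keyed by a sequential integer id; it stands in
    for the enumerated FetchNodeDb (hypothetical layout).
    """
    return [(i, fetch_nodes[i]) for i in range(start, end + 1) if i in fetch_nodes]


def fetch_all(get_page_fn, page_size):
    """Client-side loop: request fixed-size pages until an empty page comes back."""
    start = 1
    results = []
    while True:
        page = get_page_fn(start, start + page_size - 1)
        if not page:
            break
        results.extend(page)
        start += page_size
    return results
```

Because every page is addressed by an id range, a client that fails partway 
through can resume from the last id it received instead of re-requesting the 
whole set. 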


> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bandwidth in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
