[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547778#comment-14547778
 ] 

Asitang Mishra commented on NUTCH-2011:
---------------------------------------

Hi [~wastl-nagel],
- The answer to your first two questions is yes; your interpretations are
correct.
- Third question: the FetchNodeDb info will be used to build a D3 graph that
shows, in real time, which page is being fetched and, if it was fetched
successfully, which outlinks it generated. We need to output this as a
visualization before the data is written into the segments.
- I agree that we don't need an extra persistence layer, since all the data is
already stored segment-wise, which is the same as "round-wise";
[~chrismattmann] and I had discussed it before.
- A buffer queue is an appealing idea, but we are not using it because we
wanted to make things more RESTful: the user/graph can request pages between
any two indices from the temporary store/NodeDb, or all the data from any
previously updated segment. Also, if the client requests the nodes again after
a failure and the buffer queue no longer has them, we would have to wait for
the round to end and read them from the segment. That said, we could explore
[~wastl-nagel]'s idea, provided some strict or cautionary measures are taken
on the client side :). What do you think, [~chrismattmann] and
[~sujenshah]?
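
To illustrate the trade-off, here is a minimal sketch of an index-addressable
in-memory store, as opposed to a consume-once buffer queue. It is only an
illustration under stated assumptions: the names NodeStoreSketch, Node, add and
range are hypothetical and are not the actual FetchNodeDb code or the Nutch
REST API. The point is that a client can re-request any index range after a
failure, because reading does not remove entries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the actual FetchNodeDb implementation.
class NodeStoreSketch {

    // Hypothetical record of one fetched page and the outlinks it produced.
    static final class Node {
        final String url;
        final int status;
        final List<String> outlinks;

        Node(String url, int status, List<String> outlinks) {
            this.url = url;
            this.status = status;
            this.outlinks = Collections.unmodifiableList(new ArrayList<>(outlinks));
        }
    }

    private final List<Node> nodes = new ArrayList<>();

    // Appends a node and returns the index it was stored at.
    public synchronized int add(Node node) {
        nodes.add(node);
        return nodes.size() - 1;
    }

    public synchronized int size() {
        return nodes.size();
    }

    // Returns a copy of the nodes in [from, to), clamped to the valid range.
    // Reading is non-destructive, so the same range can be requested again.
    public synchronized List<Node> range(int from, int to) {
        int start = Math.max(0, from);
        int end = Math.min(nodes.size(), to);
        if (start >= end) {
            return Collections.emptyList();
        }
        return new ArrayList<>(nodes.subList(start, end));
    }

    public static void main(String[] args) {
        NodeStoreSketch store = new NodeStoreSketch();
        List<String> none = Collections.emptyList();
        store.add(new Node("http://a.example/", 200, Collections.singletonList("http://b.example/")));
        store.add(new Node("http://b.example/", 200, none));
        store.add(new Node("http://c.example/", 404, none));

        // A client that failed mid-way can simply ask for the same page again.
        for (Node n : store.range(1, 3)) {
            System.out.println(n.url + " " + n.status + " outlinks=" + n.outlinks.size());
        }
    }
}
```

A REST endpoint could map hypothetical query parameters such as a start and
end index onto range(from, to); with a queue, the second request for the same
range would come back empty until the segment is written.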

 

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bandwidth in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
