[
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547778#comment-14547778
]
Asitang Mishra commented on NUTCH-2011:
---------------------------------------
Hi [~wastl-nagel],
- The answers to your first two questions are yes; your interpretations are
correct.
- Third question: the FetchNodeDb info will be used to build a D3 graph that
shows, in real time, which page is being fetched and, if it was fetched
properly, which outlinks it generated. We need to output this as a
visualization before the data is written into the segments.
- I agree that we don't need an extra persistence layer, as all the data is
already stored segment-wise, which is the same as "round-wise";
[~chrismattmann] and I had discussed this before.
- Although a buffer queue is an appealing idea, we are not using it because we
wanted to make things more RESTful (so the user/graph can request pages for any
index range from the temporary store/NodeDb, or all the data from any specific,
previously updated segment). Also, in case of a failure, if the program
requests the nodes again and the buffer queue no longer has them, we would
have to wait for the round to end and read them from the segment. But we can
delve into [~wastl-nagel]'s idea, I guess, if some strict or cautionary
measures are taken on the client side :). What do you think, [~chrismattmann]
and [~sujenshah]?
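
To make the index-range idea concrete, here is a minimal sketch of a temporary
in-memory store that answers "give me nodes from index A to index B" and
returns the same data when a request is repeated. Everything in it is an
assumption for illustration: the class NodeStoreSketch, its Node record, and
the add/range/size methods are hypothetical names, not the actual FetchNodeDb
code or the Nutch REST API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the temporary node store; NOT Nutch's FetchNodeDb.
class NodeStoreSketch {

    // One entry per fetched page: its URL, fetch status, and outlinks found.
    static final class Node {
        final String url;
        final int status;
        final List<String> outlinks;

        Node(String url, int status, List<String> outlinks) {
            this.url = url;
            this.status = status;
            this.outlinks = outlinks;
        }
    }

    // Nodes are kept in fetch order, so a list index identifies a node.
    private final List<Node> nodes = new ArrayList<>();

    // Called by the (assumed) fetcher side; returns the index of the new node.
    synchronized int add(Node node) {
        nodes.add(node);
        return nodes.size() - 1;
    }

    // Returns a copy of the nodes in [from, to), clamped to what is stored.
    // Reading does not remove anything, so a client that failed can simply
    // ask for the same range again.
    synchronized List<Node> range(int from, int to) {
        int start = Math.max(0, from);
        int end = Math.min(nodes.size(), to);
        if (start >= end) {
            return Collections.emptyList();
        }
        return new ArrayList<>(nodes.subList(start, end));
    }

    synchronized int size() {
        return nodes.size();
    }
}
```

A client would page through such a store by asking for range(0, 50), then
range(50, 100), and so on. A buffer queue, by contrast, would hand each node
out only once.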
> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher, REST_api
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bandwidth in large crawls.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)