[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546650#comment-14546650
 ] 

Sujen Shah commented on NUTCH-2011:
-----------------------------------

Thank you for your input [~wastl-nagel]. 

I agree the feature should be optional and off by default, and have submitted 
a PR to handle that: https://issues.apache.org/jira/browse/NUTCH-2015

The FetchNodeDb is required as part of getting real-time information from the 
fetcher while it is running. Currently CrawlDb, LinkDb and the segments give 
page-related information only once the entire fetch round is complete. Fetch 
rounds at greater depth can go on for hours without giving out any information 
about which pages are being crawled until the round is complete. To address 
this, a real-time reporting facility was needed, hence the FetchNodeDb.

Currently the database is in-memory only, but we are brainstorming ways to 
make it persistent to reduce memory usage. Some ideas are:
1. Write the FetchNodeDb to a file after each fetch round (i.e., at each depth 
level)
2. Keep a threshold on the number of FetchNodes within the Db and dump them to 
a file once it is reached (similar to crawldb/linkdb)
What would be your suggestions to achieve the above?
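
To make idea 2 concrete, here is a rough sketch of a threshold-based spill. 
It is not actual Nutch code: the class SpillingNodeStore, its methods and the 
tab-separated file layout are all hypothetical, and a FetchNode is reduced to 
just its URL string for brevity.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a bounded in-memory store that spills to a file. */
class SpillingNodeStore {
    private final int threshold;   // max entries held in memory (assumed setting)
    private final Path spillFile;  // file that receives the dumped entries
    private final Map<Integer, String> nodes = new LinkedHashMap<>();
    private int nextId = 0;

    SpillingNodeStore(int threshold, Path spillFile) {
        this.threshold = threshold;
        this.spillFile = spillFile;
    }

    /** Records a fetched URL and dumps everything once the threshold is hit. */
    synchronized void put(String url) throws IOException {
        nodes.put(nextId++, url);
        if (nodes.size() >= threshold) {
            flush();
        }
    }

    /** Appends all in-memory entries to the spill file, then clears them. */
    synchronized void flush() throws IOException {
        if (nodes.isEmpty()) {
            return;
        }
        try (BufferedWriter w = Files.newBufferedWriter(
                spillFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Map.Entry<Integer, String> e : nodes.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        nodes.clear();
    }

    /** Number of entries currently held in memory. */
    synchronized int inMemorySize() {
        return nodes.size();
    }
}
```

Calling flush() at the end of each fetch round would cover idea 1 with the 
same class. The open point is how the REST endpoint would read back entries 
that have already been dumped.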

Regarding reporting to the FetchNodeDb while fetcher.parse is false (off): 
initially this was intended, as the output from the FetchNodeDb is going to be 
used to create a D3 graph which would update dynamically. For the graph, we 
need the outlinks of a URL, which we can get only after parsing it. 
But you have correctly pointed out that constructing FetchNodes is useless 
work when fetcher.parse is off (the default config). I will create a patch 
for this and submit a PR. 

Thanks

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bandwidth in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
