[ https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546650#comment-14546650 ]
Sujen Shah commented on NUTCH-2011:
-----------------------------------

Thank you for your input, [~wastl-nagel]. I agree the feature should be optional and off by default, and I have submitted a PR to handle that: https://issues.apache.org/jira/browse/NUTCH-2015

The FetchNodeDb is required as part of getting real-time information from the fetcher while it is running. Currently, CrawlDb, LinkDb, and segments give page-related information only when the entire fetch round is complete. Fetch rounds at greater depth can go on for hours without giving any information about which pages are being crawled until the round completes. To address this, a real-time reporting facility was needed, hence the FetchNodeDb.

Currently the database is in-memory, but we are brainstorming ways to make it persistent to reduce memory usage. Some ideas are:
1. Write the FetchNodeDb to a file after each fetch round (i.e., at each depth level).
2. Keep a threshold on the number of FetchNodes within the Db and, once it is reached, dump them to a file (similar to crawldb/linkdb).

What would be your suggestions to achieve the above?

Regarding reporting to FetchNodeDb while fetcher.parse is off: initially this was intended, as the output from the FetchNodeDb is going to be used to create a D3 graph that updates dynamically. For the graph, we need the outlinks from a URL, which we can get only after parsing it. But you have correctly pointed out that constructing FetchNodes is useless work when fetcher.parse is off (the default config). I will create a patch for this and submit a PR.

Thanks

> Endpoint to support realtime JSON output from the fetcher
> ---------------------------------------------------------
>
>                 Key: NUTCH-2011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2011
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, REST_api
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a
> real-time JSON response of the current/past Fetched URLs.
> This endpoint also includes pagination of the output to reduce data transfer
> bandwidth in large crawls.
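
The pagination mentioned in the issue description could work roughly as follows. This is only a sketch in Java: `FetchNodePager`, `page`, `pageIndex`, and `pageSize` are hypothetical names chosen for illustration and are not the Nutch REST API, and a plain list stands in for the real FetchNode entries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class FetchNodePager {
    private FetchNodePager() {}

    /**
     * Returns one page of entries so a client never has to pull the
     * whole in-memory list in a single response.
     *
     * @param nodes     all entries collected so far (assumed insertion-ordered)
     * @param pageIndex zero-based page number
     * @param pageSize  maximum number of entries per page
     */
    static <T> List<T> page(List<T> nodes, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        // Use long arithmetic so a large pageIndex cannot overflow an int.
        long from = (long) pageIndex * pageSize;
        if (from >= nodes.size()) {
            return Collections.emptyList(); // past the last page
        }
        int to = (int) Math.min((long) nodes.size(), from + pageSize);
        // Copy the slice so later changes to the source list do not affect the page.
        return new ArrayList<>(nodes.subList((int) from, to));
    }
}
```

A caller would request pages 0, 1, 2, and so on until an empty page comes back. Copying the slice keeps each response stable even if the fetcher keeps appending entries while the page is being serialized.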