Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "ErrorMessagesInNutch2" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/ErrorMessagesInNutch2?action=diff&rev1=10&rev2=11

  This page acts as a repository for potential error messages you might experience whilst using Nutch 2.0. It will most likely be dynamic and fairly general in nature, because Nutch 2.0 can be combined with a variety of additional software projects and each combination brings its own potential for errors, both in Nutch and in the other projects.

  <<TableOfContents(3)>>
+
+ == Nutch logging shows Skipping http://myurlForParsing.com; different batch id (null) ==
+
+ If your logging level is set to DEBUG, this may occur whilst executing '''FetcherJob''', '''ParserJob''' and '''IndexerJob'''.
+ For example, within ParserJob#map, the page is skipped when the !NutchJob.shouldProcess check fails because Mark.FETCH_MARK.checkMark(page) returns null.
+ The code for this can be seen below.
+
+ [code]
+
+   @Override
+   public void map(String key, WebPage page, Context context)
+       throws IOException, InterruptedException {
+     Utf8 mark = Mark.FETCH_MARK.checkMark(page);
+     String unreverseKey = TableUtil.unreverseUrl(key);
+     if (batchId.equals(REPARSE)) {
+       LOG.debug("Reparsing " + unreverseKey);
+     } else {
+       if (!NutchJob.shouldProcess(mark, batchId)) {
+         if (LOG.isDebugEnabled()) {
+           LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
+         }
+         return;
+ [code]
+
+ The question is: in which scenarios can a page that we attempt to fetch, parse or index have a null value for *_MARK?
+
+  * Whilst the jobs are executing, every backend entry has to be loaded, because Apache Gora has no filters (the equivalent of "where" clauses in SQL). This means that you will see a lot of entries with the wrong marks.
+  * Null values are possible too. Consider the steps inject -> generate -> inject -> fetch: the second inject leaves entries in the db without fetch marks, and the fetcher sees them later.
+
+ Updating the web database with the DBUpdaterJob seems to sort this out.

  == gora-cassandra >0.2 InvalidRequestException(why:Keyspace webpage does not exist) ==
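As a footnote to the "different batch id (null)" entry, the two scenarios listed there (entries carrying another batch's mark, and entries with no mark at all) can be sketched in isolation. This is a minimal, self-contained illustration and not the Nutch implementation: the class BatchMarkSketch, its simplified shouldProcess, the skipMessage helper, the batch ids and the example URLs are all hypothetical, and plain strings stand in for Utf8 marks. It assumes that a missing mark is treated the same way as a mismatching one, which is what the log message suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and data are illustrative, not taken from Nutch.
class BatchMarkSketch {

    // Simplified stand-in for the mark check (assumption): a page is
    // processed only if its mark is present and equals the current batch id.
    static boolean shouldProcess(String mark, String batchId) {
        return mark != null && mark.equals(batchId);
    }

    // Builds a message in the same shape as the DEBUG log line.
    static String skipMessage(String url, String mark) {
        return "Skipping " + url + "; different batch id (" + mark + ")";
    }

    public static void main(String[] args) {
        String batchId = "batch-2";

        // url -> mark, as a job would see them when it has to read every row.
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://a.example/", "batch-2"); // belongs to the current batch
        marks.put("http://b.example/", "batch-1"); // left over from an earlier batch
        marks.put("http://c.example/", null);      // injected after generate: no mark

        for (Map.Entry<String, String> e : marks.entrySet()) {
            if (shouldProcess(e.getValue(), batchId)) {
                System.out.println("Processing " + e.getKey());
            } else {
                System.out.println(skipMessage(e.getKey(), e.getValue()));
            }
        }
    }
}
```

In this sketch only the first URL is processed. The third one produces the "(null)" variant of the message, which corresponds to the inject -> generate -> inject -> fetch sequence.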

