Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "ErrorMessagesInNutch2" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/ErrorMessagesInNutch2?action=diff&rev1=10&rev2=11

  This page acts as a repository for error messages you might encounter 
whilst using Nutch 2.0. It will most likely be dynamic and fairly general in 
nature, because of the variety of additional software projects which can be 
combined with Nutch 2.0 and the potential for errors this presents, both in 
Nutch itself and in the projects used in combination with it.
  
  <<TableOfContents(3)>>
+ 
+ == Nutch logging shows Skipping http://myurlForParsing.com; different batch id (null) ==
+ 
+ If your logging level is set to DEBUG, this message may appear whilst 
executing '''FetcherJob''', '''ParserJob''' and '''IndexerJob'''.
+ For example, within ParserJob#map, the !NutchJob.shouldProcess check rejects 
the page because Mark.FETCH_MARK.checkMark(page) returns null, so the page is 
skipped and the message is logged. 
+ The code for this can be seen below.
+ 
+ [code]
+ 
+     @Override
+     public void map(String key, WebPage page, Context context)
+         throws IOException, InterruptedException {
+       Utf8 mark = Mark.FETCH_MARK.checkMark(page);
+       String unreverseKey = TableUtil.unreverseUrl(key);
+       if (batchId.equals(REPARSE)) {
+         LOG.debug("Reparsing " + unreverseKey);
+       } else {
+         if (!NutchJob.shouldProcess(mark, batchId)) {
+           if (LOG.isDebugEnabled()) {
+             LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
+           }
+           return;
+ [code]
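
The effect of a null mark can be reproduced in isolation. The following is a minimal sketch, not the Nutch source: the class MarkCheckSketch, the ALL_BATCHES wildcard and the use of String in place of Utf8 are assumptions made for illustration. It only assumes, as the log message above implies, that a page without a mark is never accepted for processing.

```java
// Simplified stand-in for the batch-mark check; NOT the Nutch source.
final class MarkCheckSketch {

    // Hypothetical wildcard meaning "process every batch".
    static final String ALL_BATCHES = "-all";

    /** Returns true only when the page carries a mark matching the requested batch. */
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false; // the previous job never marked this page, so it is skipped
        }
        return batchId.equals(ALL_BATCHES) || mark.equals(batchId);
    }

    /** Mirrors the logging branch of the excerpt; returns null when the page is processed. */
    static String skipMessage(String url, String mark, String batchId) {
        if (!shouldProcess(mark, batchId)) {
            return "Skipping " + url + "; different batch id (" + mark + ")";
        }
        return null;
    }

    public static void main(String[] args) {
        // A page whose FETCH_MARK was never set produces the message from the heading.
        System.out.println(skipMessage("http://myurlForParsing.com", null, "1234"));
    }
}
```

Running main prints the same text as the section heading, because Java renders a null reference as "null" in string concatenation.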
+ 
+ The question is: in which scenarios can a page that we attempt to fetch, 
parse or index have a null value for *_MARK?
+ 
+ * Whilst the jobs are executing, every backend entry has to be loaded, 
because there are no filters ("where" clauses in SQL) in Apache Gora. This 
means that you will see a lot of entries with the wrong marks. 
+ * Null values are possible too. Consider these steps: inject -> generate 
-> inject -> fetch. The second inject leaves entries in the db without fetch 
marks, and the fetcher sees those entries later.
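
The inject -> generate -> inject -> fetch sequence can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not Nutch or Gora code: InjectGenerateSketch and its methods are hypothetical, a plain map stands in for the backend table, and generate is simplified to mark every row present at the time it runs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the inject -> generate -> inject -> fetch sequence; NOT Nutch code.
final class InjectGenerateSketch {

    // url -> batch mark (null means "never generated"); stands in for the backend table.
    private final Map<String, String> store = new LinkedHashMap<>();

    /** Adds a URL without any mark, leaving existing rows untouched. */
    void inject(String url) {
        if (!store.containsKey(url)) {
            store.put(url, null);
        }
    }

    /** Simplification: marks every row currently in the store with the batch id. */
    void generate(String batchId) {
        for (Map.Entry<String, String> row : store.entrySet()) {
            row.setValue(batchId);
        }
    }

    /** Scans ALL rows (no backend-side filter) and returns the URLs that get skipped. */
    List<String> fetch(String batchId) {
        List<String> skipped = new ArrayList<>();
        for (Map.Entry<String, String> row : store.entrySet()) {
            if (!batchId.equals(row.getValue())) {
                skipped.add(row.getKey()); // null or foreign mark
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        InjectGenerateSketch db = new InjectGenerateSketch();
        db.inject("http://a.example/");
        db.generate("batch-1");
        db.inject("http://b.example/"); // arrives after generate, so it has no mark
        System.out.println(db.fetch("batch-1"));
    }
}
```

In this model only the URL injected after generate is reported as skipped, which matches the second scenario described in the list.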
+ 
+ It seems that updating the web database with the DBUpdaterJob sorts this 
out.
  
  == gora-cassandra >0.2 InvalidRequestException(why:Keyspace webpage does not exist) ==
  
