Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchGotchas" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchGotchas?action=diff&rev1=5&rev2=6

- The following list acts as a comprehensive list of Nutch "Gotchas" which 
should act as a suitable prerequisite source of implicit information currently 
existing in the Nutch Codebase and in its general usage.
+ The following acts as a comprehensive list of Nutch "Gotchas" which should 
act as a suitable prerequisite source of implicit information currently 
existing in the Nutch Codebase and in its general usage.
  
  == Developing Nutch: Gotchas ==
  
- Developing Nutch Gotchas should be driven purely by community movement and 
consensus that it is necessary to make implicit information explicit in an 
attempt to create an earier working environment for Nutch users at all levels. 
The list below has been compiled as a repository of information which emerged 
during discussions on the user@ list. As with many areas of the Nutch wiki, 
this list exists as a non static resource and all Nutch users are invited to 
edit based upon experience and community consensus.
+ Developing Nutch Gotchas should be driven purely by community opinion and 
consensus that it is necessary to make implicit information explicit in an 
attempt to create an easier working environment for Nutch users at all levels. 
The list below has been compiled as a repository of information which emerged 
during discussions on various lists. As with many areas of the Nutch wiki, this 
list exists as a non static resource and all Nutch users are invited to edit 
based upon experience and community consensus.
+ 
+ <<TableOfContents(3)>>
  
  == Current Gotchas and using them: ==
  
- '''No agents listed in 'http.agent.name' property''': 
+ === No agents listed in 'http.agent.name' property ===: 
  
  Since Nutch 1.3, Nutch is run from one of the two runtime directories 
(runtime/local or runtime/deploy). The conf files should therefore be modified 
in runtime/local/conf, not in $NUTCH_HOME/conf.
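
A quick way to confirm that the property is set in the file Nutch actually reads is to parse that file and look the value up. The following is a minimal sketch, not part of Nutch: the helper names and the sample agent name are hypothetical, and it assumes the property lives in a Hadoop-style configuration file (for example nutch-site.xml under runtime/local/conf) with the usual configuration/property/name/value layout.

```python
import xml.etree.ElementTree as ET


def get_property(conf_xml, name):
    """Return the stripped <value> of the named property, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def agent_name_is_set(conf_xml):
    """True only if http.agent.name is present and non-empty."""
    return bool(get_property(conf_xml, "http.agent.name"))


# Hypothetical sample content; "MyTestCrawler" is a placeholder agent name.
SAMPLE_CONF = """<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>"""

EMPTY_CONF = "<configuration></configuration>"
```

To check a real installation, read the file first, for example `agent_name_is_set(open("runtime/local/conf/nutch-site.xml").read())`, adjusting the path to the local layout.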
  
- '''Nutch-1016: Strip UTF-8 non-character codepoints''':
+ === Nutch-1016: Strip UTF-8 non-character codepoints ===:
  
  This JIRA issue affects the indexer and relates to the stripping of UTF-8 
non-character codepoints which exist within some documents and was initially 
discovered during large crawls. When indexing to Solr this will yield the 
following exception:
  
@@ -25, +27 @@

  The fix (committed by Markus) for the SolrWriter class passes the value of 
the content field to a method to strip away non-characters, effectively 
avoiding the runtime exception. Various patches are available 
[[https://issues.apache.org/jira/browse/NUTCH-1016|here]]
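
To illustrate the idea of stripping non-characters before a value is sent to the indexer, here is a minimal sketch in Python. It is not the committed SolrWriter patch (which is Java and may filter a different set of characters); the function names are hypothetical, and the sketch assumes the Unicode definition of noncharacters: U+FDD0 to U+FDEF, plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on).

```python
def is_noncharacter(cp):
    """True if the code point is a Unicode noncharacter."""
    # U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(text):
    """Return text with every Unicode noncharacter removed."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


# Ordinary characters, including non-ASCII ones, are left untouched.
cleaned = strip_noncharacters("caf\u00e9\uffff crawl\ufdd0er")
# cleaned == "café crawler"
```

Filtering by code point rather than by encoded byte keeps the check independent of how the content was originally encoded.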
  
  
- '''Removal of crawl-urlfilter.txt''':
+ === Removal of crawl-urlfilter.txt ===:
  
  
  As of the release of Nutch 1.3, crawl-urlfilter.txt has been purposefully 
removed, as it added no functionality beyond the other URL filters (automaton | 
regex). By default the URL filters contain the rule (+.), which is what 
crawl-urlfilter.txt used to do.
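
The effect of a trailing (+.) rule can be shown with a small model of a sign-prefixed regex filter. This is a sketch under stated assumptions, not Nutch's filter implementation: it assumes each rule is a line starting with + (accept) or - (reject) followed by a regular expression, that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is rejected. The function names and sample rules are illustrative, and Python's `re` syntax differs in places from Java's.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; default is reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Illustrative rule set: reject a few schemes, then accept everything else.
RULES = parse_rules("""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# accept anything else
+.
""")
```

With these rules, `accepts(RULES, "http://example.com/")` is true and `accepts(RULES, "ftp://example.com/file")` is false; dropping the final `+.` line would cause every URL to be rejected.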
  
- '''Confusion about "solrUrl is not set, indexing will be skipped..." log 
message''':
+ === Confusion about "solrUrl is not set, indexing will be skipped..." log 
message ===:
  
  This relates to the removal of the Nutch Lucene legacy dependence to support 
indexing with Solr, and the road map to enable various other indexing 
implementations. We have two options for passing the indexing command to Nutch. 
  
   * During the crawl command, as explained 
[[http://wiki.apache.org/nutch/RunningNutchAndSolr#A3._Crawl_your_first_website|here]].
   * or during the later stage of sending an individual solrindex command to 
Solr as explained 
[[http://wiki.apache.org/nutch/RunningNutchAndSolr#A6._Integrate_Solr_with_Nutch|here]].
 
  
- '''DiskErrorException while fetching''':
+ === DiskErrorException while fetching ===:
  
  Questions like this one arise fairly regularly on the user@ list
  
