The following page has been changed by DawidWeiss:
http://wiki.apache.org/nutch/ClusteringPlugin

The comment on the change is:
Custom clustering processes.

------------------------------------------------------------------------------
  </local-process>
  }}}
   * The filters you see in the process descriptor should also be available. 
Some of them are built into the Carrot2 core; others should be copied from 
DCS's distribution to the same temporary folder we copied the process 
definition to. In our case the following filter definition files should be 
copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, 
`filter-case-normalizer.bsh` and `filter-stc.bsh`. 
-  * Process and component descriptors are read as a resoure (relative to 
classpath). Jetty can be configured to have an additional classpath entry as a 
folder, but it slightly complicates things (hierarchy of classloaders may 
result in some hard-to-track errors). It will be easier to just place all the 
required stuff in Nutch's Web application context under `WEB-INF`. If you work 
with WAR file directly, you'll need to add the resources mentioned below to the 
WAR file (it's a ZIP file, so it's not a problem).
+  * Process and component descriptors are read as a resource (relative to 
the classpath). Jetty can be configured to have an additional classpath entry 
as a folder, but this slightly complicates things (the hierarchy of 
classloaders may result in some hard-to-track errors). It is easier to just 
place all the required files in Nutch's Web application context under 
`WEB-INF`. If you work with a WAR file directly, you'll need to add the 
resources mentioned below to the WAR file (it's a ZIP file).
   * Copy process and component descriptor files to 
`{NUTCH-CONTEXT}/WEB-INF/classes/`.
-  * Copy all JAR files from the DCS (`WEB-INF/lib/*.jar`) to 
`{NUTCH-CONTEXT}/WEB-INF/lib`. Overwrite older libraries, whenever prompted.
+  * Copy certain JAR files from the DCS (`WEB-INF/lib/*.jar`) to 
`{NUTCH-CONTEXT}/WEB-INF/lib`. Which JARs should be copied is not an easy 
question to answer. In general, you can copy everything that won't clash with 
your Web container. We suggest ''not'' copying the following: 
`carrot2-util-log4j*.jar` (log4j configuration files), `commons-logging*.jar` 
(clashes with Nutch's version), `jasper*.jar` and `org.mortbay*.jar` (already 
present in Web containers). The rest should be safe to just copy, overwriting 
anything present in Nutch.
   * Finally, the path to the clustering process should be added to 
`{NUTCH-CONTEXT}/WEB-INF/classes/nutch-site.xml`:
  {{{
  <property>
@@ -121, +121 @@

    <value>/alg-stc-en.xml</value>
  </property>
  }}}
-  * Restart your Web application container. The clustering plugin should use 
STC clustering algorithm if everything was ok.
+  * Restart your Web application container. The clustering plugin should use 
the STC clustering algorithm if everything went well. If something is wrong, 
inspect your log files; they usually indicate the problem (e.g., missing 
classes) quite clearly.
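
The JAR selection rule from the list above can be sketched as a small script. This is only an illustration, not part of Nutch or Carrot2: the function names `should_copy` and `copy_dcs_jars` and both directory arguments are hypothetical, and the exclusion patterns are simply the four JAR name patterns the page suggests not copying.

```python
import fnmatch
import shutil
from pathlib import Path

# JAR name patterns the page suggests NOT copying from the DCS.
EXCLUDED_PATTERNS = [
    "carrot2-util-log4j*.jar",  # log4j configuration files
    "commons-logging*.jar",     # clashes with Nutch's version
    "jasper*.jar",              # already present in Web containers
    "org.mortbay*.jar",         # already present in Web containers
]


def should_copy(jar_name):
    """Return True if jar_name matches none of the excluded patterns."""
    return not any(
        fnmatch.fnmatchcase(jar_name, pattern) for pattern in EXCLUDED_PATTERNS
    )


def copy_dcs_jars(dcs_lib, nutch_lib):
    """Copy the allowed JARs from dcs_lib into nutch_lib.

    Existing files with the same name are overwritten.
    Returns the sorted list of JAR file names that were copied.
    """
    dcs_lib = Path(dcs_lib)
    nutch_lib = Path(nutch_lib)
    nutch_lib.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(dcs_lib.glob("*.jar")):
        if should_copy(jar.name):
            shutil.copy2(jar, nutch_lib / jar.name)
            copied.append(jar.name)
    return copied
```

A call might look like `copy_dcs_jars("dcs/WEB-INF/lib", "nutch/WEB-INF/lib")`, where both paths are placeholders for the DCS library folder and `{NUTCH-CONTEXT}/WEB-INF/lib`. The returned list makes it easy to check what was actually copied before restarting the container.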
  
