Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by DawidWeiss:
http://wiki.apache.org/nutch/ClusteringPlugin

The comment on the change is:
Custom clustering processes.

------------------------------------------------------------------------------
  </local-process>
  }}}
  * The filters you see in the process descriptor should also be available. Some of them are built in in the Carrot2 core, other should be copied from DCS's distribution to the same temporary folder we copied the process definition to. In our case the following filter definition files should be copied: `filter-language-detection-en.bsh`, `filter-tokenizer.bsh`, `filter-case-normalizer.bsh` and `filter-stc.bsh`.
- * Process and component descriptors are read as a resoure (relative to classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required stuff in Nutch's Web application context under `WEB-INF`. If you work with WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file, so it's not a problem).
+ * Process and component descriptors are read as a resource (relative to classpath). Jetty can be configured to have an additional classpath entry as a folder, but it slightly complicates things (hierarchy of classloaders may result in some hard-to-track errors). It will be easier to just place all the required stuff in Nutch's Web application context under `WEB-INF`. If you work with WAR file directly, you'll need to add the resources mentioned below to the WAR file (it's a ZIP file).
  * Copy process and component descriptor files to `{NUTCH-CONTEXT}/WEB-INF/classes/`.
- * Copy all JAR files from the DCS (`WEB-INF/lib/*.jar`) to `{NUTCH-CONTEXT}/WEB-INF/lib`. Overwrite older libraries, whenever prompted.
+ * Copy certain JAR files from the DCS (`WEB-INF/lib/*.jar`) to `{NUTCH-CONTEXT}/WEB-INF/lib`. Which JARs should be copied is not an easy question to answer. In general, you can copy everything that won't clash with your Web container. We suggest ''not'' to copy the following: `carrot2-util-log4j*.jar` (log4j configuration files), `commons-logging*.jar` (clashes with Nutch's version), `jasper*.jar` and `org.mortbay*.jar` (already present in Web containers). The rest should be safe to just copy, overwriting anything present in Nutch.
  * Finally, the path to the clustering process should be added to `{NUTCH-CONTEXT}/WEB-INF/classes/nutch-site.xml`:
  {{{
  <property>
@@ -121, +121 @@
  <value>/alg-stc-en.xml</value>
  </property>
  }}}
- * Restart your Web application container. The clustering plugin should use STC clustering algorithm if everything was ok.
+ * Restart your Web application container. The clustering plugin should use STC clustering algorithm if everything was ok. If something is wrong, inspect your log files -- they usually indicate the problem (i.e., missing classes) quite clearly.
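
The selective JAR copy described in the change above could be sketched as a small shell function. This is only an illustration and not part of the wiki page: the function name `copy_dcs_jars` and both directory arguments are hypothetical, and the exclusion patterns are simply the four file patterns the page suggests not copying.

```shell
#!/bin/sh
# Sketch: copy DCS libraries into the Nutch web application, skipping the
# JARs the page suggests leaving out. Both arguments are placeholders;
# point them at your own DCS and Nutch directories.
copy_dcs_jars() {
    src=$1   # assumed: the DCS's WEB-INF/lib directory
    dst=$2   # assumed: {NUTCH-CONTEXT}/WEB-INF/lib
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue   # glob matched nothing
        name=$(basename "$jar")
        case "$name" in
            carrot2-util-log4j*.jar|commons-logging*.jar|jasper*.jar|org.mortbay*.jar)
                echo "skipping $name"
                ;;
            *)
                cp -f "$jar" "$dst"/   # overwrites an older copy in Nutch
                ;;
        esac
    done
}
```

It would be called as `copy_dcs_jars /path/to/dcs/WEB-INF/lib /path/to/nutch/WEB-INF/lib`, with both paths replaced by real locations. Whether the remaining JARs are all safe in a given Web container still has to be checked against the container's own libraries.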