Hi,

I am trying to run a crawl from inside a Scala program without using the bin/nutch command. I set all the environment variables that nutch.sh sets when a crawl runs through bin/nutch, and then I call Crawl.main(params) directly, but I get the following error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)

Here is the code I am trying to write:

import scala.io.Source

for (line <- Source.fromFile("/root/classpaths.sh").getLines)
  if (line != null) {
    var bo: Array[Byte] = new Array[Byte](100)
    var cmd: Array[String] = new Array[String](3)
    cmd(0) = "bash"
    cmd(1) = "-c"
    cmd(2) = line
    var checkingCrawl: Process = Runtime.getRuntime().exec(cmd)
  }

var params: Array[String] = new Array[String](5)
params(0) = "urls"
params(1) = "-dir"
params(2) = "insidejava"
params(3) = "-depth"
params(4) = "1"
org.apache.nutch.crawl.Crawl.main(params)

Contents of classpaths.sh:

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  #echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  #echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# for developers, add plugins, job & test code to CLASSPATH
if [ -d "$NUTCH_HOME/build/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi

if [ $IS_CORE == 0 ]
then
  for f in $NUTCH_HOME/build/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f
  done
  # for releases, add Nutch job to CLASSPATH
  for f in $NUTCH_HOME/nutch-*.job; do
    CLASSPATH=${CLASSPATH}:$f
  done
else
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi

# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done

# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
  JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
  if [ -d "$NUTCH_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${NUTCH_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
fi

# restore ordinary behaviour
unset IFS

# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
  NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
  NUTCH_LOGFILE='hadoop.log'
fi

NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi

Contents of hadoop.log:

2009-07-27 18:48:55,345 INFO  crawl.Crawl - crawl started in: insidejava
2009-07-27 18:48:55,347 INFO  crawl.Crawl - rootUrlDir = urls
2009-07-27 18:48:55,347 INFO  crawl.Crawl - threads = 10
2009-07-27 18:48:55,347 INFO  crawl.Crawl - depth = 1
2009-07-27 18:48:55,779 INFO  crawl.Injector - Injector: starting
2009-07-27 18:48:55,780 INFO  crawl.Injector - Injector: crawlDb: insidejava/crawldb
2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: urlDir: urls
2009-07-27 18:48:55,781 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-27 18:48:55,974 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-07-27 18:49:19,685 WARN  plugin.PluginRepository - Plugins: not a file: url.
Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Plugins:
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - NONE
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - Registered Extension-Points:
2009-07-27 18:49:19,686 INFO  plugin.PluginRepository - NONE
2009-07-27 18:49:19,689 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

Does anyone have an idea how to solve this issue? Please reply to this thread.

Thanks in advance,
Sailaja
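P.S. To make the failure mode easier to reproduce, here is a minimal, Nutch-free Java sketch of the pattern my Scala loop uses (running each line of classpaths.sh through a fresh "bash -c"). The class name and the /tmp/foo value are made up for illustration. It shows that a variable exported in the child shell never reaches the JVM that would go on to call Crawl.main(), so I suspect my classpath setup is not actually taking effect:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Hypothetical demo class (not part of Nutch): environment variables
// exported via Runtime.exec("bash -c ...") live only in the short-lived
// child shell, never in the calling JVM.
public class EnvIsolationDemo {

    // Run one line the way the Scala loop does, and return what the
    // child shell itself sees for CLASSPATH.
    static String childClasspath() throws Exception {
        Process p = Runtime.getRuntime().exec(
                new String[] { "bash", "-c", "export CLASSPATH=/tmp/foo; echo $CLASSPATH" });
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line = r.readLine();   // the child prints "/tmp/foo"
        p.waitFor();
        return line;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("child shell sees: " + childClasspath());
        // The JVM that would call Crawl.main() is unchanged:
        System.out.println("this JVM sees:    " + System.getenv("CLASSPATH"));
    }
}
```

If this is indeed the problem, the entries classpaths.sh builds up (the conf directory, the plugins, the nutch-*.job file) would need to be on the JVM's own classpath when it starts, which would also be consistent with the "Registered Plugins: NONE" and "URLNormalizer not found" lines in hadoop.log above.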