Hi,

I am trying to run a crawl from inside a Scala program without using the bin/nutch command. I set all of the environment variables that the nutch script sets when a crawl runs through bin/nutch, and then call Crawl.main(params), but I get the following error:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)

Here is the code I am trying to run:
import scala.io.Source

// Run each line of classpaths.sh through a shell, trying to set up the
// same environment that bin/nutch would
for (line <- Source.fromFile("/root/classpaths.sh").getLines if line.nonEmpty) {
  val cmd = Array("bash", "-c", line)
  val p = Runtime.getRuntime.exec(cmd)
  p.waitFor()
}

// Equivalent of: bin/nutch crawl urls -dir insidejava -depth 1
val params = Array("urls", "-dir", "insidejava", "-depth", "1")
org.apache.nutch.crawl.Crawl.main(params)
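One thing I noticed while debugging (small sketch below; the variable name is made up for illustration): each Runtime.exec call runs its line in a separate child shell, so anything exported there dies with that child and never reaches the environment of the JVM that later calls Crawl.main:

```scala
object EnvIsolationCheck {
  def main(args: Array[String]): Unit = {
    // Export a variable in a child shell; the export only exists inside
    // that child process. MY_TEST_VAR is a made-up name for illustration.
    val p = Runtime.getRuntime.exec(Array("sh", "-c", "export MY_TEST_VAR=1"))
    p.waitFor()
    // The parent JVM's environment is unchanged, so this is still null.
    println(System.getenv("MY_TEST_VAR"))
  }
}
```

So if that is right, the whole loop over classpaths.sh may be having no effect at all on the process that runs the crawl.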
Contents of classpaths.sh:
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
#echo "run with heapsize $NUTCH_HEAPSIZE"
JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
#echo $JAVA_HEAP_MAX
fi
# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
# so that filenames w/ spaces are handled correctly in loops below
IFS=
# for developers, add plugins, job & test code to CLASSPATH
if [ -d "$NUTCH_HOME/build/plugins" ]; then
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi
if [ $IS_CORE == 0 ]
then
for f in $NUTCH_HOME/build/nutch-*.job; do
CLASSPATH=${CLASSPATH}:$f;
done
# for releases, add Nutch job to CLASSPATH
for f in $NUTCH_HOME/nutch-*.job; do
CLASSPATH=${CLASSPATH}:$f;
done
else
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi
# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
fi
# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
# setup 'java.library.path' for native-hadoop code if necessary
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
if [ -d "$NUTCH_HOME/build/native" ]; then
JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
fi
if [ -d "${NUTCH_HOME}/lib/native" ]; then
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
else
JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
fi
fi
fi
# restore ordinary behaviour
unset IFS
# default log directory & file
if [ "$NUTCH_LOG_DIR" = "" ]; then
NUTCH_LOG_DIR="$NUTCH_HOME/logs"
fi
if [ "$NUTCH_LOGFILE" = "" ]; then
NUTCH_LOGFILE='hadoop.log'
fi
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
fi
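Since a running JVM cannot change its own CLASSPATH or environment after startup, I am wondering whether I instead need to launch the crawl as a fresh java process with everything set before launch. A sketch of what I mean (the NUTCH_HOME path and the inherited CLASSPATH value are assumptions from my setup, and output handling is omitted):

```scala
object LaunchCrawl {
  def main(args: Array[String]): Unit = {
    // Take whatever CLASSPATH was assembled before this JVM started
    val cp = Option(System.getenv("CLASSPATH")).getOrElse("")
    // Build the child JVM's command line; the classpath and environment
    // are fixed before the child starts, which a live JVM cannot do for itself
    val pb = new ProcessBuilder(
      "java", "-cp", cp,
      "org.apache.nutch.crawl.Crawl", "urls", "-dir", "insidejava", "-depth", "1")
    pb.environment().put("NUTCH_HOME", "/nutch-1.0")  // path is a guess
    val proc = pb.start()
    proc.waitFor()
  }
}
```

Is that the recommended way to do this, or is there a supported in-process API?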
Contents of hadoop.log:
2009-07-27 18:48:55,345 INFO crawl.Crawl - crawl started in: insidejava
2009-07-27 18:48:55,347 INFO crawl.Crawl - rootUrlDir = urls
2009-07-27 18:48:55,347 INFO crawl.Crawl - threads = 10
2009-07-27 18:48:55,347 INFO crawl.Crawl - depth = 1
2009-07-27 18:48:55,779 INFO crawl.Injector - Injector: starting
2009-07-27 18:48:55,780 INFO crawl.Injector - Injector: crawlDb: insidejava/crawldb
2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: urlDir: urls
2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-27 18:48:55,974 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2009-07-27 18:49:19,685 WARN plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Plugins:
2009-07-27 18:49:19,686 INFO plugin.PluginRepository - NONE
2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Extension-Points:
2009-07-27 18:49:19,686 INFO plugin.PluginRepository - NONE
2009-07-27 18:49:19,689 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
        at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
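Reading the log, the part that stands out to me is that the plugin repository cannot load plugins from inside the .job file ("Plugins: not a file: url"), so no plugins get registered and the URLNormalizer extension point is missing. Would pointing plugin.folders at an unpacked on-disk plugins directory in conf/nutch-site.xml fix this? Something like the following (the path below is a guess for my install layout):

```xml
<!-- conf/nutch-site.xml: plugin.folders must point at a real on-disk
     plugins directory; the path below is a guess for my install -->
<property>
  <name>plugin.folders</name>
  <value>/nutch-1.0/plugins</value>
</property>
```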
Does anyone have an idea how to solve this? Any pointers would be appreciated.

Thanks in advance,
Sailaja