> We have created a new nutch configuration directory.
> The only difference between this configuration directory
> and the normal one is the automaton-urlfilter.txt,
> crawl-urlfilter.txt, and regex-urlfilter.txt files.
>
> With the new nutch configuration directory, we set the env
> var NUTCH_CONF_DIR. I know the nutch script is getting this
> value. I put in some debug statements. I can see it added
> properly to CLASSPATH. I also set HADOOP_CONF_DIR. This
> also does not have any effect.
>
> I am checking the access times on the regex-urlfilter.txt file.
> The new regex-urlfilter.txt is not accessed. The process is
> only accessing the regex-urlfilter.txt file in the $NUTCH_HOME/conf
> directory. It does not appear to be using the NUTCH_CONF_DIR.
>
> Does anyone have any thoughts or ideas for what we can do to
> get this to work with the NUTCH_CONF_DIR? Thank you in
> advance for any pointers.

I fixed the problem. The hadoop processes have to be stopped
before NUTCH_CONF_DIR and HADOOP_CONF_DIR are changed.

First, stop the hadoop processes. Then set NUTCH_CONF_DIR and
HADOOP_CONF_DIR to the special configuration directory and start
the hadoop processes again.

Once the filtering is done, stop the hadoop processes, unset the
NUTCH_CONF_DIR and HADOOP_CONF_DIR environment variables, and
restart the hadoop processes.

Everything works like a charm now.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
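
For anyone who wants to script the sequence described above, here is a rough sketch. It is not the exact script we run. The function name run_with_conf_dir and the STOP_HADOOP / START_HADOOP variables are made up for illustration, and the default paths assume the stop-all.sh and start-all.sh scripts under $HADOOP_HOME/bin, so adjust them for your installation. The filtering command itself is passed in as arguments.

```shell
#!/bin/sh
# Sketch only: STOP_HADOOP, START_HADOOP and run_with_conf_dir are
# hypothetical names. The default paths are an assumption about where
# the hadoop stop/start scripts live; override them as needed.
: "${STOP_HADOOP:=${HADOOP_HOME:-.}/bin/stop-all.sh}"
: "${START_HADOOP:=${HADOOP_HOME:-.}/bin/start-all.sh}"

# usage: run_with_conf_dir CONF_DIR COMMAND [ARGS...]
run_with_conf_dir() {
    conf_dir=$1
    shift

    # 1. Stop the hadoop processes before touching the variables.
    #    (Left unquoted on purpose so the variable may hold a command
    #    plus arguments.)
    $STOP_HADOOP || return 1

    # 2. Point both variables at the special configuration directory.
    NUTCH_CONF_DIR=$conf_dir
    HADOOP_CONF_DIR=$conf_dir
    export NUTCH_CONF_DIR HADOOP_CONF_DIR

    # 3. Start hadoop again, then run the filtering command.
    status=1
    if $START_HADOOP; then
        "$@"
        status=$?
    fi

    # 4. Stop hadoop, drop the variables, and restart hadoop with the
    #    normal configuration, even if the command failed.
    $STOP_HADOOP
    unset NUTCH_CONF_DIR HADOOP_CONF_DIR
    $START_HADOOP

    return $status
}
```

It would be called along the lines of `run_with_conf_dir /path/to/special/conf your-filter-command args`, where the path and the command are placeholders for whatever you actually run. The function returns the exit status of that command, so a wrapper can tell whether the filtering step succeeded.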
