I thought I would follow up on this for anyone who has also had the problem.
I found the root of the problem to be that conf/prefix-url.txt is not
included in the nutch-0.8.1 download on the site. Therefore the file cannot
be loaded when running the inject/generate/etc. calls.
I'm not sure why the crawl command still worked properly, but adding the
file and filling it with 'http' solved my problem.
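For anyone who wants to rule this out before running inject or generate, here is a minimal sketch that checks whether a config file is visible on the classpath. It assumes Nutch looks up its conf files as classpath resources, so the conf directory has to be on the classpath when you run it. The class name ConfResourceCheck and the method isOnClasspath are made up for illustration and are not part of Nutch.

```java
import java.net.URL;

public class ConfResourceCheck {

    // Returns true if the named resource can be located on the classpath.
    // Assumption: Nutch resolves its conf files the same way.
    public static boolean isOnClasspath(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ConfResourceCheck.class.getClassLoader();
        }
        URL location = loader.getResource(resourceName);
        return location != null;
    }

    public static void main(String[] args) {
        // The file name can be passed as the first argument; the default is
        // the name mentioned in this thread.
        String name = args.length > 0 ? args[0] : "prefix-url.txt";
        String status = isOnClasspath(name) ? "found" : "NOT found";
        System.out.println(name + " " + status + " on classpath");
    }
}
```

Run it with the same classpath you use for the standalone Injector. If the file is reported as not found, the filter presumably has no prefixes to match against.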
-Charlie
On 2/12/07, Charlie Williams <[EMAIL PROTECTED]> wrote:
Yes, I have been debugging. Everything looks fine as it goes into the
mapper code in Injector.java, at line 69:

    try {
      url = urlNormalizer.normalize(url);
      url = filters.filter(url);   // <- this is what returns null
    } catch ( ... ) {
      ...
    }
    if (url != null) {             // <-- this check always fails because of that
      ...
    }

I traced the call into PrefixURLFilter.filter(url) and always get null
returned from here:

    if (trie.shortestMatch(url) == null)
      return null;
    else
      return url;

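The effect of that check can be imitated with a small stand-in. This is only a sketch: it uses a plain list and startsWith instead of the trie that PrefixURLFilter uses, and the class name PrefixFilterSketch and the example URL are hypothetical. It shows that when no prefixes are configured, every URL comes back as null, and that a single 'http' prefix lets http URLs through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PrefixFilterSketch {

    private final List<String> prefixes = new ArrayList<>();

    public PrefixFilterSketch(List<String> configuredPrefixes) {
        prefixes.addAll(configuredPrefixes);
    }

    // Same contract as the snippet above: return the URL if some
    // configured prefix matches it, otherwise return null.
    public String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "http://example.com/";

        // No prefixes configured: nothing can match, so the URL is rejected.
        PrefixFilterSketch empty =
            new PrefixFilterSketch(Collections.<String>emptyList());
        System.out.println("no prefixes:   " + empty.filter(url));

        // One prefix, 'http': the URL matches and is passed through.
        PrefixFilterSketch http =
            new PrefixFilterSketch(Arrays.asList("http"));
        System.out.println("prefix 'http': " + http.filter(url));
    }
}
```

With an empty prefix list the first call returns null, which is the same symptom as the null coming out of filters.filter(url) in the Injector.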
Does this clarify the root of the problem?
-Charlie
On 2/12/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
>
> Hey Charlie,
>
> What do the logs say in logs/hadoop.log?
>
> You can also try to set a breakpoint in Eclipse in the map method of
> InjectMapper and the reduce method of InjectReducer. When you get there in
> debug mode, inspect your variables and check if everything looks good.
> You can also check whether your urls make it through
> url = filters.filter(url); in InjectMapper.
>
> HTH,
> Renaud
>
>
> Charlie Williams wrote:
> > I have been trying to learn the Nutch code base by stepping through
> > the code in debug mode of Eclipse. However, I am unable to understand
> > a piece of code in the Injector.
> >
> > When I run the crawl command used for intranet crawling, it
> > successfully injects urls into the database. When I run the standalone
> > Injector on the same set of urls, it injects nothing, returning null
> > from each pass of PrefixURLFilter.filter( url ).
> >
> > I saw in an archive that the crawl command uses crawl-tool.xml for
> > its config, where otherwise nutch-site.xml is used. So I made the
> > nutch-site.xml file exactly the same, but this seemed to have no
> > effect. Does anyone know why?
> >
> > I apologize for the newb question, but any help would be greatly
> > appreciated.
> >
> > -Charlie
> >
>
>
> --
> Renaud Richardet +1 617 230 9112
> my email is my first name at apache.org http://www.oslutions.com
>
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general