Can anybody help me with this exception? Because of this exception, the index gets corrupted.

ajaxtrend <[EMAIL PROTECTED]> wrote:
Now I realize that if the index contains no documents while there are URLs in the DB, an error is generated while removing duplicates. To get rid of the error, I put a hack in the dedup method of the DeleteDuplicates class:

// Removing duplicates
try {
  JobClient.runJob(job);
} catch (Exception e) {
  LOG.info("Dedup: Error occurred: " + e.getMessage());
}

This solves my problem.
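
A less blunt alternative might be to skip the dedup job when the index is empty. Below is a minimal sketch of such a guard (my own, not Nutch code), assuming a single local index directory at a hypothetical path and the Lucene 2.x IndexReader API bundled with this Nutch version:

import org.apache.lucene.index.IndexReader;

// Sketch of a guard: report whether the index holds at least one
// document, so dedup can be skipped entirely for an empty index.
// Assumes a local index directory and the Lucene 2.x API.
public class DedupGuard {
  public static boolean indexHasDocs(String indexDir) throws Exception {
    IndexReader reader = IndexReader.open(indexDir);
    try {
      return reader.numDocs() > 0;
    } finally {
      reader.close();
    }
  }
}

With that, the dedup job would only be submitted when, say, indexHasDocs("crawl/indexes/part-00000") returns true (the path is hypothetical, adjust for your layout), instead of catching and logging the failure.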

-BR


ajaxtrend wrote:
I tried adding a meta-tag in my customized indexing filter. Based on a URL pattern, I add a meta-tag called 'indexit' with the value true or false.

In the indexer's reduce() method, I check this meta-tag and index the document (or not) accordingly, with something like this at the end of reduce():

String indexIt = parse.getData().getMeta("indexit");
// Note: Boolean.parseBoolean parses the string value itself;
// Boolean.getBoolean would instead look up a system property of
// that name and return false for every document.
if (indexIt != null && !Boolean.parseBoolean(indexIt)) {
  return; // skip indexing this document
}
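
For reference, here is roughly the producing side of this scheme, i.e. how the 'indexit' value could be derived and stored (a sketch with hypothetical names; the exact hook point, e.g. a parse filter writing into the parse metadata that getMeta() later reads, depends on your plugin setup and Nutch version):

import java.util.regex.Pattern;

import org.apache.nutch.metadata.Metadata;

// Hypothetical helper for the custom filter: derive the 'indexit'
// value from the URL and store it in the metadata that the indexer
// later reads via parse.getData().getMeta("indexit").
public class IndexItMarker {

  // Assumed rule: only /products/ pages should be indexed.
  private static final Pattern INDEXABLE =
      Pattern.compile("^http://www\\.text\\.com/products/.*");

  public static void mark(String url, Metadata meta) {
    meta.set("indexit",
        Boolean.toString(INDEXABLE.matcher(url).matches()));
  }
}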

This works, in that the document does not get indexed. However, it gives an IOException:

Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I understand that DeleteDuplicates tries to remove duplicates of some URLs, and there are no documents indexed for these URLs; that is why it throws the exception.

Any suggestions on running it gracefully? I mean, is this the right way of controlling whether a document gets indexed?

I would really appreciate your suggestions.

- BR


ajaxtrend wrote:
Hello Group,
I need to index URLs that match a particular URL pattern, and I have added the pattern to crawl-urlfilter.txt. For example, I want to index all URLs of www.text.com that are under the products subdirectory, so my regex is

+^http://www.text.com/products/.*

urls/my.txt contains the following entry:

http://www.text.com

That means I want to start indexing from the main page of www.text.com. However, Nutch does not index anything, and when I run it, it says:

No URLs to fetch - check your seed list and URL filters.
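
One quick sanity check (a standalone sketch, not Nutch code) is to test whether the seed URL itself passes the filter rule, since a URL that matches no '+' rule in crawl-urlfilter.txt is discarded before fetching, and the rule above only covers pages under /products/:

import java.util.regex.Pattern;

// Standalone check: does each URL match the single '+' rule from
// crawl-urlfilter.txt? The regex URL filter plugin applies rules
// with find(); since this pattern is ^-anchored, matches() would
// behave the same here.
public class FilterCheck {
  public static void main(String[] args) {
    Pattern rule = Pattern.compile("^http://www.text.com/products/.*");
    System.out.println(rule.matcher("http://www.text.com").find());
    // false - the seed itself would be filtered out
    System.out.println(
        rule.matcher("http://www.text.com/products/a.html").find());
    // true
  }
}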
I am sure this must have been answered before. I have already searched the archive but was not able to find any suggestions.
I would really appreciate it if you could offer your valuable suggestions or point me to the classes to look into.

Thanks in advance.

- BR

