Now I realized that if the index contains no documents but there are URLs in the DB, 
then it generates an error while removing duplicates. To get rid of the error, I 
did a hack in the dedup() method of the DeleteDuplicates class:
   
  // Removing duplicates
  try {
    JobClient.runJob(job);
  } catch (Exception e) {
    LOG.info("Dedup: Error occurred: " + e.getMessage());
  }
   
  This solves my problem.
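An alternative to catching every exception would be to skip the dedup step when there is nothing indexed yet. A rough file-level sketch (the directory name "indexes" is an assumption about the crawl layout; a proper check would go through the Lucene index reader instead):

```java
import java.io.File;

public class DedupGuard {
    // Returns true if the index directory exists and contains at least one
    // entry, i.e. there is plausibly something to deduplicate.
    public static boolean hasIndexedDocs(File indexDir) {
        if (!indexDir.isDirectory()) {
            return false;
        }
        String[] entries = indexDir.list();
        return entries != null && entries.length > 0;
    }
}
```

Calling something like hasIndexedDocs(new File(crawlDir, "indexes")) before JobClient.runJob(job) would make the skip explicit instead of relying on the failed job's exception.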
   
  -BR
  

ajaxtrend <[EMAIL PROTECTED]> wrote:
I tried to add a meta-tag in my customized indexing filter. Based on the URL 
pattern, I added a meta-tag called 'indexit' with value true or false. 

In the indexer's reduce() method, I check this meta-tag and index the document 
or not accordingly; something like this at the end of reduce():

String indexIt = parse.getData().getMeta("indexit");
if (indexIt != null) {
  // Boolean.parseBoolean parses the string itself; Boolean.getBoolean would
  // instead look up a JVM system property of that name.
  if (!Boolean.parseBoolean(indexIt)) {
    return;
  }
}
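A related pitfall: Boolean.getBoolean(name) does not parse its argument string; it looks up the JVM system property named `name`, which is normally unset. Boolean.parseBoolean(s) is the method that actually parses the string, which matters for a meta-tag value like "true". A minimal demonstration:

```java
public class BooleanParsing {
    public static void main(String[] args) {
        // getBoolean("true") checks for a system property named "true",
        // which does not exist, so this prints false even for the string "true".
        System.out.println(Boolean.getBoolean("true"));    // false
        // parseBoolean actually parses the string value.
        System.out.println(Boolean.parseBoolean("true"));  // true
        System.out.println(Boolean.parseBoolean("maybe")); // false
    }
}
```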

This works, in that the document does not get indexed. However, it throws an IOException:

Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I understand that DeleteDuplicates tries to remove duplicates of some URLs, and 
there are no documents indexed for those URLs; that's why it throws the exception.

Any suggestion on how to run it gracefully? I mean, is this the right way of 
controlling whether a document gets indexed?

I would really appreciate your suggestions.

- BR


ajaxtrend wrote:
Hello Group,
I need to index URLs that match a particular URL pattern, and I have added the 
pattern to crawl-urlfilter.txt. For example, I want to index all URLs of www.test.com 
that are under the products subdirectory, so my regex is

+^http://www.text.com/products/.*

urls/my.txt contains the following entry:

http://www.text.com

That means I want to start crawling from the main page of www.text.com. However, 
Nutch does not index anything, and when I run it, it says:

No URLs to fetch - check your seed list and URL filters.
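For what it's worth, this symptom is consistent with the URL filter rejecting the seed itself: the seed http://www.text.com is the site root, not a /products/ page, so a pattern anchored at /products/ drops it and nothing is left to fetch. (Note also that the prose says www.test.com while the regex and seed say www.text.com; if that mismatch exists in the real config it would have the same effect.) A quick check with java.util.regex, using the crawl-urlfilter line minus its leading '+':

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // The include pattern from crawl-urlfilter.txt, without the leading '+'.
        Pattern include = Pattern.compile("^http://www.text.com/products/.*");

        // The seed URL does not match the products-only pattern...
        System.out.println(include.matcher("http://www.text.com").matches());
        // ...while an actual products page does.
        System.out.println(include.matcher("http://www.text.com/products/item1.html").matches());
    }
}
```

One common way around this (an assumption about intent, not Nutch-specific advice) is to either seed the crawl directly with a /products/ URL, or add a filter line that also admits the entry pages leading into /products/.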
I am sure this must have been answered before. I have already searched the archive 
but was not able to find any suggestion. 
I would really appreciate it if you could offer your suggestions or let me 
know which classes to look into.

Thanks in advance.

- BR


