Thank you, sroy. As I wrote to Ken, I don't clearly understand regex in this case. With your suggested regex I now get this error log:

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

I am using nutch-0.9 on Red Hat, and there is no problem with a URL pattern like

+^http://([a-z0-9]*\.)*website.com/known-folder/known-folder/

Any other suggestions?

Regards,
mailusenet

________________________________
From: Subhojit Roy <mails...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 10:13:12
Subject: Re: substitute unknown parts of the url

Hi,

Try the regular expression below.

+^http://([a-z0-9]*\.)*website.com/*[a-z0-9]**/known-folder/

-sroy

On Thu, Nov 19, 2009 at 6:23 AM, Myname To <mailuse...@yahoo.de> wrote:
> hello
>
> can somebody help me with urlfilter. i need to fetch sites with this
> pattern:
>
> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>
> first folder can vary, whereas host name and second folder are known.
>
> how can i substitute unknown parts (folders) of the url?
>
> any help appreciated!
>
> regards
> mailusenet

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in
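
One way to check a filter pattern before running a crawl is to try it in a small Java program, assuming the default urlfilter-regex plugin (which is built on java.util.regex) is the one in use. The sketch below is only an illustration, not a confirmed fix for the "Job failed!" error. The class name UrlPatternSketch, the helpers accepts and compiles, and the candidate pattern that uses [^/]+ for the single unknown folder are all assumptions, as is the use of find() to mimic how a rule is applied. The leading + of a filter rule is the include marker and is not part of the regex itself. The program also reports whether the pattern suggested in this thread compiles at all, since its doubled ** may be rejected by the Java regex engine.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UrlPatternSketch {
    // Hypothetical candidate rule, written without the leading '+' include marker.
    // [^/]+ stands for exactly one path segment whose name is unknown.
    static final Pattern CANDIDATE = Pattern.compile(
        "^http://([a-z0-9]*\\.)*website\\.com/[^/]+/known-folder/");

    // Assumption: a rule applies when the regex is found in the URL,
    // so find() is used here rather than matches().
    static boolean accepts(String url) {
        return CANDIDATE.matcher(url).find();
    }

    // Returns true if java.util.regex can compile the given expression.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The pattern suggested in the thread, minus the leading '+'.
        String suggested =
            "^http://([a-z0-9]*\\.)*website.com/*[a-z0-9]**/known-folder/";
        System.out.println("suggested pattern compiles: " + compiles(suggested));

        // Made-up sample URLs for illustration only.
        String[] samples = {
            "http://www.website.com/abc/known-folder/page.html",
            "http://website.com/x-1/known-folder/",
            "http://www.website.com/known-folder/",
            "http://www.website.com/a/b/known-folder/"
        };
        for (String url : samples) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

If the candidate behaves as wanted on the sample URLs, the corresponding filter rule would be the same expression with a + in front of it. Note that [^/]+ allows only one unknown folder level; deeper nesting would need a different expression.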