Thank you for the regex explanation. My folder name doesn't contain special characters. I will look up more details about URL regexes and crawling. The first time I used Nutch 1.0 I had problems with plugins, so I switched to 0.9.
Regards,
mailusenet

________________________________
From: Subhojit Roy <mails...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Thursday, 19 November 2009, 16:05:55
Subject: Re: AW: substitute unknown parts of the url

Yes, [a-zA-Z]* will not match names that contain special characters such as -, !, @ etc. The other possibility is to try .* where . represents any character (including special characters).

Interestingly, when we tried the [a-zA-Z]* pattern with Nutch 1.0, it worked for us.

-sroy

On Thu, Nov 19, 2009 at 7:58 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

> On Nov 19, 2009, at 2:15am, Myname To wrote:
>
>> Ken, thank you for answering my question.
>>
>> I tried [^/]+ for the unknown part of the URL, but unfortunately I get
>> this log:
>> ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>>
>> I tried this and other patterns:
>>
>> http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
>> http://([a-z0-9]*\.)*website.com(/*)(/known-folder)
>>
>> Actually, I don't really understand how to use the special characters
>> in this case, e.g. which part to parenthesize, or when I have to use
>> asterisk *, plus +, or backslash followed by a dot \. and so on.
>
> You'll need to understand regular expressions if you plan to modify the
> URL filter patterns.
>
>> If the unknown part of the path has a name, isn't it better to use
>> something like [a-zA-Z], or do I have to add other chars in [^/]+ ?
>
> [^/]+ says to match one or more characters which are not equal to '/'.
> So that will match anything, versus the more explicit [a-zA-Z]+, which
> wouldn't match (for example) "some-folder".
>
> -- Ken
>
>> From: Ken Krugler <kkrugler_li...@transpac.com>
>> To: nutch-user@lucene.apache.org
>> Sent: Thursday, 19 November 2009, 2:06:53
>> Subject: Re: substitute unknown parts of the url
>>
>> On Nov 18, 2009, at 4:53pm, Myname To wrote:
>>
>>> Hello,
>>>
>>> can somebody help me with urlfilter? I need to fetch sites with this
>>> pattern:
>>>
>>> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>>>
>>> The first folder can vary, whereas the host name and second folder
>>> are known.
>>>
>>> How can I substitute unknown parts (folders) of the URL?
>>
>> Something like...
>>
>> http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
>>
>> -- Ken
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c w e b m i n i n g

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in
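
For anyone reading this thread later: the difference between the character classes discussed above can be checked with a small script. This is only a sketch in Python, not Nutch itself (Nutch's URL filter uses Java regular expressions, but these particular patterns should behave the same way in both). The helper name url_matches and the sample URLs are made up for illustration, and the [a-zA-Z]+ pattern is simply Ken's pattern with the character class swapped.

```python
import re

# Patterns copied from the thread. The dot in "website.com" is left
# unescaped as in the original, so it matches any single character.
ANY_SEGMENT = r"http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/"
ATTEMPT_1 = r"http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/"
ATTEMPT_2 = r"http://([a-z0-9]*\.)*website.com(/*)(/known-folder)"

# Assumption: Ken's pattern with [a-zA-Z]+ in place of [^/]+.
LETTERS_ONLY = r"http://([a-z0-9]*\.)*website.com/[a-zA-Z]+/known-folder/"


def url_matches(pattern: str, url: str) -> bool:
    """Hypothetical helper: True if the pattern matches at the start of url."""
    return re.match(pattern, url) is not None


if __name__ == "__main__":
    urls = [
        "http://www.website.com/abc/known-folder/",          # letters only
        "http://www.website.com/some-folder/known-folder/",  # contains '-'
    ]
    for url in urls:
        for name, pattern in [
            ("[^/]+", ANY_SEGMENT),
            ("[a-zA-Z]+", LETTERS_ONLY),
            ("attempt 1", ATTEMPT_1),
            ("attempt 2", ATTEMPT_2),
        ]:
            print(f"{name:10} {url_matches(pattern, url)!s:5} {url}")
```

The two earlier attempts fail on both sample URLs for different reasons: in the first, "com+" is followed directly by [a-zA-Z]+, so there is no slash between the host and the unknown folder; in the second, (/*) only allows slashes between the host and /known-folder, so no folder name can appear there.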