On Nov 19, 2009, at 2:15am, Myname To wrote:

Ken, thank you for answering my question.

I tried [^/]+ for the unknown part of the URL, but unfortunately I get this log:
...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

I also tried these and other patterns:

http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
http://([a-z0-9]*\.)*website.com(/*)(/known-folder)

Actually, I don't really understand how to use the special characters in this case, e.g. which part to put in parentheses, or when I have to use an asterisk *, a plus +, or a backslash followed by a dot \. and so on.

You'll need to understand regular expressions if you plan to modify the URL filter patterns.
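As a rough illustration of the characters asked about above, here is a minimal sketch in Python. It assumes Python's re module behaves the same as the Java regular expressions used by Nutch for these simple constructs; the helper name full and the test strings are made up for this example.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# '*' means zero or more of the preceding item; '+' means one or more.
print(full(r"ab*", "a"))   # True: 'b' may be absent
print(full(r"ab+", "a"))   # False: at least one 'b' is required

# '\.' is a literal dot; a bare '.' matches any single character.
print(full(r"website\.com", "websiteXcom"))  # False
print(full(r"website.com", "websiteXcom"))   # True

# Parentheses group a sub-pattern so that a quantifier applies to all of it.
print(full(r"([a-z0-9]*\.)*website\.com", "www.shop.website.com"))  # True

# The two attempted patterns, tried on a hypothetical URL:
url = "http://www.website.com/abc/known-folder/"
attempt_1 = r"http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/"
attempt_2 = r"http://([a-z0-9]*\.)*website.com(/*)(/known-folder)"
# In attempt_1, 'com+' only repeats the final 'm', and no '/' is allowed
# between the host and the letters. In attempt_2, '(/*)' matches slashes
# only, never a folder name.
print(re.match(attempt_1, url) is not None)  # False
print(re.match(attempt_2, url) is not None)  # False
```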

If the unknown part of the path is a name, isn't it better to use something like [a-zA-Z], or do I have to add other characters to [^/]+ ?

[^/]+ says to match one or more characters which are not equal to '/'. So it will match any single path segment, versus the more explicit [a-zA-Z]+, which wouldn't match (for example) "some-folder".
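The difference between the two character classes can be checked with a short sketch. Python's re module is used here as a stand-in for the Java regex engine, on the assumption that these classes behave identically in both; the segment names and the helper segment_matches are invented for illustration.

```python
import re

ANY_SEGMENT = re.compile(r"[^/]+")       # one or more characters that are not '/'
LETTERS_ONLY = re.compile(r"[a-zA-Z]+")  # one or more ASCII letters

def segment_matches(pattern, segment):
    """Return True when the whole segment matches the compiled pattern."""
    return pattern.fullmatch(segment) is not None

# Hypothetical folder names; "a/b" contains a slash, so it is two segments.
for segment in ["somefolder", "some-folder", "folder2", "a/b"]:
    print(segment,
          segment_matches(ANY_SEGMENT, segment),
          segment_matches(LETTERS_ONLY, segment))
```

Only "somefolder" passes both checks. The hyphen and the digit are rejected by [a-zA-Z]+, and "a/b" is rejected by both because neither class allows a slash.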

-- Ken



From: Ken Krugler <kkrugler_li...@transpac.com>
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 2:06:53 AM
Subject: Re: substitute unknown parts of the url


On Nov 18, 2009, at 4:53pm, Myname To wrote:

Hello,

Can somebody help me with the URL filter? I need to fetch sites with this
pattern:

http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/

The first folder can vary, whereas the host name and the second folder are known.

How can I substitute the unknown parts (folders) of the URL?

Something like...

http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
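One way to try such a pattern before putting it into the filter file is to run it against a few sample URLs. The sketch below uses Python's re module as a stand-in for the Java regex engine and assumes the pattern only has to match from the start of the URL; the sample URLs and the helper accepted are hypothetical.

```python
import re

# The suggested pattern, copied unchanged from the message above.
PATTERN = re.compile(r"http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/")

# Hypothetical sample URLs, made up for illustration.
SAMPLES = [
    "http://www.website.com/some-folder/known-folder/",
    "http://website.com/abc/known-folder/page.html",
    "http://www.website.com/known-folder/",
    "http://www.website.com/a/b/known-folder/",
]

def accepted(url):
    """Return True when the URL matches the pattern from its first character."""
    return PATTERN.match(url) is not None

for url in SAMPLES:
    print(accepted(url), url)
```

The first two URLs are accepted. The third has no folder in front of known-folder, and the fourth has two, so [^/]+ (exactly one path segment) rejects both.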

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





