On Nov 19, 2009, at 2:15am, Myname To wrote:
Ken, thank you for answering my question.
I tried [^/]+ for the unknown part of the URL, but unfortunately I get
this log:
...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
I also tried these and other patterns:
http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
http://([a-z0-9]*\.)*website.com(/*)(/known-folder)
Actually, I don't really understand how to use the special characters in
this case, e.g. which part to put in parentheses, or when I have to use
an asterisk *, a plus +, or a backslash followed by a dot \. and so on.
You'll need to understand regular expressions if you plan to modify
the URL filter patterns.
If the unknown part of the path is a name, isn't it better to use
something like [a-zA-Z], or do I have to add other characters to [^/]+ ?
[^/]+ says to match one or more characters which are not equal to '/'.
So that will match any folder name, versus the more explicit [a-zA-Z]+,
which wouldn't match (for example) "some-folder".
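To make the difference concrete, here is a minimal sketch in Python. Nutch itself uses Java regular expressions, so Python's re module is only a stand-in here, but these two character classes behave the same way in both. The folder names are made up for illustration.

```python
import re

# Hypothetical folder names, chosen only for illustration.
segments = ["folder", "some-folder", "folder2", "a/b"]

any_but_slash = re.compile(r"[^/]+")    # one or more characters that are not '/'
letters_only = re.compile(r"[a-zA-Z]+")  # one or more ASCII letters

def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None

# Map each folder name to (matches [^/]+, matches [a-zA-Z]+).
results = {
    s: (whole_match(any_but_slash, s), whole_match(letters_only, s))
    for s in segments
}

for name, (broad, strict) in results.items():
    print(f"{name!r}: [^/]+ -> {broad}, [a-zA-Z]+ -> {strict}")
```

Only "folder" passes both classes. The hyphen in "some-folder" and the digit in "folder2" are rejected by [a-zA-Z]+, and "a/b" is rejected by both because it contains a slash.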
-- Ken
From: Ken Krugler <kkrugler_li...@transpac.com>
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 2:06:53
Subject: Re: substitute unknown parts of the url
On Nov 18, 2009, at 4:53pm, Myname To wrote:
Hello,
can somebody help me with the URL filter? I need to fetch sites with
this pattern:
http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
The first folder can vary, whereas the host name and second folder are known.
How can I substitute the unknown parts (folders) of the URL?
Something like...
http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
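One way to sanity-check a pattern like this before putting it in the filter file is to try it against a few sample URLs. The sketch below uses Python's re module as a stand-in for the Java regex engine, and assumes the filter accepts a URL when the pattern is found anywhere in it (hence re.search). The sample URLs and the accepts helper are hypothetical.

```python
import re

# The pattern suggested above, copied as-is. Note that the '.' in
# "website.com" is unescaped, so it matches any single character.
PATTERN = re.compile(r"http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/")

def accepts(url):
    """Return True if the pattern is found in the URL (assumed filter behavior)."""
    return PATTERN.search(url) is not None

# Hypothetical sample URLs for illustration.
samples = [
    "http://www.website.com/abc/known-folder/",
    "http://website.com/some-folder/known-folder/page.html",
    "http://www.website.com/known-folder/",      # no folder before known-folder
    "http://www.website.com/a/b/known-folder/",  # two folders before known-folder
]

for url in samples:
    print(url, "->", accepts(url))
```

The first two URLs are accepted. The last two are rejected because [^/]+ stands for exactly one path segment, no more and no fewer.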
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g