thank you for the regex explanation. my folder name doesn't have special characters.
i will look for more details about url regexes and crawling.
the first time i used nutch-1.0 i had problems with plugins, so i switched to 0.9.

regards,
mailusenet




________________________________
From: Subhojit Roy <mails...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Thursday, November 19, 2009, 16:05:55
Subject: Re: AW: substitute unknown parts of the url

Yes, [a-zA-Z]* will not match names that contain special characters
such as -, ! or @. The other possibility is to try .* where . matches
any character (including special characters).
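
To make the difference between these character classes concrete, here is a
minimal sketch that checks each one against a few folder names. Nutch itself
evaluates Java regular expressions; Python's re module is used here only as a
stand-in, on the assumption that these simple constructs behave the same way
in both. The folder names and the helper matches_whole are made up for
illustration.

```python
import re

# Candidate patterns for one unknown path segment (folder name).
SEGMENT_PATTERNS = {
    "letters_only": r"[a-zA-Z]*",  # ASCII letters only
    "no_slash": r"[^/]+",          # one or more characters that are not '/'
    "anything": r".*",             # any characters, including '/'
}

# Hypothetical folder names, chosen only for illustration.
FOLDER_NAMES = ["folder", "some-folder", "a/b"]


def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


for name, pattern in SEGMENT_PATTERNS.items():
    results = {folder: matches_whole(pattern, folder) for folder in FOLDER_NAMES}
    print(name, results)
```

Note that .* also accepts "a/b", so inside a URL pattern it can span more
than one folder, whereas [^/]+ stops at the next slash.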

Interestingly, when we tried the [a-zA-Z]* pattern with Nutch 1.0, it
worked for us.

-sroy

On Thu, Nov 19, 2009 at 7:58 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

>
> On Nov 19, 2009, at 2:15am, Myname To wrote:
>
>  Ken, thank you for answering my question.
>>
>> i tried [^/]+ for the unknown part of the url, but unfortunately i get
>> this log:
>> ...
>> Stopping at depth=0 - no more URLs to fetch.
>> No URLs to fetch - check your seed list and URL filters.
>> crawl finished: crawl
>>
>> i tried these and other patterns:
>>
>> http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/
>> http://([a-z0-9]*\.)*website.com(/*)(/known-folder)
>>
>> actually i don't really understand how to use the special characters in
>> this case, e.g. which part to parenthesize, or when i have to use an
>> asterisk *, a plus +, or a backslash followed by a dot \. and so on ..
>>
>
> You'll need to understand regular expressions if you plan to modify the URL
> filter patterns.
>
>
>  if the unknown part of the path has a name, isn't it better to use something
>> like [a-zA-Z], or do i have to add other chars to [^/]+ ?
>>
>
> [^/]+ says to match one or more characters which are not equal to '/'. So
> that will match any folder name, versus the more explicit [a-zA-Z]+, which
> wouldn't match (for example) "some-folder".
>
> -- Ken
>
>
>
>
>  From: Ken Krugler <kkrugler_li...@transpac.com>
>> To: nutch-user@lucene.apache.org
>> Sent: Thursday, November 19, 2009, 2:06:53
>> Subject: Re: substitute unknown parts of the url
>>
>>
>> On Nov 18, 2009, at 4:53pm, Myname To wrote:
>>
>>  hello
>>>
>>> can somebody help me with the url filter? i need to fetch sites with this
>>> pattern:
>>>
>>> http://([a-z0-9]*\.)*website.com/unknown-folder/known-folder/
>>>
>>> first folder can vary, whereas host name and second folder are known.
>>>
>>> how can i substitute unknown parts (folders) of the url?
>>>
>>
>> Something like...
>>
>> http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/
>>
>> -- Ken
>>
>> --------------------------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>
>> __________________________________________________
>> Do You Yahoo!?
>> Tired of spam? Yahoo! Mail offers outstanding protection against
>> bulk mail.
>> http://mail.yahoo.com
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
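
The patterns discussed in the quoted thread can be compared against a sample
URL with a short sketch like the one below. It is only an illustration under
stated assumptions: Python's re module stands in for the Java regular
expressions Nutch uses, the URL and the helper accepts are made up, and the
patterns are copied from the thread without the leading + or - that a Nutch
URL filter rule carries.

```python
import re

# Pattern suggested in the thread: any single folder, then the known folder.
SUGGESTED = r"http://([a-z0-9]*\.)*website.com/[^/]+/known-folder/"

# Same idea with the stricter letters-only class for the unknown folder.
LETTERS_ONLY = r"http://([a-z0-9]*\.)*website.com/[a-zA-Z]+/known-folder/"

# The two attempts quoted in the thread.
ATTEMPT_1 = r"http://([a-z0-9]*\.)*website.com+[a-zA-Z]+/known-folder/"
ATTEMPT_2 = r"http://([a-z0-9]*\.)*website.com(/*)(/known-folder)"

# Hypothetical URL, made up for illustration.
URL = "http://www.website.com/some-folder/known-folder/"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern matches starting at the beginning of the URL."""
    return re.match(pattern, url) is not None


for label, pattern in [
    ("suggested", SUGGESTED),
    ("letters_only", LETTERS_ONLY),
    ("attempt_1", ATTEMPT_1),
    ("attempt_2", ATTEMPT_2),
]:
    print(label, accepts(pattern, URL))
```

In the first attempt, com+ repeats only the letter m and no slash separates
the host from the folder, so [a-zA-Z]+ has nothing to match before
/known-folder/. In the second, (/*) matches only slashes, so the unknown
folder name is never consumed.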


-- 
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: s...@profound.in
http://www.profound.in

