In other words, you want to crawl the whole site but index only some pages?

To be honest, this is something I would like to do as well. I haven't
finished checking it yet, but it seems you can write an IndexingFilter
that throws an exception if the page shouldn't be indexed. Unfortunately
you cannot return null, because that causes a NullPointerException.
Throwing the exception produces a warning in the log, which may flood
the log if you have a large site.
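
As a rough illustration of the idea above, here is a minimal Java sketch
of the decision logic only. It deliberately does not implement the Nutch
IndexingFilter interface: the class IndexDecisionSketch, the
SkipPageException type, and the ".html" rule are all hypothetical names
chosen for the example, and the real interface and its exception type
should be checked against the Nutch version in use.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the decision logic only. This is NOT the Nutch
// IndexingFilter interface; check the real signature in your Nutch version.
class IndexDecisionSketch {

    // Hypothetical exception type that signals "do not index this page".
    static class SkipPageException extends RuntimeException {
        SkipPageException(String message) {
            super(message);
        }
    }

    // Assumed rule for illustration: index only http URLs ending in ".html".
    private static final Pattern INDEXABLE = Pattern.compile("^http://.*\\.html$");

    // True if the whole URL matches the indexing rule.
    static boolean shouldIndex(String url) {
        return INDEXABLE.matcher(url).matches();
    }

    // Returns the URL unchanged if it should be indexed, otherwise throws.
    static String filter(String url) {
        if (!shouldIndex(url)) {
            throw new SkipPageException("not indexing " + url);
        }
        return url;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/page.html",
            "http://www.example.com/category/"
        };
        for (String url : urls) {
            try {
                System.out.println("index: " + filter(url));
            } catch (SkipPageException e) {
                // One caught exception per skipped page: on a large site this
                // is where the log volume mentioned above would come from.
                System.out.println("skip:  " + url);
            }
        }
    }
}
```

Keeping the rule in a separate shouldIndex method makes it easy to test
the pattern on its own before wiring it into a real filter.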

I hope it helps,
Marcin Okraszewski


On 5/5/07, simon_ece <[EMAIL PROTECTED]> wrote:

hi, thanks for the reply,

This is the content of my conf/crawl-urlfilter.txt file:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*example.com/



# skip everything else
-.
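
For readers following the rules above: the comments in the stock filter
file say that the first matching pattern decides whether a URL is
included or ignored, and that a URL matching no pattern is ignored. The
Java sketch below imitates that evaluation order so the rules can be
tried against sample URLs. FirstMatchFilterSketch is a hypothetical
class written for this example, not the Nutch implementation, and it
uses only a shortened subset of the rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical imitation of first-match-wins rule evaluation.
class FirstMatchFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each non-blank, non-comment line is '+' or '-' followed by a regex.
    FirstMatchFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose regex is found anywhere in the URL decides.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    // Shortened subset of the rules from the message above.
    static FirstMatchFilterSketch threadRules() {
        return new FirstMatchFilterSketch(
            "# skip file:, ftp:, & mailto: urls",
            "-^(file|ftp|mailto):",
            "-\\.(gif|jpg|png|css|zip)$",
            "+^http://([a-z0-9]*\\.)*example.com/",
            "-.");
    }

    public static void main(String[] args) {
        FirstMatchFilterSketch filter = threadRules();
        System.out.println(filter.accepts("http://www.example.com/red/index.html"));
        System.out.println(filter.accepts("http://www.example.com/logo.gif"));
        System.out.println(filter.accepts("http://other.org/"));
    }
}
```

Because the accept rule covers every page on the host, every page that
survives the earlier skip rules is fetched, which is consistent with the
whole site being crawled.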

It crawls the whole site, and I can see all the related matches when
searching, but I need to filter out some of the pages.
For example, if I search for some category (red), the results list all
the links, but I want to show only the links that match this regular
expression:

^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
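
A quick way to see what this expression accepts is to run it through
java.util.regex, as in the sketch below. The sample URLs are invented
for illustration and are not taken from the real site. Two things are
worth checking against the actual URLs: in Java regex syntax `\(` and
`\)` match literal parentheses rather than forming a group, and the
host part `([a-z0-9]*\.)` has no trailing `*`, so it requires exactly
one label and a dot before example.com.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for trying the posted pattern on sample URLs.
class ThreadPatternCheck {

    // The pattern as posted, written as a Java string literal
    // (each backslash is doubled for the string escape).
    static final Pattern POSTED = Pattern.compile(
        "^http://([a-z0-9]*\\.)example.com/([a-zA-Z]*)-\\([a-z0-9]*\\)-.*-"
        + "\\([0-9]*-[A-Za-z0-9]*\\)\\.html$");

    // True if the whole URL matches the posted pattern.
    static boolean matches(String url) {
        return POSTED.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Invented sample URLs, for illustration only.
        String[] urls = {
            "http://www.example.com/red-(abc1)-shirt-(123-Xy9).html", // literal parentheses
            "http://www.example.com/red-abc1-shirt-123-Xy9.html",     // no parentheses
            "http://example.com/red-(abc1)-shirt-(123-Xy9).html"      // no host label before example.com
        };
        for (String url : urls) {
            System.out.println(matches(url) + "  " + url);
        }
    }
}
```

With these samples only the first URL matches. If the real URLs contain
no parentheses, the backslashes before the parentheses would need to be
removed for the pattern to match them.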

Kindly post your suggestions.
Regards,
Simon
__________________________________________________________________

Marcin Okraszewski wrote:
>
> How about conf/crawl-urlfilter.txt?
>
> Marcin
>
> On 5/4/07, simon_ece <[EMAIL PROTECTED]> wrote:
>>
>> hi all,
>> I am new to Nutch. I would like to crawl a particular site and get only
>> the results that match the following pattern. I don't want to list other
>> URLs from the crawled site.
>>
>> Site to be crawled, e.g.: www.example.com
>>
>> ^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
>>
>> I can crawl the site and get all the matching URLs,
>> but I don't know how to filter the URLs and get only those particular ones.
>> Kindly post your suggestions.
>> Thanks & Regards
>> Simon
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>


