Hi,

To find a specific text, subject, or group of terms, you need to
process the document: download the page to your disk, process it, and
then delete or keep it based on your rules. Either way, you still have
to download the page first. This means you will need a lot of disk space
"temporarily" if you are planning to crawl the world :-)
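
As a rough sketch of the keep-or-delete step only (plain Python, not the
Nutch plugin API; the keyword list, the threshold, the function names and
the example URLs are all made up for illustration), the rule could be as
simple as counting topic words in the text of each fetched page:

```python
import re

# Hypothetical topic keywords for a ship/engine search site; adjust to taste.
KEYWORDS = {"ship", "ships", "engine", "engines", "vessel", "propeller"}

def relevance(text, keywords=KEYWORDS):
    """Count how many words in the page text match a topic keyword."""
    # \w+ also matches non-ASCII letters, which matters for .no pages.
    words = re.findall(r"\w+", text.lower())
    return sum(1 for w in words if w in keywords)

def keep_page(text, min_hits=2):
    """Rule: keep the page only if it has at least min_hits keyword matches."""
    return relevance(text) >= min_hits

# Pretend these pages were already fetched to disk (example URLs only).
pages = {
    "http://example.no/a": "Marine diesel engine parts for your ship.",
    "http://example.no/b": "Recipes for traditional fish soup.",
}

# Keep only the pages that pass the rule; the rest would be deleted.
kept = {url: text for url, text in pages.items() if keep_page(text)}
```

In a real crawl you would probably still want to follow the links on a
rejected page before deleting it, otherwise the crawler stops at the first
off-topic page.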

There is a Creative Commons plugin in Nutch (src/plugin/creativecommons) that
does something similar and could be a good starting point. Since you have a lot
of time, it would be best to make the new plugin a bit generic :-) so we can all
enjoy it!

Cheers

On 1/9/07, Tor Harald Thorland <[EMAIL PROTECTED]> wrote:

Hello,

I have a question about Nutch.
I'm a total newbie and am wondering:
Is it possible to set up Nutch to crawl any address it finds, and only
store pages where it finds something about a subject?
I'd like to make a search site for ship/engine related material, and
was thinking of starting with .no domains. (I have lots of time for
this, and the pages I'm looking for do not really get "outdated",
but I don't want to waste a lot of disk space etc. on pages which
don't include what I'm looking for.)

Best Regards
Tor Harald Thorland
