Hi,
To find a specific text, subject, or group of texts, you have to process the document. That is, you need to download the page to your disk, process it, and then delete or keep it based on your rules. Either way, the page still has to be downloaded, so you will need a lot of disk space "temporarily" if you are planning to crawl the world :-)
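To make the download, process, keep-or-delete idea a bit more concrete, here is a rough sketch in Python. It is not Nutch code and not how any Nutch plugin works; it only illustrates the filtering rule. The keyword list, the `min_hits` threshold, and the function names (`strip_tags`, `is_relevant`, `filter_pages`) are all made up for illustration, and the pages are assumed to be downloaded already.

```python
import re

# Hypothetical keyword list for a ship/engine topic (assumption, tune to taste).
# "skip" and "motor" are Norwegian for "ship" and "engine".
KEYWORDS = {"ship", "engine", "vessel", "propeller", "skip", "motor"}


def strip_tags(html: str) -> str:
    """Crudely remove script/style blocks and markup, leaving visible text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)


def is_relevant(html: str, keywords=KEYWORDS, min_hits: int = 2) -> bool:
    """Return True if the visible text contains at least min_hits keyword occurrences."""
    words = re.findall(r"\w+", strip_tags(html).lower())
    hits = sum(1 for word in words if word in keywords)
    return hits >= min_hits


def filter_pages(pages: dict) -> dict:
    """pages maps URL -> already-downloaded HTML; keep only the relevant entries."""
    return {url: html for url, html in pages.items() if is_relevant(html)}
```

Note that every page still has to be fetched before `is_relevant` can look at it; the rule only decides what stays on disk afterwards.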
There is a Creative Commons plugin in Nutch, under src/plugin/creativecommons, which does something similar and could be a good starting point. Since you have lots of time, it would be best to make the new plugin a bit generic :-) so we can all enjoy it!

Cheers

On 1/9/07, Tor Harald Thorland <[EMAIL PROTECTED]> wrote:
Hello,
I have a question about Nutch. I'm a total newbie and am wondering: is it possible to set up Nutch to crawl any address it finds, and only store pages where it finds something about a given subject? I'd like to make a search site for ship/engine-related material, and was thinking of starting with .no domains. (I have lots of time for this, and the pages I'm looking for don't really get "outdated", but I don't want to waste a lot of disk space etc. on pages which don't include what I'm looking for.)

Best Regards
Tor Harald Thorland
