On 9/21/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Benjamin Higgins wrote:
> How can I instruct Nutch to refetch specific files and then update the
> index entries for those files?
>
> I am indexing files on a fileserver and I am able to produce a report of
> changed files about every 30 minutes.
>
> I'd like to feed that into Nutch at approximately the same interval so
> I can keep the index up-to-date.
>
> Thanks.

Conceptually this should be easy - you just need to generate a fetchlist
directly from your list of changed files, and not through
injecting/generating from a crawldb.

I wrote a tool for 0.7 which does this - look at the NUTCH-68 issue in
JIRA. This would have to be ported to 0.8 - check how Injector does this
in the first stage, when it converts a simple text file to a MapFile.
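
A rough sketch of that first stage, written in Python rather than Nutch's Java: it turns a plain text list of URLs into sorted key/value entries. MapFile expects keys to be appended in sorted order, hence the sort. The name `read_fetchlist` and the placeholder value are hypothetical and are not part of the Nutch or Hadoop API.

```python
def read_fetchlist(lines):
    """Turn lines of a text file (one URL per line) into sorted (url, datum) pairs.

    The datum is a hypothetical placeholder dict; a real port would
    write Nutch's own entry type instead.
    """
    urls = set()
    for line in lines:
        url = line.strip()
        if url:                 # ignore blank lines
            urls.add(url)       # the set drops duplicate URLs
    # Sorted keys, because a MapFile must be written in key order.
    return [(url, {"status": "unfetched"}) for url in sorted(urls)]
```

Calling `read_fetchlist(open("changed.txt"))` on the 30-minute report (the file name is only an example) would give the entries in the order a MapFile writer needs them.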

Would an algorithm like this make any sense:
for each URL in txt file
 if URL in crawldb
   update the date to "now()+1" in its crawl datum
 else
   use existing inject logic to inject the new url

After that, it's only a matter of running the recrawl script with -adddays 0.
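
To make the proposal concrete, here is a minimal sketch of the loop in Python, with the crawldb modeled as a plain dict. None of this is Nutch's API: `CrawlDatum`, `mark_for_refetch`, `due_urls` and the field names are hypothetical stand-ins, and "now()+1" is read here as one second past the current time, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CrawlDatum:
    """Hypothetical stand-in for a crawldb entry; not Nutch's class."""
    next_fetch: float          # epoch seconds at which the URL is due
    status: str = "fetched"    # "injected" for URLs never fetched before


def mark_for_refetch(crawldb, changed_urls, now):
    """Apply the proposed loop to an in-memory crawldb (a dict keyed by URL)."""
    for url in changed_urls:
        url = url.strip()
        if not url:
            continue                      # skip blank lines in the report
        if url in crawldb:
            # Known URL: pull its due date forward to "now() + 1".
            crawldb[url].next_fetch = now + 1
        else:
            # Unknown URL: stand-in for the existing inject logic.
            crawldb[url] = CrawlDatum(next_fetch=now, status="injected")
    return crawldb


def due_urls(crawldb, at):
    """URLs that a generate step would select if it took everything due by `at`."""
    return sorted(u for u, d in crawldb.items() if d.next_fetch <= at)
```

In this model, entries that are not in the changed-files report keep their original due date, so only the changed and newly injected URLs show up in `due_urls` on the next run.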

t.n.a.
