Hi Karl,

I need to crawl a sequence of (different) URLs from the same host, and each
URL defines the next one to be crawled; I can crawl the next URL only after a
specified amount of time. Of course I can call Thread.sleep() before calling
activities.addDocumentReference(newUrl), but that seems too naïve...
This use case is also very similar to a generic Web crawl (where we need to be
polite, with a 2-3 second delay before recrawling the same domain).
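
To illustrate the alternative to sleeping in the worker thread, here is a minimal sketch of a queue that stores each URL together with the earliest time it may be fetched. It is a stand-alone illustration, not ManifoldCF code: the class name PoliteQueue and its methods are hypothetical, and the caller supplies the current time explicitly so the behaviour is easy to check.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical stand-alone helper; not part of ManifoldCF. */
final class PoliteQueue {
    /** A URL paired with the earliest time (epoch millis) it may be fetched. */
    private static final class Entry {
        final String url;
        final long fetchAtMillis;

        Entry(String url, long fetchAtMillis) {
            this.url = url;
            this.fetchAtMillis = fetchAtMillis;
        }
    }

    // Entries are ordered by fetch time, so the head is always the next URL to become due.
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.fetchAtMillis));

    /** Queue a URL that must not be fetched before fetchAtMillis. */
    void schedule(String url, long fetchAtMillis) {
        queue.add(new Entry(url, fetchAtMillis));
    }

    /** Return the next URL whose time has come, or null if nothing is due yet. */
    String pollDue(long nowMillis) {
        Entry head = queue.peek();
        if (head == null || head.fetchAtMillis > nowMillis) {
            return null;
        }
        return queue.poll().url;
    }
}
```

A worker would call schedule(newUrl, System.currentTimeMillis() + delayMillis) after parsing a document and poll with the current time, so no thread is blocked while waiting out the delay.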


-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com] 
Sent: April-05-11 11:06 AM
To: connectors-user@incubator.apache.org
Subject: Re: How to add tast to queue dynamically (WebCrawler)

If you are trying to control the schedule for the FIRST time a document is
fetched, the IProcessActivity API doesn't permit that at this time.  You
would need to add a new version of
addDocumentReference() to the IProcessActivity interface, one that allows you
to set the scheduled processing time in addition to everything else.  The
internals for such a change should be straightforward since all the moving
parts are already there.
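
As a rough sketch of the shape such an addition could take, the following uses a simplified, hypothetical interface. SchedulingActivity and InMemoryActivity are invented for illustration and are not the real IProcessActivity; only the one-argument addDocumentReference call mirrors the usage quoted in this thread, and the second parameter is an assumption about what "scheduled processing time" might look like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in; NOT the real ManifoldCF IProcessActivity. */
interface SchedulingActivity {
    /** Existing-style call: queue the document with no explicit schedule. */
    void addDocumentReference(String documentIdentifier);

    /** Assumed overload: also pass the earliest processing time (epoch millis). */
    void addDocumentReference(String documentIdentifier, long earliestProcessTimeMillis);
}

/** Toy in-memory implementation used only to show the intended semantics. */
final class InMemoryActivity implements SchedulingActivity {
    static final long AS_SOON_AS_POSSIBLE = 0L;

    private final Map<String, Long> earliestTimes = new LinkedHashMap<>();

    @Override
    public void addDocumentReference(String documentIdentifier) {
        // Without a schedule, the document is eligible immediately.
        addDocumentReference(documentIdentifier, AS_SOON_AS_POSSIBLE);
    }

    @Override
    public void addDocumentReference(String documentIdentifier, long earliestProcessTimeMillis) {
        earliestTimes.put(documentIdentifier, earliestProcessTimeMillis);
    }

    /** True if the document is queued and its earliest processing time has passed. */
    boolean isDue(String documentIdentifier, long nowMillis) {
        Long earliest = earliestTimes.get(documentIdentifier);
        return earliest != null && earliest <= nowMillis;
    }
}
```

Keeping the existing call as a thin wrapper over the new one means current connectors would not need to change.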

I'm curious, however, about your use case.  It is currently unheard of for
connectors to try to control the scheduling of all documents being fetched -
this would interfere with ManifoldCF's scheduling algorithms, which are
designed for maximum throughput.  I'd like to be sure your design makes
sense before I agree that this is a reasonable addition to the API.  Can you
explain the connector and its design so that I can see what you are trying
to accomplish?

Thanks!
Karl

On Tue, Apr 5, 2011 at 10:51 AM, Fuad Efendi <f...@efendi.ca> wrote:
>
> Hi Karl,
>
> So this is "retry"... can we schedule document retrieval? I retrieve 
> XML, generate a new URL, and I want to schedule this new document to be 
> retrieved at a specific time. -Fuad
>
>
