RE: How to add tast to queue dynamically (WebCrawler)

Fuad Efendi Tue, 05 Apr 2011 09:24:34 -0700

Hi Karl,

Yes, for politeness; the RSS and Web connectors seem extremely rich (and hard
to use as a basic sample).
Thread.sleep() is fine... but since this runs in a web container (externally
managed thread pool) it's unsafe (especially since I don't know many of the
details yet); JEE strongly discourages thread-level programming; and Java 6
has new features which we can use...

I found an easy temporary solution: a static "lastCrawlAttempt" field, checked
against the current time in the processDocuments() method body (which simply
returns if it is too early), together with explicit scheduling:
activities.setDocumentScheduleBounds(newUrl, rescanTime, rescanTime, null,
null);
That way processDocuments() doesn't do anything during the specified delay.
It's just a temporary workaround...
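
For anyone reading this later, here is a rough sketch of the kind of time gate
described above. It is only an illustration under assumptions: CrawlGate,
tryAcquire() and nextAllowedTime() are hypothetical names, not part of
ManifoldCF, and the timestamp is held in an instance field (one shared
instance would play the role of the static "lastCrawlAttempt"). The connector
would call tryAcquire() at the top of processDocuments() and return early when
it yields false.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class CrawlGate {
    // Time of the last fetch that was let through, in milliseconds.
    // Long.MIN_VALUE means "nothing has been fetched yet".
    private final AtomicLong lastCrawlAttempt = new AtomicLong(Long.MIN_VALUE);

    // Minimum delay between two fetches, in milliseconds.
    private final long minDelayMs;

    CrawlGate(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    /**
     * Returns true and records nowMs if at least minDelayMs has passed since
     * the last permitted fetch; returns false (recording nothing) otherwise.
     * No thread is put to sleep; the caller simply returns and retries later.
     */
    boolean tryAcquire(long nowMs) {
        while (true) {
            long last = lastCrawlAttempt.get();
            if (last != Long.MIN_VALUE && nowMs - last < minDelayMs) {
                return false; // too early, caller should do nothing
            }
            if (lastCrawlAttempt.compareAndSet(last, nowMs)) {
                return true; // this thread won the slot
            }
            // Another thread updated the timestamp first; re-check.
        }
    }

    /**
     * Earliest time at which tryAcquire() could succeed again. This is the
     * sort of value one might pass as "rescanTime" in the
     * setDocumentScheduleBounds(...) call quoted above (an assumption).
     */
    long nextAllowedTime(long nowMs) {
        long last = lastCrawlAttempt.get();
        return (last == Long.MIN_VALUE) ? nowMs : Math.max(nowMs, last + minDelayMs);
    }
}
```

Using compareAndSet instead of a plain static long keeps the check-and-update
atomic when several worker threads hit the gate at once, without blocking any
of them.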

-Fuad

-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com] 
Sent: April-05-11 11:53 AM
To: connectors-user@incubator.apache.org
Subject: Re: How to add tast to queue dynamically (WebCrawler)

Hi Fuad,

Ok, so this is for politeness?
I am sure you've looked at what the RSS and Web connectors do to enforce
politeness constraints.  As you probably know, the framework has the ability
to throttle all connections using AVERAGE fetch rate throttling (see the
"Throttling" tab for the connection).  But if you need to make sure you do
not exceed a MAXIMUM rate, the standard is to adopt logic similar to that
used by the RSS and Web connectors, which limit connection count as well as
maximum fetch rate by way of connector-based throttling.

I suppose that you may not like the Thread.sleep() you see in the throttling
code in the RSS and Web connectors.  Since these connectors are throttling
max connections as well as maximum fetch rate, it was not possible in all
cases to avoid Thread.sleep().  But I can see a case for trying to control
scheduling of documents for the purposes of enforcing a maximum fetch rate
alone.
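
To make the scheduling idea concrete, here is a minimal sketch of enforcing a
maximum fetch rate purely by assigning fetch times, with no Thread.sleep().
It is not how the RSS or Web connectors work, and MaxRateScheduler and
reserve() are hypothetical names rather than ManifoldCF API; the sketch only
shows the bookkeeping a connector would need if it could control the schedule.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of ManifoldCF. */
final class MaxRateScheduler {
    // Minimum spacing between two fetches from the same host, in milliseconds.
    private final long minIntervalMs;

    // For each host, the earliest time at which the next fetch may be scheduled.
    private final Map<String, Long> nextSlotByHost = new HashMap<String, Long>();

    MaxRateScheduler(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /**
     * Reserves the next free fetch slot for the given host and returns its
     * time. Slots for one host are never closer together than minIntervalMs,
     * so the maximum rate holds no matter how fast documents are queued.
     */
    synchronized long reserve(String host, long nowMs) {
        Long next = nextSlotByHost.get(host);
        long slot = (next == null) ? nowMs : Math.max(nowMs, next);
        nextSlotByHost.put(host, slot + minIntervalMs);
        return slot;
    }
}
```

The returned slot is the time a document would have to be scheduled for, which
is why this approach needs schedule control on every path that adds a document
to the queue.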

In order for that to work, you'd need connector control over the schedule
for every way a document can be added to the job queue.  The
addDocumentReference() method is only one such case; you'd also want similar
functionality for addSeedDocuments().  I'd suggest creating a ticket for
this change to the API.  FWIW, I don't think this is a big win for either
web or rss crawling, since all that the Thread.sleep() does is reduce
(slightly) the number of available threads, so I'd prioritize it
accordingly.

Karl

On Tue, Apr 5, 2011 at 11:23 AM, Fuad Efendi <f...@efendi.ca> wrote:
> Hi Karl,
>
> I need to crawl a sequence of (different) URLs from the same host, and
> each URL defines the next one to be crawled; I can crawl the next URL only
> after a specified amount of time. URLs are different... of course I can
> use
> Thread.currentThread().sleep() before calling
> activities.addDocumentReference(newUrl) but it seems too naïve...
> And this use case is very similar to a generic Web crawl (when we need
> to be polite, with a 2-3 second delay before recrawling the same domain)
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: April-05-11 11:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Re: How to add tast to queue dynamically (WebCrawler)
>
> If you are trying to control the schedule for the FIRST time a 
> document is fetched, the IProcessActivity API doesn't permit that at 
> this time.  You would need to add a new version of
> addDocumentReference() to the IProcessActivity interface, which 
> allowed you to set the scheduled processing time in addition to 
> everything else.  The internals for such a change should be 
> straightforward since all the moving parts are already there.
>
> I'm curious, however, about your use case.  It is currently unheard of 
> for connectors to try to control the scheduling of all documents being 
> fetched - this would interfere with ManifoldCF's scheduling 
> algorithms, which are designed for maximum throughput.  I'd like to be 
> sure your design makes sense before I agree that this is a reasonable 
> addition to the API.  Can you explain the connector and its design so 
> that I can see what you are trying to accomplish?
>
> Thanks!
> Karl
>
> On Tue, Apr 5, 2011 at 10:51 AM, Fuad Efendi <f...@efendi.ca> wrote:
>>
>> Hi Karl,
>>
>> So this is "retry"... can we schedule document retrieval? I retrieve
>> XML, generate a new URL, and I want to schedule this new Document to be
>> retrieved at a specific time.
>> -Fuad
>>
>>
>
>
