Hi Karl,

Yes, this is for politeness. The RSS and Web connectors seem extremely rich (and hard to use as a basic sample). Thread.sleep() is fine, but since this runs in a web container (with an externally managed thread pool) it is unsafe, especially as I don't know many of the details yet. JEE strictly advocates avoiding thread-level programming, and Java 6 has new features which we can use...
I found an easy temporary solution: a static "lastCrawlAttempt" variable that is checked against the current time in the processDocuments() method body (simply returning if the delay has not elapsed yet), together with explicit scheduling:

    activities.setDocumentScheduleBounds(newUrl, rescanTime, rescanTime, null, null);

so that processDocuments() doesn't do anything during the specified delay. Just a temporary workaround...

-Fuad

-----Original Message-----
From: Karl Wright [mailto:daddy...@gmail.com]
Sent: April-05-11 11:53 AM
To: connectors-user@incubator.apache.org
Subject: Re: How to add tast to queue dynamically (WebCrawler)

Hi Fuad,

Ok, so this is for politeness?

I am sure you've looked at what the RSS and Web connectors do to enforce politeness constraints. As you probably know, the framework has the ability to throttle all connections using AVERAGE fetch rate throttling (see the "Throttling" tab for the connection). But if you need to make sure you do not exceed a MAXIMUM rate, the standard is to adopt logic similar to that used by the RSS and Web connectors, which limit connection count as well as maximum fetch rate by way of connector-based throttling.

I suppose that you may not like the Thread.sleep() you see in the throttling code in the RSS and Web connectors. Since these connectors are throttling max connections as well as maximum fetch rate, it was not possible in all cases to avoid Thread.sleep(). But I can see a case for trying to control scheduling of documents for the purposes of enforcing a maximum fetch rate alone.

In order for that to work, you'd need connector control over the schedule for every way a document can be added to the job queue. The addDocumentReference() method is only one such case; you'd also want similar functionality for addSeedDocuments(). I'd suggest creating a ticket for this change to the API.
FWIW, I don't think this is a big win for either web or rss crawling, since all that the Thread.sleep() does is reduce (slightly) the number of available threads, so I'd prioritize it accordingly.

Karl

On Tue, Apr 5, 2011 at 11:23 AM, Fuad Efendi <f...@efendi.ca> wrote:
> Hi Karl,
>
> I need to crawl sequence of (different) URLs from the same host, and
> each URL defines next one to be crawled; I can crawl next URL only
> after specified amount of time. URLs are different... of course I can
> use Thread.currentThread.sleep() before calling
> activities.addDocumentReference(newUrl) but it seems too naïve...
> And this use case is much similar to generic Web crawl (when we need
> to be polite, 2-3 seconds delay before recrawl from same domain)
>
>
> -----Original Message-----
> From: Karl Wright [mailto:daddy...@gmail.com]
> Sent: April-05-11 11:06 AM
> To: connectors-user@incubator.apache.org
> Subject: Re: How to add tast to queue dynamically (WebCrawler)
>
> If you are trying to control the schedule for the FIRST time a
> document is fetched, the IProcessActivity API doesn't permit that at
> this time. You would need to add a new version of
> addDocumentReference() to the IProcessActivity interface, which
> allowed you to set the scheduled processing time in addition to
> everything else. The internals for such a change should be
> straightforward since all the moving parts are already there.
>
> I'm curious, however, about your use case. It is currently unheard of
> for connectors to try to control the scheduling of all documents being
> fetched - this would interfere with ManifoldCF's scheduling
> algorithms, which are designed for maximum throughput. I'd like to be
> sure your design makes sense before I agree that this is a reasonable
> addition to the API. Can you explain the connector and its design so
> that I can see what you are trying to accomplish?
>
> Thanks!
> Karl
>
> On Tue, Apr 5, 2011 at 10:51 AM, Fuad Efendi <f...@efendi.ca> wrote:
>>
>> Hi Karl,
>>
>> So this is "retry"... can we schedule document retrieval? I retrieve
>> XML, generate new URL, and I want to schedule this new Document to be
>> retrieved at specific time
>> -Fuad
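
For reference, the temporary workaround described at the top of this thread (remember the time of the last crawl attempt, return early from processDocuments() while the delay has not elapsed, and reschedule the document instead of sleeping) might look roughly like the sketch below. This is only an illustration under stated assumptions: PolitenessGate, tryAcquire() and nextAllowedTime() are hypothetical names and not part of ManifoldCF, and the sketch deliberately makes no ManifoldCF calls. The connector would still pass the computed time to activities.setDocumentScheduleBounds(...) itself, as in the message above.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not a ManifoldCF class): a non-blocking politeness gate.
 * It never calls Thread.sleep(); the caller decides what to do when a fetch
 * is not yet allowed (for example, return early and reschedule the document).
 */
final class PolitenessGate {
    private final long minDelayMillis;

    // Time of the last permitted fetch; Long.MIN_VALUE means "no fetch yet".
    private final AtomicLong lastCrawlAttempt = new AtomicLong(Long.MIN_VALUE);

    PolitenessGate(long minDelayMillis) {
        if (minDelayMillis < 0) {
            throw new IllegalArgumentException("minDelayMillis must be >= 0");
        }
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Returns true and records nowMillis if at least minDelayMillis has passed
     * since the last permitted fetch; otherwise returns false without blocking.
     */
    boolean tryAcquire(long nowMillis) {
        while (true) {
            long last = lastCrawlAttempt.get();
            if (last != Long.MIN_VALUE && nowMillis - last < minDelayMillis) {
                return false; // too early: caller should skip and reschedule
            }
            // compareAndSet guards against two worker threads passing at once.
            if (lastCrawlAttempt.compareAndSet(last, nowMillis)) {
                return true;
            }
        }
    }

    /** Earliest time at which the next fetch would be allowed (a candidate rescan time). */
    long nextAllowedTime(long nowMillis) {
        long last = lastCrawlAttempt.get();
        if (last == Long.MIN_VALUE) {
            return nowMillis;
        }
        return Math.max(nowMillis, last + minDelayMillis);
    }
}
```

In a connector, one such gate per host could be consulted at the start of processDocuments() with System.currentTimeMillis(): when tryAcquire() returns false, the method would return without fetching, and nextAllowedTime() would supply the rescanTime value. Whether this interacts well with the framework's own scheduling is exactly the open question raised in the thread.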