On Mar 31, 2009, at 12:38 PM, Robin Howlett wrote:
> Hello,
>
> I've only really taken an introductory look at Droids and ran through the
> samples. I think I'll be using Droids for an upcoming project. I have a
> couple of questions first:
>
> I ran both the SimpleRuntime example and the Cli example through a site I
> wish to parse. Droids seems to keep an index of the links in the page to
> parse and those parsed already - where is that list? In memory? Is it the
> queue? How big can that queue grow to?

The simple queue included in Droids is just an in-memory
ConcurrentHashMap, so its size is limited only by the available heap.

> The site I will be crawling will be around 500,000 pages - is this a number
> that could be supported? Can the index be persisted using a DB instead of
> being stored in memory?

Yes, the interface is easy to implement with a DB backend:
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-core/src/main/java/org/apache/droids/api/TaskQueue.java
When I use Droids, a DB-backed queue is what I use, but my
implementation has become too domain-specific to contribute back in
any useful form. We should look into adding something to the core that
persists the queue to some store: SQL, ehcache, whatever.
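
To give a rough idea of what "persists to something" could look like,
here is a minimal sketch that journals the queue to a plain file so a
crawl can resume after a restart. It is an assumption-laden illustration:
the class FileBackedLinkQueue and its methods offer/poll/size are
hypothetical names, it does not implement the Droids TaskQueue interface,
and it still keeps the working set in memory. A SQL table could take the
place of the journal file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, NOT the Droids TaskQueue API.
// Journal format: "A <url>" when a link is queued, "D <url>" when it is handed out.
final class FileBackedLinkQueue {
    private final Path journal;
    private final Set<String> seen = new HashSet<>();          // every URL ever queued
    private final Set<String> pending = new LinkedHashSet<>(); // FIFO of URLs still to crawl

    FileBackedLinkQueue(Path journal) throws IOException {
        this.journal = journal;
        if (Files.exists(journal)) {
            // Rebuild the in-memory state by replaying the journal.
            for (String line : Files.readAllLines(journal, StandardCharsets.UTF_8)) {
                replay(line);
            }
        }
    }

    private void replay(String line) {
        if (line.length() < 3) {
            return; // skip blank or truncated records
        }
        String url = line.substring(2);
        if (line.startsWith("A ")) {
            if (seen.add(url)) {
                pending.add(url);
            }
        } else if (line.startsWith("D ")) {
            pending.remove(url);
        }
    }

    // Queues a URL unless it was seen before; returns true if it was added.
    synchronized boolean offer(String url) throws IOException {
        if (seen.contains(url)) {
            return false;
        }
        append("A " + url);
        seen.add(url);
        pending.add(url);
        return true;
    }

    // Hands out the oldest pending URL, or null if the queue is empty.
    // The URL is marked done as soon as it is handed out, so a crash
    // in the middle of a fetch would lose that one URL.
    synchronized String poll() throws IOException {
        Iterator<String> it = pending.iterator();
        if (!it.hasNext()) {
            return null;
        }
        String url = it.next();
        append("D " + url);
        it.remove();
        return url;
    }

    synchronized int size() {
        return pending.size();
    }

    private void append(String record) throws IOException {
        Files.write(journal, Collections.singletonList(record), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Constructing a second FileBackedLinkQueue on the same file after a
restart picks up the pending links and still rejects URLs that were
already queued.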

> Some of the links to content I wish to crawl/parse/index are JavaScript pop
> ups - therefore I wish to alter the url for the crawler to use; this should
> be no problem right?

Should not be a problem. If you can extract the URLs from the parsed
data, you can add them to the queue.
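
As a rough illustration of pulling real URLs out of JavaScript pop-up
links, here is a small sketch. It assumes the links look like
href="javascript:someFunction('target.html', ...)" with the target as the
first single-quoted argument; your site may well differ. The class
PopupLinkExtractor is a hypothetical helper, not part of Droids.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Droids.
final class PopupLinkExtractor {
    // Matches href="javascript:fn('target'..." and captures the first
    // single-quoted argument. The link shape is an assumption.
    private static final Pattern POPUP = Pattern.compile(
            "href\\s*=\\s*\"javascript:\\s*\\w+\\(\\s*'([^']+)'",
            Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs for every pop-up link found in the HTML.
    static List<String> extract(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = POPUP.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative targets against the URL of the page.
                urls.add(base.resolve(m.group(1)).toString());
            } catch (IllegalArgumentException e) {
                // Target is not a valid URI; skip it.
            }
        }
        return urls;
    }
}
```

Each string returned by extract can then be wrapped in whatever task
type your queue expects and added to it; ordinary links are ignored by
this pattern and would go through the normal link extraction.
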
ryan