hi,
> >
> > Does the *core* really need access to the whole object graph?  I
> > totally agree that most specific implementations will need broader
> > access.
> >
> > I think droids power will come from its flexibility / simplicity.
> > Ideally the *core* will have as few dependencies as possible.
> >
> > I agree that sub-project/package that focuses on web crawling could
> > depend on spring.
>
> I remember that we once had a dependency on spring from the core.
> However Oleg had shown with the simpleRuntime that we do not need to
> rely on spring in the core. I agree with Ryan and would like to
> "outsource" this dependency to a component of its own.
>

Depending directly on Spring (e.g. using Spring's annotations) is just a
little more convenient and makes the code a little cleaner. If we don't,
as long as we design with a Dependency Injection mindset and write our
components as JavaBeans, we can still use Spring or any other DI
framework to wire the components in XML.

The core probably doesn't need DI if it only contains the individual
components such as the parser, the task master, etc. For any module that
needs to wire things up, a DI framework will help.

> > >
> > > - See the Link-centric design bullet for more info about link
> > >
> > > - Fetcher
> > >   - we do not use the concept of Fetcher now. I suppose it is
> > >     because Droids is designed to do more than web crawling and
> > >     non-web resource is not "fetched"? In Droids 0.01, Protocol
> > >     basically represent the Fetcher (or at least, Protocol+Worker)
> >
> > +1 -- I think a Fetcher concept is a good idea.  It should also be
> > independent of the task interface.  When thinking about the fetcher,
> > it may be good to consider VFS (http://commons.apache.org/vfs/) as a
> > first class implementation.
>
> I think I get the fetcher concept (getting a representation of the
> underlying task) but not sure how vfs fits the picture.

It sounds easy to implement.
There would be a VFSFetcher that takes a single VFS URL (a String) as its
configuration and returns a ContentEntity (depending on the interface, or
maybe just an InputStream). I guess we should at least create the concept
of a Fetcher for the crawler use case, if not in the core.

> >
> > agree.
> >
> > "Parser" and "Extractor" make sense,
> >
> > "Processor" and "Handler" are not clear to me.  I know they each have
> > functions that can be reused by each other, but the general terms get
> > confusing.
> >
>
> Hmm, http://www.thefreedictionary.com/handler in a parsing concept I
> agree is connected with SAX but in our context it aims to express that
> the task now can be passed over to the final stage.
>
> To get it straight you recommend that we drop handlers and only have
> parser and extractor? The problem I see then is that the parsing stage
> would be mandatory since its merges from there, right?
>

I mainly suggest that we define the concepts clearly, and also define the
interfaces and the calling sequence. Whether it is called a handler or a
parser, or whether we make Parser extend Handler, those are all just
names.

In the current snapshot, the Handler interface sounds like a filter or a
listener to me. It has a method that takes a URI and a ContentEntity, and
the parser has a similar interface. I think we need to define its role
clearly:

- Is there anything that shouldn't be done by a parser but by a handler?
- What is the flow / sequence of using the handler?

If a handler works in a similar way to a parser, I suggest merging them.
An event-driven parser such as a SAX XML parser also has a concept of
ContentHandler, which has nothing to do with our Handler interface.

btw, the extractor is mainly for crawling.

> >
> > I will defer to others on the Entity discussion...  I am not really
> > familiar with the concepts
>
> Like said above the entity part of Droids had been implemented by oleg
> which is most familiar with the concept.
>

Make it simple.
I am just raising the question of what we intend to achieve by wrapping
the HttpEntity from the HC project. I think:

- imho, the HttpEntity object is mainly for holding the InputStream and
  storing some metadata about it, such as content type, content length
  and encoding.
- Just looking at the API, it provides getMimeType(), getCharset() and
  getParse(). The MIME type is the HTTP content type and the charset is
  the encoding. As for the parse, I suppose it stores arbitrary data.
  (btw, I think the concept of Parse should be reviewed, too.)
- So it seems to me the main purpose is just to avoid tying it to HTTP.
  Conceptually that's OK, but people who learn Droids will need to learn
  both Droids' entity and HC's entity. I personally prefer to use HC's
  directly if there is no major added value. The name is not "politically
  correct" for us, but that's no big deal. (esp. as I use Droids for web
  crawling.. :-) )
- If we make the Link / LinkTask a Map, then we already have a vehicle
  to store arbitrary data. And if we also standardize the interfaces to
  take the Link/LinkTask instead of a URI, we have a place to store any
  data that needs to be associated with the Entity/ContentEntity (which
  is always associated with a Link/LinkTask).

> > > - if we keep TaskMaster, i suggest to make it implements
> > >   ExecutorService, and we depends on Java util/concurrency API
> > >   rather than a TaskMaster interface.
> > >
> >
> > seems good.
>

A component that manages the queue, the threads and the execution is
needed, and that is what the Task Master is designed to be. I suggest
doing it this way:

- We have the Task Master in the core.
- For the crawler/web-crawler module, the Crawler may extend a Task
  Master if suitable (rather than the Crawler having a Task Master).

> > >
> > > - Queue
> > >   - I suggest to remove the TaskQueue interface and use Queue<?
> > >     extends Link> as standard signature.
> > >
> >
> > +1
>
> +1
>

Basically, many of my proposals just remove some interfaces from Droids
or make them optional.
So when people pick up Droids, they have fewer things to learn.
TaskQueue -> Queue and TaskMaster -> ExecutorService are two examples.

> > >
> > > 8. Removed concepts based on the above proposal
> > >    - LinkTask, Task, just keep Link
>
> I would prefer to keep task since it is more generic.

How about this, we have:

- Task in core
  - has an "id" field (a Long?)
  - extends HashMap
  - implements Serializable
- Link in the crawler module
  - extends Task
  - has a "url" field as a String (or URI)
- Re the name: people who read the interface know it's a Task anyway.
  Keeping the name LinkTask as is is also OK, but people will ask what
  exactly the "Task" in LinkTask means. For crawling, we may put just
  URLs in the queue, and "Task" is something generic that becomes vague
  in the context of a crawler. It's understandable that the naming of
  components in a generic robot framework and in a crawler framework
  are different.

> > >
> > > any comment?
> > >
> >
> > In general sounds good.  As for flushing out large changes like this
> > -- i think we should discuss it a bit more to make sure everyone is on
> > the same page.  Then it probably makes sense to start a personal
> > sandbox:
> > http://svn.apache.org/repos/asf/incubator/droids/sandbox/mingfa
> > where we can see some things in action and then look at migrating
> > things together.
> >
> Yeah this way we can as well promote the proven code to the main tree.
>
> Thanks for the feedback.

I've put some code into svn already. As I doubt anyone would be
interested in running it, I treat it as an example to illustrate my
suggestions. It's better just to read the JavaDoc.
( http://people.apache.org/~mingfai/javadoc/ the package is wrong, just
ignore it)

Thanks for the comments.

regards,
mingfai
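
P.S. A rough sketch of the Task / Link split described above, purely for
illustration. Only the "id" and "url" fields, "extends HashMap" and
"implements Serializable" come from the proposal. The wrapper class
TaskSketch, the String-keyed map, the constructors, the getters and the
"depth" key are assumptions made up for this example, and a plain
Queue<Link> stands in for the Queue<? extends Link> signature.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Queue;

// Hypothetical wrapper class, only here so the sketch runs as one file.
class TaskSketch {

    /** Core: a generic unit of work, i.e. a map of arbitrary data plus an id. */
    static class Task extends HashMap<String, Object> implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Long id;

        Task(Long id) { this.id = id; }

        Long getId() { return id; }
    }

    /** Crawler module: a Task that also carries the URL to visit. */
    static class Link extends Task {
        private static final long serialVersionUID = 1L;
        private final String url; // could be a java.net.URI instead

        Link(Long id, String url) {
            super(id);
            this.url = url;
        }

        String getUrl() { return url; }
    }

    public static void main(String[] args) {
        // A plain java.util.Queue replaces a dedicated TaskQueue interface.
        Queue<Link> queue = new LinkedList<>();

        Link link = new Link(1L, "http://example.org/");
        link.put("depth", 0); // arbitrary per-link data lives in the map
        queue.add(link);

        Link next = queue.poll();
        // prints: 1 http://example.org/ 0
        System.out.println(next.getId() + " " + next.getUrl() + " " + next.get("depth"));
    }
}
```

Because a Link is itself a Map, anything that needs to travel with the
entity (content type, depth, parse results) can ride along without a
separate wrapper type.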
