On Fri, 2009-09-11 at 11:16 +0100, Tony Dietrich wrote:
> See my previous post.
>
> Websphinx is a good reference package for this structure. Whoever wrote it
> was quite sensible.

https://issues.apache.org/jira/browse/DROIDS-58

Maybe you want to comment on this issue with the findings of websphinx.

salu2

>
> Tony
>
> -----Original Message-----
> From: Bertil Chapuis [mailto:[email protected]]
> Sent: 11 September 2009 11:12
> To: [email protected]
> Subject: RE: Customizable Solr Handle
>
> I just had a look at the Droids architecture and I wonder whether the
> Parser could be considered a Handler, because in most handlers we will
> have to parse the content again.
>
> Doing this could lead to a simple generic filter mechanism. When
> executed, the Worker receives an object (Link, FileTask, etc.) and, for
> each Handler, tests with the Filter(s) whether the Handler should be
> executed or not.
>
> filter.shouldExecute(Link, Handler){...}
>
> What do you think about that? It could be a nice way to keep things
> simple and modular.
>
> Best regards,
>
> Bertil
>
>
>
> On Wed, 2009-09-09 at 11:15 +0100, Tony Dietrich wrote:
> > Haven't got time atm to look at this myself, but there's a nice approach
> > to this sort of problem (of what to do with pages that (need to be)|(have
> > been crawled) ) in the old websphinx package.
> >
> > If I remember rightly, the package uses predicate classes (which can be
> > standardised or sub-classed) and which return true/false in certain
> > conditions, and methods in the crawler class which determine what actions
> > are taken under which circumstances.
> >
> > Ie
> > public boolean shouldVisit(..){..}
> > public boolean shouldDownload(..){..}
> > public boolean shouldProcess(..){..}
> >
> > with each method calling a declared predicate class (or chain of classes,
> > depending on whether the implementation contains sub-classed predicates).
> >
> > Perhaps a similar approach could be used for droids, since it very nicely
> > provides a standards-acceptable, extensible approach to this sort of
> > problem.
> >
> > Perhaps overkill for Bertil's problem, but for future implementations
> > .....
> >
> > Tony
> >
> >
> > -----Original Message-----
> > From: Thorsten Scherler [mailto:[email protected]]
> > Sent: 09 September 2009 11:11
> > To: [email protected]
> > Subject: Re: Customizable Solr Handle
> >
> > On Wed, 2009-09-09 at 10:38 +0200, Bertil Chapuis wrote:
> > > Hello,
> > >
> > > My name is Bertil Chapuis. I am using droids for a personal project and
> > > I am trying to create a more customizable solr handler.
> >
> > Hi Bertil, nice to have you on this list.
> >
> > >
> > > I posted a ticket with my code (DROIDS-62). However, I am looking for a
> > > way to filter the handler's execution. I'd like to handle the documents
> > > only if their URI or content matches specific conditions.
> >
> > I will have a look at your patch, thanks in advance for your
> > contribution.
> >
> > >
> > > For example, the document is handled only if its uri matches the
> > > following regex:
> > >
> > > http://www.awebsite.com/document-[0-9]*.htm
> > >
> > > What's the best way to do that? Is it delegated to the handler's
> > > implementation or is there a standard way?
> >
> > Mingfai has this filter approach theoretically included in our next
> > version. However, right now we do not have a standard approach other than
> > implementing the validation logic in e.g. the queue. The question is
> > whether you want only to crawl the pages that are valid against your
> > regex or the limitation is only for the handler.
> >
> > If it is only for the handler then it is maybe best to implement it in
> > your worker. Something like:
> > ...
> > public void execute(Link link) throws DroidsException, IOException {
> >
> > ...
> >   URI uri = link.getURI();
> >   Pattern pattern = Pattern.compile(PATTERN);
> >   // Pattern.matcher() takes a CharSequence, so pass the URI as a string
> >   Matcher matcher = pattern.matcher(uri.toString());
> >   if (matcher.find()) {
> >     droid.getHandlerFactory().handle(uri, entity);
> >   }
> > ...}
> >
> >
> > HTH
> >
> > salu2
> >
> > >
> > > Best regards,
> > >
> > > Bertil Chapuis
> > >
> >

-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad
de la Información, S.A.U. (SADESI)
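
The handler-side filtering discussed in this thread could be sketched as a small predicate type. What follows is a minimal, self-contained sketch and not Droids code: HandlerFilter, RegexUriFilter and FilterDemo are hypothetical names, and only java.net.URI and java.util.regex.Pattern from the JDK are used. It assumes the intent is to match the whole URI, so it escapes the dots in Bertil's example pattern and calls matches() rather than the find() used in the quoted snippet.

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical predicate: decides whether a handler should run for a URI.
interface HandlerFilter {
    boolean shouldExecute(URI uri);
}

// Hypothetical regex-based implementation of the predicate.
final class RegexUriFilter implements HandlerFilter {
    private final Pattern pattern;

    RegexUriFilter(String regex) {
        // Compile once; a Pattern is immutable and safe to reuse.
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public boolean shouldExecute(URI uri) {
        // matcher() needs a CharSequence, so the URI is converted to a string.
        // matches() requires the entire URI to match the pattern.
        return pattern.matcher(uri.toString()).matches();
    }
}

class FilterDemo {
    public static void main(String[] args) {
        // Example pattern from the thread, with the literal dots escaped.
        HandlerFilter filter =
            new RegexUriFilter("http://www\\.awebsite\\.com/document-[0-9]*\\.htm");

        URI[] candidates = {
            URI.create("http://www.awebsite.com/document-42.htm"),
            URI.create("http://www.awebsite.com/index.htm")
        };
        for (URI uri : candidates) {
            // A worker would call the handler only when the filter accepts the URI.
            System.out.println(uri + " -> " + filter.shouldExecute(uri));
        }
    }
}
```

A worker could hold a list of such filters and skip a handler as soon as one of them returns false. Note that [0-9]* also accepts zero digits, so a URI ending in document-.htm would pass; [0-9]+ would be stricter.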
