On Fri, Sep 02, 2005 at 05:05:33AM -0400, Henri Yandell wrote:
> On 9/2/05, Oleg Kalnichevski <[EMAIL PROTECTED]> wrote:
> > On Thu, Sep 01, 2005 at 10:30:29PM -0400, Henri Yandell wrote:
> > > Never got round to adding it to Commons, robots.txt parser:
> > >
> > > http://www.osjava.org/norbert/ -> 
> > > http://www.robotstxt.org/wc/norobots-rfc.html
> > >
> > > Web-spider:
> > >
> > > http://www.osjava.org/scraping-engine/
> > >
> > > HTML pseudo-scraper (probably more for Jakarta Silk/Web Components):
> > >
> > > http://www.osjava.org/genjava/multiprojects/gj-scrape/ (poor site at
> > > the moment, it's a substring()/indexOf() parsing system instead of
> > > trying to be fancy).
> > >
> > > Hen
> > >
> > 
> > Henri,
> > 
> > I think a web spider and robots.txt parser would be a welcome addition
> > to the project. If you are personally interested in porting these
> > applications to use HttpClient / Http Components go ahead and add the
> > web spider to the project goals and yourself to the list of initial
> > committers. In my opinion, voting you in as a committer is a matter
> > of formality.
> 
> The robots.txt parser currently makes a single GET request using
> HttpURLConnection, so moving it to HttpClient is pretty easy (if that's
> even thought necessary; adding the dependency for one method call is
> usually overkill). Will go ahead and add this to the list as it has
> very little religion.
>
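For concreteness, the single-GET fetch described above could look roughly like this, using only java.net.HttpURLConnection and no HttpClient dependency. The class and method names here are illustrative, not taken from norbert itself:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of a one-call robots.txt fetch with no external dependencies.
public class RobotsFetch {

    // Build the conventional robots.txt location for a host.
    static String robotsUrl(String host) {
        return "http://" + host + "/robots.txt";
    }

    // Perform the single GET and return the file body as text.
    static String fetchRobotsTxt(String host) throws IOException {
        URL url = URI.create(robotsUrl(host)).toURL();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```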

There's absolutely no need to couple the parser with any other modules
in the project. A simple "get me that damn file" interface can be used
to plug in any physical transport code.
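A minimal sketch of such a transport interface, with a canned in-memory implementation standing in for the real transport. All names here are hypothetical, not from any existing API:

```java
import java.io.IOException;
import java.util.Map;

// Hypothetical "get me that file" abstraction: the robots.txt parser talks
// only to this interface, so HttpClient, HttpURLConnection, or anything
// else can be plugged in behind it.
interface RobotsFileFetcher {
    /** Return the raw robots.txt content for the given host, or null if none. */
    String fetch(String hostUrl) throws IOException;
}

// A canned in-memory fetcher; enough for unit-testing the parser
// without any network access.
class InMemoryFetcher implements RobotsFileFetcher {
    private final Map<String, String> files;

    InMemoryFetcher(Map<String, String> files) {
        this.files = files;
    }

    @Override
    public String fetch(String hostUrl) {
        return files.get(hostUrl);
    }
}
```

The parser then takes a RobotsFileFetcher in its constructor, and the choice of transport becomes a one-line decision at wiring time.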


> The web-spider might want a bit more investigation on the community's
> part. It had its guts ripped out to form a kind of container project
> called oscube, so it has a dependency on that, and might be scoped a
> bit beyond what Http Components would want from a spider: cron via
> Quartz, notifications, database storage, etc. It already uses
> HttpClient for its fetching there (along with Commons Net for FTP).
> 
> http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/scraping-engine/xdocs/manual/images/Scrapers.png?rev=1967&view=auto
> 
> So a bit more than the simple wget clone that might have been
> envisioned. :) The plan is to add a mini scraping language to it,
> support POP, and possibly end up with some kind of rules engine/job
> language. A lot of religion for HttpClient to swallow, but it is there
> if it piques interest.

Anything that is not strictly transport-related, the HTML scraper
included, should not be part of the project, at least initially. I still
think that a simple spider framework relying on a bunch of callbacks to
do the HTML scraping, scheduling, and persistence could fit well within
the project scope and would represent a valuable addition to the project.
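As a rough sketch of what such a callback-driven core could look like: the spider only walks the link graph, while scraping, scheduling, and persistence live behind caller-supplied callbacks, and the transport itself is a pluggable function. Every name below is illustrative, not an existing API:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical callback supplied by the user: scrape the page however you
// like and tell the spider which links to follow next.
interface SpiderCallback {
    List<URI> onPage(URI location, String body);
}

// A minimal breadth-first spider core. Transport is a plain function so
// HttpClient (or a test stub) can be plugged in without any coupling.
class Spider {
    private final Function<URI, String> transport;

    Spider(Function<URI, String> transport) {
        this.transport = transport;
    }

    void crawl(URI start, SpiderCallback callback, int maxPages) {
        Set<URI> seen = new HashSet<>();
        Queue<URI> frontier = new ArrayDeque<>();
        frontier.add(start);
        while (!frontier.isEmpty() && seen.size() < maxPages) {
            URI next = frontier.poll();
            if (!seen.add(next)) continue;   // skip already-visited pages
            String body = transport.apply(next);
            if (body == null) continue;      // fetch failed or disallowed
            frontier.addAll(callback.onPage(next, body));
        }
    }
}
```

Quartz scheduling, database persistence, and the robots.txt check would all sit outside this core, either in the transport function or in the callback.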

Again, we should take this on board only if we have additional project
contributors to do all the actual coding, as we are already spread too
thin.

Just my take, for what it is worth.

Oleg


> 
> Hen
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
