Re: [Nutch-general] Nutch Case Studies

m h Tue, 09 Nov 2004 10:08:22 -0800

Hey Murad,

I will be more than willing to share experiences and give back the
code once we have a little more experience.  Right now we are in more
of a planning/beginning development phase, so I don't have much to
share.

I would like to share our vision for our search engine and ask for input.

We would like to crawl internal sites (form authenticated), some
external sites and some non-traditional sites (RSS, mailing lists,
newsgroups).  The rest of our app is in PHP, so we would prefer to put
a web-service around Nutch.

I have used code based off what Matt?? listed in the mailing list
(using HttpClient) for crawling internal and external sites.  So one
suggestion I have is moving towards using HttpClient if possible (I
haven't tested it thoroughly yet, but in my simple tests it appears
to work).  This would give us HTTP authentication and avoid the
reinventing-the-wheel syndrome (if this is possible).  My code is
currently not generalized enough to contribute back.
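
For illustration only, here is a minimal sketch of what the simplest
kind of HTTP authentication (the "Basic" scheme) amounts to on the
wire.  It uses only the Java standard library, not HttpClient or Nutch,
and the class name BasicAuthHeader and the credentials are made up for
the example.  Form-based login, as needed for the internal sites, is a
different mechanism and is not covered here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration; not part of Nutch or HttpClient.
final class BasicAuthHeader {
    private BasicAuthHeader() {}

    /** Builds the value of the Authorization header for HTTP Basic auth. */
    static String build(String user, String password) {
        // The scheme is "user:password", Base64-encoded, prefixed with "Basic ".
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Placeholder credentials; a real crawler would read these from config.
        System.out.println("Authorization: " + build("crawler", "secret"));
    }
}
```

An HTTP library such as HttpClient would normally build this header
itself once it is given credentials, which is the appeal of not
re-implementing it in the fetcher.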

Crawling external sites should be easy.  I think we will be pruning
pages that aren't relevant to us, and it appears that code has
recently been contributed that will allow this.

So that leaves the web-service and non-traditional sites.

Re: the web-service.  I sent out an email requesting help/direction in
developing this.  I really didn't get much response, so I'll probably
ask again.  If there is no current effort to use the JMX method (with
which I'm not familiar), we will just end up doing another Axis effort.

Re: non-trad sites.  I'm calling RSS, mailing lists and newsgroups
non-traditional sites (not because you can't get to them through HTTP
directly, but because I'm not convinced that is the correct way to
crawl them).  Our current idea is to set up an account to subscribe to
mailing lists and RSS feeds and then push them into the search engine.
(I imagine a web-service method like addContent(String[] urls,
String[] content, ...) that will allow one to add arbitrary content to
the search engine on the fly.)  If anyone has comments/ideas around
this or is interested in designing/implementing this solution, then
let's talk.
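
To make the addContent idea a little more concrete, here is a rough
sketch of what such a push-style interface could look like.  Only the
method name and its first two parameters come from the idea above.
The interface name ContentPushService, the in-memory implementation,
the return value and the pendingCount() helper are all assumptions for
the example, and nothing here touches Nutch's actual indexing code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical push API; only the addContent name comes from the proposal. */
interface ContentPushService {
    /** Queues one document per (url, content) pair; returns how many were accepted. */
    int addContent(String[] urls, String[] content);
}

/** Toy implementation that only queues documents in memory (no indexing). */
final class InMemoryContentPushService implements ContentPushService {

    /** A pushed document waiting to be indexed. */
    static final class Document {
        final String url;
        final String content;

        Document(String url, String content) {
            this.url = url;
            this.content = content;
        }
    }

    private final List<Document> pending = new ArrayList<>();

    @Override
    public synchronized int addContent(String[] urls, String[] content) {
        // The two arrays are parallel, so they must line up one to one.
        if (urls == null || content == null || urls.length != content.length) {
            throw new IllegalArgumentException("urls and content must have the same length");
        }
        for (int i = 0; i < urls.length; i++) {
            pending.add(new Document(urls[i], content[i]));
        }
        return urls.length;
    }

    /** Number of documents queued so far (assumed helper for the example). */
    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        InMemoryContentPushService service = new InMemoryContentPushService();
        // Placeholder URL and body standing in for a mailing-list message.
        int accepted = service.addContent(
                new String[] {"mailto:list@example.org/msg/1"},
                new String[] {"Subject: hello\n\nFirst post."});
        System.out.println("accepted=" + accepted + " pending=" + service.pendingCount());
    }
}
```

A real version would hand the queued documents to the indexer and sit
behind whatever web-service layer is chosen, so that the PHP side can
call it.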

Of course we will also need to test our solution.  Our solution is
relatively small since we are not doing a wide crawl, just focused
crawling.  Our stress testing will probably consist of HttpUnit.

thanks

On Thu, 4 Nov 2004 12:00:13 -0800, Murad Goeksel <[EMAIL PROTECTED]> wrote:
> Anecdotal evidence suggests that one of the obstacles to more Nutch adoption
> is the lack of case studies and documented best practices. It's difficult to
> plan and justify a significant deployment, especially to business managers,
> when one cannot reference other people's experiences.
>
> After consulting with Doug Cutting, we decided to serve the Nutch community
> by preparing and publishing case studies of Nutch deployments. We'd like to
> make the case studies at least as professional as those from Verity and
> mySQL.
>
> If you have implemented Nutch in a significant project, we'd like to hear
> from you. You get to brag about your achievements while the Nutch community
> learns from your experience. To contribute, please contact Murad Goeksel at
> [EMAIL PROTECTED]  Thank you.
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> We are here to celebrate the sun,
> the moon, and high technology.
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
