Hey Murad, I will be more than willing to share experiences and give back the code once we have a little more experience. Right now we are in more of a planning/beginning development phase, so I don't have much to share.
I would like to share our vision for our search engine and ask for input. We would like to crawl internal sites (form authenticated), some external sites and some non-traditional sites (RSS, mailing lists, newsgroups). The rest of our app is in PHP, so we would prefer to put a web service around Nutch.

I have used code based on what Matt?? posted to the mailing list (using HttpClient) for crawling internal and external sites. So one suggestion I have is to move towards HttpClient if possible (I haven't tested it thoroughly yet, but in my simple tests it appears to work). This would give us HTTP authentication and spare us from reinventing the wheel (if this is possible). My code is currently not generalized enough to contribute back.

Crawling external sites should be easy. I think we will be pruning pages that aren't relevant to us, and it appears that code has recently been contributed that will allow this. So that leaves the web service and the non-traditional sites.

Re: the web service. I sent out an email requesting help/direction in developing this. I didn't get much response, so I'll probably ask again. If there is no current effort to use the JMX method (with which I'm not familiar), we will just end up doing another Axis effort.

Re: non-traditional sites. I'm calling RSS, mailing lists and newsgroups non-traditional sites, not because you can't get to them through HTTP directly, but because I'm not convinced that is the correct way to crawl them. Our current idea is to set up an account that subscribes to mailing lists and RSS feeds and then pushes them into the search engine. (I imagine a web-service method like addContent(String[] urls, String[] content, ...) that will allow one to add arbitrary content to the search engine on the fly.) If anyone has comments/ideas around this, or is interested in designing/implementing this solution, then let's talk.

Of course, we will also need to test our solution.
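
To make the addContent idea above a bit more concrete, here is a minimal sketch of what such a push-style contract might look like. Everything in it is an assumption for discussion: ContentIndex and InMemoryContentIndex are hypothetical names, nothing here is existing Nutch or Axis API, and the in-memory map only stands in for whatever would really hand the content to the indexer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical push-style contract; not part of Nutch. */
interface ContentIndex {
    /**
     * Adds one document per (url, content) pair.
     * Returns the number of pairs that were accepted.
     */
    int addContent(String[] urls, String[] content);
}

/** In-memory stand-in for a real index, used only to illustrate the contract. */
final class InMemoryContentIndex implements ContentIndex {
    // Keyed by URL so that pushing the same URL again replaces the old content.
    private final Map<String, String> documents = new LinkedHashMap<>();

    @Override
    public int addContent(String[] urls, String[] content) {
        if (urls == null || content == null || urls.length != content.length) {
            throw new IllegalArgumentException(
                "urls and content must be non-null and of equal length");
        }
        int accepted = 0;
        for (int i = 0; i < urls.length; i++) {
            // Skip incomplete entries instead of failing the whole batch.
            if (urls[i] == null || urls[i].isEmpty() || content[i] == null) {
                continue;
            }
            documents.put(urls[i], content[i]);
            accepted++;
        }
        return accepted;
    }

    /** Returns the stored content for a URL, or null if none was pushed. */
    String contentFor(String url) {
        return documents.get(url);
    }

    /** Number of distinct URLs currently stored. */
    int size() {
        return documents.size();
    }
}
```

A subscriber account that receives a mailing-list message or an RSS item would call addContent with an identifier for the item and its text. Whether a batch with a bad entry should be skipped in part (as here) or rejected as a whole is one of the design questions I'd like input on.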
Our solution is relatively small, since we are doing focused crawling rather than a wide crawl. Our stress testing will probably consist of HttpUnit.

Thanks

On Thu, 4 Nov 2004 12:00:13 -0800, Murad Goeksel <[EMAIL PROTECTED]> wrote:
> Anecdotal evidence suggests that one of the obstacles to more Nutch adoption
> is the lack of case studies and documented best practices. It's difficult to
> plan and justify a significant deployment, especially to business managers,
> when one cannot reference other people's experiences.
>
> After consulting with Doug Cutting, we decided to serve the Nutch community
> by preparing and publishing case studies of Nutch deployments. We'd like to
> make the case studies at least as professional as those from Verity and
> mySQL.
>
> If you have implemented Nutch in a significant project, we'd like to hear
> from you. You get to brag about your achievements while the Nutch community
> learns from your experience. To contribute, please contact Murad Goeksel at
> [EMAIL PROTECTED] Thank you.
>
> > > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > > We are here to celebrate the sun,
> > > the moon, and high technology.
> > > =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
