On Fri, 9 Dec 2005, Gustave Stresen-Reuter wrote:

> Neal,
>
> I've been reading, with interest, the posts on the blog. I have a few
> questions so far.
>
> - Is htdig a competitor to Nutch? If not, could you take a few minutes
>   to clarify the differences between the two?
No! They will be complementary. I believe that HtDig is much easier to
manage and clearly more flexible to configure than Nutch.

Nutch is very powerful in terms of its scalability, both in document
count and in simultaneous searches. It uses an Apache Tomcat server to
service requests, and it can scale (and has scaled) to 200 million
documents. It's written in 100% Java as a full application built on Java
Lucene. It's great. However, I do believe that getting a Tomcat server
up and running, along with a JVM and the other associated
infrastructure, is a bit beyond the capabilities of a lot of our users.
It's not quite as simple as compiling and installing the binaries or
installing a package. I may be underestimating our users, but I base
this assessment on reading the htdig-general list.

HtDig 4.0 will be easy to compile, configure, and install, and/or to
install via RPM or another package manager. It won't require the user to
keep a server daemon running, and it will continue to provide a massive
variety of flexible configuration options. The addition of the CLucene
library underneath will enable HtDig to achieve good scalability in
document count.

The way I see it, HtDig 4.0 is for the classic use case: a site-specific
search engine for modestly sized websites that don't get tons of search
hits per second. Nutch is for people who have large document sets and/or
lots of search hits per unit time, and who need a multi-threaded server
daemon to handle the load.

FYI: Doug Cutting, the leader of Nutch & Lucene, was one of the original
authors of Excite and has been doing IR for 15+ years. Doug's aims are
much higher in terms of what Nutch is.

> - What, if any, modifications to the ranking engine will be made in 4.0
>   (saw the note about back-links and anchor texts - what about incoming
>   links from other domains)?
>
> - It seems the goal is to create a library that can be included in
>   other programs.
>   Will the library include all the code for spidering,
>   creating the indexes, and searching, or just the database creation
>   stuff, or something else...?

HtDig is an application for users, but we are architecting 4.0 in such a
way that it can be used as a library in other applications. For a while
KDE used a wrapper around the htdig binaries to enable document
searching; that was a big, ugly hack. I'd like to have something that
anyone, including other open source projects, can use to spider/index
and search documents.

> - Are there any security considerations that should be addressed at
>   this early stage (sanitizing of URL parameters, for example)?

HtDig currently has a flexible AWK-rule method for doing any URL
manipulation you can think up. I hope to provide a quick wrapper config
for that which will output an AWK rule to specifically strip a URL
parameter (it's already done in some PHP code I wrote).

--
Neal Richter
Sr. Researcher and Machine Learning Lead
Software Development
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

_______________________________________________
ht://Dig Developer mailing list: htdig-dev@lists.sourceforge.net
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
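As an illustration of the parameter-stripping idea described above, here
is a minimal sketch of an AWK rule, wrapped in a shell function, that
removes one URL parameter. The parameter name `PHPSESSID` and the
`strip_session` wrapper are purely illustrative assumptions, not HtDig's
actual rule syntax or configuration:

```shell
# Hypothetical sketch: strip the PHPSESSID parameter from a URL using
# an AWK rule. Neither the function name nor the parameter name comes
# from HtDig's real configuration.
strip_session() {
  printf '%s\n' "$1" | awk '{
    # Drop the parameter whether it follows "?" or "&"
    gsub(/[?&]PHPSESSID=[^&]*/, "")
    # If the leading "?" was removed along with the parameter,
    # promote the first remaining "&" back to "?"
    if ($0 !~ /\?/) sub(/&/, "?")
    print
  }'
}

strip_session "http://example.com/page.php?PHPSESSID=abc123&q=search"
# -> http://example.com/page.php?q=search
```

The separator-repair step matters: when the stripped parameter was the
first one, the `?` disappears with it, and the next parameter would
otherwise be left dangling after an `&`.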