On Fri, 9 Dec 2005, Gustave Stresen-Reuter wrote:

> Neal,
>
> I've been reading, with interest, the posts on the blog. I have a few
> questions so far.
>
> - Is htdig a competitor to Nutch? If not, could you take a few minutes
>   to clarify the differences between the two?
No! They will be complementary. I believe that HtDig is much easier to
manage and clearly more flexible to configure than Nutch.

Nutch is very powerful in terms of its scalability, both in document
count and in simultaneous searches. It uses an Apache Tomcat server to
service requests, and it can scale (and has scaled) to 200 million
documents. It's written in 100% Java as a full application built on Java
Lucene. It's great. However, I do believe that getting a Tomcat server
up and running, along with a JVM and the other associated
infrastructure, is a bit beyond the capabilities of a lot of our users.
It's not quite as simple as compiling and installing the binaries or
installing a package. I may be underestimating our users, but I base
this assessment on reading the htdig-general list.

HtDig 4.0 will be easy to compile, configure, and install, and/or to
install via RPM or another package manager. It won't require the user to
keep a server daemon running, and it will continue to provide a massive
variety of flexible configuration options. The addition of the CLucene
library underneath will enable HtDig to achieve good scalability in
document count.

The way I see it, HtDig 4.0 is for the classic use case: a site-specific
search engine for modestly sized websites that don't get tons of search
hits per second. Nutch is for people who have large document sets and/or
lots of search hits per unit time, and who need a multi-threaded server
daemon to handle the load.

FYI: Doug Cutting, the leader of Nutch & Lucene, was one of the original
authors of Excite and has been doing IR for 15+ years. Doug's aims are
much higher in terms of what Nutch is.

> - What, if any, modifications to the ranking engine will be made in 4.0
>   (saw the note about back-links and anchor texts - what about incoming
>   links from other domains)?
>
> - It seems the goal is to create a library that can be included in
>   other programs.
>   Will the library include all the code for spidering,
>   creating the indexes, and searching, or just the database creation
>   stuff, or something else...?

HtDig is an application for users, but we are architecting 4.0 in such a
way that it can be used as a library in other applications. For a while
KDE used a wrapper around the htdig binaries to enable document
searching; that was a big, ugly hack. I'd like to have something that
anyone, including other open source projects, can use to spider/index
and search documents.

> - Are there any security considerations that should be addressed at
>   this early stage (sanitizing of URL parameters, for example)?

HtDig currently has a flexible AWK-rule method for doing any URL
manipulation you can think up. I hope to provide a quick wrapper config
for that which will output an AWK rule to specifically strip a URL
parameter (it's already done in some PHP code I wrote).

--
Neal Richter
Sr. Researcher and Machine Learning Lead
Software Development
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

_______________________________________________
ht://Dig Developer mailing list: htdig-dev@lists.sourceforge.net
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
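As an illustration of the parameter-stripping idea described above, here
is a minimal sketch of an AWK rule, wrapped in a shell function, that
removes one URL parameter. The parameter name `PHPSESSID` and the
`strip_session` wrapper are purely illustrative assumptions, not HtDig's
actual rule syntax or configuration:

```shell
# Hypothetical sketch: strip the PHPSESSID parameter from a URL using
# an AWK rule. Neither the function name nor the parameter name comes
# from HtDig's real configuration.
strip_session() {
  printf '%s\n' "$1" | awk '{
    # Drop the parameter whether it follows "?" or "&"
    gsub(/[?&]PHPSESSID=[^&]*/, "")
    # If the leading "?" was removed along with the parameter,
    # promote the first remaining "&" back to "?"
    if ($0 !~ /\?/) sub(/&/, "?")
    print
  }'
}

strip_session "http://example.com/page.php?PHPSESSID=abc123&q=search"
# -> http://example.com/page.php?q=search
```

The separator-repair step matters: when the stripped parameter was the
first one, the `?` disappears with it, and the next parameter would
otherwise be left dangling after an `&`.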