I'll try to answer (or dodge) some of these questions.

- Is htdig a competitor to Nutch? If not, could you take a few minutes 
to clarify the differences between the two?

This is a good one for Neal to answer. I can tell you that I'm expecting
the new ht://Dig to epitomize a fast, lightweight and scalable
domain-specific search engine. Nutch, Omega and similar projects all
have their strengths (again, maybe Neal can talk about that), but one of
the big strengths of ht://Dig is the vast array of options and settings
that are available to the user. While some of these are going away
because they are no longer applicable, we are committed to keeping as
many of the nice bells and whistles as we can.


- What, if any, modifications to the ranking engine will be made in 4.0 
(saw the note about back-links and anchor texts - what about incoming 
links from other domains)?

The ranking engine will be moved over to CLucene. Right now, the CLucene
database can hold anything we want (the API is highly extensible), and
we're working on making things like backlink counts and link
descriptions work efficiently. As for links from external domains, that
is really outside the scope of ht://Dig, since it is primarily a
single-site (or small group of sites) crawler.
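
To make the backlink idea concrete, here is a rough Python sketch of
counting backlinks and collecting link descriptions (anchor texts) for
a single-site crawl. It is only an illustration, not ht://Dig or
CLucene code: the function name backlink_stats, the (source, target,
anchor text) tuple layout and the example.org URLs are all assumptions
made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def backlink_stats(links, site_hosts):
    """Count in-site backlinks and collect anchor texts per target URL.

    links      -- iterable of (source_url, target_url, anchor_text) tuples
                  (a hypothetical layout, assumed for this sketch)
    site_hosts -- set of host names that belong to the crawled site(s)
    """
    counts = defaultdict(int)
    anchors = defaultdict(list)
    for source, target, text in links:
        # Links from or to hosts outside the crawled sites are ignored,
        # matching the single-site scope described above.
        if urlsplit(source).hostname not in site_hosts:
            continue
        if urlsplit(target).hostname not in site_hosts:
            continue
        # A page linking to itself is not counted as a backlink.
        if source == target:
            continue
        counts[target] += 1
        # Collapse runs of whitespace in the link description.
        text = " ".join(text.split())
        if text:
            anchors[target].append(text)
    return dict(counts), dict(anchors)


# Made-up sample data for illustration only.
SAMPLE_LINKS = [
    ("http://www.example.org/", "http://www.example.org/docs.html", "Documentation"),
    ("http://www.example.org/faq.html", "http://www.example.org/docs.html", "  the   docs "),
    ("http://other.example.com/", "http://www.example.org/docs.html", "external"),
    ("http://www.example.org/docs.html", "http://www.example.org/docs.html", "self"),
]

if __name__ == "__main__":
    counts, anchors = backlink_stats(SAMPLE_LINKS, {"www.example.org"})
    for url, n in counts.items():
        print(url, n, anchors.get(url, []))
```

In a real indexer the counts and descriptions would presumably be
stored as extra fields alongside each document, but how that maps onto
the CLucene database is not covered here.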


- It seems the goal is to create a library that can be included in 
other programs. Will the library include all the code for spidering, 
creating the indexes, and searching or just the database creation 
stuff, or something else...?

Creating a library is exactly what we're shooting for. It will contain
the ability to spider and push documents into a CLucene database. For
searching, we essentially want to be able to stick any appropriate
wrapper on top of ht://Dig and be able to do searches. I've written
about this on the blog, but what I'd like to do is separate the htsearch
options from the htdisplay options. Search options can be sent down to
the library, and search results can be returned in some kind of XML
format to the wrapper. The wrapper can then do whatever it wants with
the results, whether that is CGI output or pretty-printing.
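
As a rough illustration of that split, here is a minimal Python sketch
of a wrapper that takes XML results and renders them as an HTML list.
The XML layout (results/hit/url/title, the score attribute) and the
function names are purely hypothetical, since the actual result format
is not specified above; only the idea of "library returns XML, wrapper
decides the presentation" comes from the description.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical result document; element and attribute names are assumed.
SAMPLE_RESULTS = """<results query="clucene" total="2">
  <hit score="0.92">
    <url>http://www.example.org/docs/index.html</url>
    <title>Indexing &amp; searching</title>
  </hit>
  <hit score="0.41">
    <url>http://www.example.org/faq.html</url>
    <title>FAQ</title>
  </hit>
</results>"""


def parse_results(xml_text):
    """Turn the (assumed) XML result format into a list of dicts."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.findall("hit"):
        hits.append({
            "url": hit.findtext("url", default=""),
            "title": hit.findtext("title", default=""),
            "score": float(hit.get("score", "0")),
        })
    return hits


def render_html(hits):
    """One possible presentation layer: an ordered HTML list."""
    items = []
    for h in hits:
        # Escape everything that ends up in the page.
        items.append('<li><a href="%s">%s</a> (%.2f)</li>' % (
            escape(h["url"], quote=True), escape(h["title"]), h["score"]))
    return "<ol>\n" + "\n".join(items) + "\n</ol>"


if __name__ == "__main__":
    print(render_html(parse_results(SAMPLE_RESULTS)))
```

A different wrapper could consume the same XML and emit plain text or
feed a template engine instead, without the search side changing.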

Since we're still in beta (or alpha, since I keep writing stupid bugs),
we're using Luke to verify that indexes are created correctly and are
valid. Luke (http://www.getopt.org/luke/) is a toolbox designed to
interact with Java Lucene indexes, but since CLucene follows the same
index format, we can use it for our own purposes.


- Are there any security considerations that should be addressed at 
this early stage (sanitizing of URL parameters, for example)?

Uhh... Neal?




Anyway, I'm planning on making a tag in CVS soon that everyone can
download and try. There is an htdig_4_0 branch right now, but it is
lacking certain parts, namely the CLucene back end. We're working on
adding CLucene to the make scripts; right now we're doing builds the
hard way.

I hope this answered some of your questions, and I hope that Neal can
step in and answer a few more. I've been bad about updating the blog on
a regular basis, but hopefully I can get myself in gear and let everyone
know the day-to-day progress. Feel free to leave comments on my posts,
too.


Anthony






-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Gustave Stresen-Reuter
Sent: Friday, December 09, 2005 5:08 AM
To: Richter, Neal
Cc: htdig-dev@lists.sourceforge.net
Subject: Re: [htdig-dev] htdig 4.0 updates

Neal,

I've been reading, with interest, the posts on the blog. I have a few 
questions so far.

- Is htdig a competitor to Nutch? If not, could you take a few minutes 
to clarify the differences between the two?

- What, if any, modifications to the ranking engine will be made in 4.0 
(saw the note about back-links and anchor texts - what about incoming 
links from other domains)?

- It seems the goal is to create a library that can be included in 
other programs. Will the library include all the code for spidering, 
creating the indexes, and searching or just the database creation 
stuff, or something else...?

- Are there any security considerations that should be addressed at 
this early stage (sanitizing of URL parameters, for example)?

I'm not a C developer, but I'm more than happy to try building the 
project on Linux and Mac OS X (10.3). Is there a 4.0 branch in CVS or 
will we have to wait for you to tag it?

Thanks for the work.

Gustave (Ted) Stresen-Reuter

On Dec 8, 2005, at 6:05 PM, Neal Richter wrote:

> Hey all,
>
>   We've been making good progress on HtDig 4.0
>
>   You can see the progress updates on this blog.
>
>   http://htdig.blogspot.com/
>
>   Thanks.
>
> -- 
> Neal Richter
> Sr. Researcher and Machine Learning Lead
> Software Development
> RightNow Technologies, Inc.
> Customer Service for Every Web Site
> Office: 406-522-1485
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
> for problems?  Stop!  Download the new AJAX search engine that makes
> searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
> http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
> _______________________________________________
> ht://Dig Developer mailing list:
> htdig-dev@lists.sourceforge.net
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-dev


