Re: [htdig-dev] HTML Parser and start of CLucene conversion

2005-02-02 Thread Neal Richter

4) According to the PyLucene author, converting Java Lucene to a
  native library with gcj, then calling that library from a C
  program is hopelessly hairy and not recommended. Too bad.
5) To my eye, Nutch does not look particularly rich in features or
  configurability compared to HtDig.
  True, but its known scalability to 200+ million documents can't be beat.
And it's being supported by Yahoo Labs.

6) Word on the street is Xapian is the only competition to Lucene
  in terms of scalability among Free Software search
  cores. Gmane uses Xapian against 20+ million documents.
  Xapian is GPL, Lucene/CLucene is LGPL.  Evidently the Xapian people 
didn't read the 4th paragraph of http://www.gnu.org/philosophy/why-not-lgpl.html

Using the ordinary GPL is not advantageous for every library. There are
reasons that can make it better to use the Library GPL in certain cases. 
The most common case is when a free library's features are readily 
available for proprietary software through other alternative libraries. In that 
case, the library cannot give free software any particular advantage, so it is 
better to use the Library GPL for that library.

  Of course you (Jeff) and I disagreed on this point a while back ;-)
  That said Xapian does look impressive.
Anyway, I'm delighted to hear about this HtDig/Lucene experiment.
Points #1, #2, and #3 suggest it may make sense to consider the idea
of a pure Java HtDig which can be gcj compiled to native executables. From
my perspective as a naive HtDig user I think that would rock, but
there's probably lots of stuff I'm not thinking about. If anyone wants
to try out the gcj/Lucene thing Doug Cutting's instructions [*] work
fine provided you have gcj 3.4.x installed.
  If we really wanted a pure Java HtDig, I think we'd be better off 
throwing in with Nutch and adding the configurability of HtDig to it.

  As I see it, the primary reason that Nutch is somewhat unattractive to 
the average HtDig user is that they must know how to configure Nutch to run as
a Tomcat service, or know how to tweak the build system to build as a
standalone server.  Either is easy for a more novice user given their 
current build system and 'How-To' docs.

  HtDig is still a forked CGI app, which means that our 
users don't have to worry about starting/monitoring a server daemon.  If 
we were to throw in with Nutch at some future date, it would be nice to 
make a simple option for Nutch to be built as a forked CGI app.

  I've looked at attempting to go the PyLucene route and compile Java with
gcj and create the hairy wrapper libs for it.  It is ugly for many
reasons.

  Going with CLucene at first has the advantage that we can get the 
code reorg done, and look at replacing the CLucene APIs with the 
equivalent Java-Lucene+Wrapper ones, if it is even worth doing that.

  Thanks.
--
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485


---
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting
Tool for open source databases. Create drag--drop reports. Save time
by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc.
Download a FREE copy at http://www.intelliview.com/go/osdn_nl
___
ht://Dig Developer mailing list:
htdig-dev@lists.sourceforge.net
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev


Re: [htdig-dev] HTML Parser and start of CLucene conversion

2005-02-02 Thread Neal Richter

 As I see it, the primary reason that Nutch is somewhat unattractive to the 
average HtDig user is that they must know how to configure Nutch to run as
a Tomcat service, or know how to tweak the build system to build as a
standalone server.  Either is easy for a more novice user given their current 
build system and 'How-To' docs.
  Ha!  The above 'Either' should be 'Neither'.  Those two options are
definitely /not/ easy for the average user to make work.

--
Neal Richter 
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

