Hey all,

I'm starting a proof of concept conversion to use CLucene to replace the db.words.db ... it will still use BDB for the other db files.

        The excerpts DB can likely be eliminated as well.

In my tests so far I can index 12,000 files in 4 minutes, producing a 35 MB Lucene index. The equivalent compressed standard index files total 63 MB and take 40 minutes to create, so the Lucene run is about 10x faster and the index is a little over half the size.

Note that this is a loop inserting documents with a 'simple_doc_insert' libhtdig api... not a spidering run.

While working through the code I noticed that our HTML::parse() is fairly inefficient: it filters out the text between the no_index_start & no_index_end tags, copies the result, then parses the copy. It seems we could do the filtering in the second while loop with state variables, as we already do for TITLE tags. That would eliminate one linear scan through each document and some needless memory usage.
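To make the single-pass idea concrete, here is a minimal sketch. It is not the actual HTML::parse() code: the function name collect_indexable and the marker strings passed to it are hypothetical, and the output string only stands in for whatever the real parser does with each character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT the real HTML::parse(): walks the document once
// and uses a state flag to skip everything between the two markers, instead
// of building a filtered copy first and then parsing that copy.
std::string collect_indexable(const std::string& doc,
                              const std::string& start_marker,
                              const std::string& end_marker)
{
    std::string out;        // stand-in for the parser's per-character work
    bool skipping = false;  // state variable: are we inside a no-index region?
    std::size_t i = 0;

    while (i < doc.size()) {
        if (!skipping && !start_marker.empty() &&
            doc.compare(i, start_marker.size(), start_marker) == 0) {
            skipping = true;               // entering a no-index region
            i += start_marker.size();
        } else if (skipping && !end_marker.empty() &&
                   doc.compare(i, end_marker.size(), end_marker) == 0) {
            skipping = false;              // leaving the no-index region
            i += end_marker.size();
        } else {
            if (!skipping)
                out.push_back(doc[i]);     // only indexable text is kept
            ++i;
        }
    }
    return out;
}
```

In the real parser the branch that appends to the output would feed the existing tokenizing code directly, so no intermediate copy of the document is needed.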

Part of the reason for looking at how HTML::parse() works is to make it Unicode-safe by eliminating its 'char' dependencies.

If anyone wants to help with the parse thing feel free. All you need is to create a UTF8 HTML document with a multi-byte character and make sure the character is preserved in the debug output... we can worry about what happens to the word later.
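As a starting point for that test, something like the sketch below could generate the document and check the result. The helper names make_test_document and preserves_multibyte are made up for illustration and are not part of ht://Dig; the check assumes you have captured the parser's debug output as a string.

```cpp
#include <string>

// U+00E9 (e with acute accent) encoded as UTF-8: two bytes, 0xC3 0xA9.
const std::string kMultiByte = "\xC3\xA9";

// Hypothetical helper: builds a small UTF-8 HTML document containing
// multi-byte characters in both the title and the body.
std::string make_test_document()
{
    return "<html><head><title>caf" + kMultiByte + "</title></head>"
           "<body>na" "\xC3\xAF" "ve</body></html>";
}

// Hypothetical check: true if the two-byte sequence survived intact in
// whatever text the parser emitted (for example its debug output).
bool preserves_multibyte(const std::string& parser_output)
{
    return parser_output.find(kMultiByte) != std::string::npos;
}
```

Write the result of make_test_document() to a file, run it through the parser, and pass the captured debug output to preserves_multibyte().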

As a refresher, the reasons for strongly considering converting to CLucene are as follows:
* It's UTF8/Unicode capable
* Potentially much faster
* Supports field-based searching
* Allows us to ditch our nonstandard BDB 3.0.55
* Allows us to ditch thousands of lines of code in favor of an active project


If anyone seriously wants to help with this effort, I'd be willing to purchase a couple of copies of Lucene in Action by Erik Hatcher and Otis Gospodnetic for people. It covers the Java version, but the C++ version attempts to be a strict translation of it in every way. My copy is on the way.

At the moment I am concentrating on the spidering & indexing code to measure speed and size differences. Searching will come after that.

        Thanks.

--
Neal Richter Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485




_______________________________________________
ht://Dig Developer mailing list:
htdig-dev@lists.sourceforge.net
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
