Jesse Erlbaum scribbled on 3/19/07 11:07 AM:
Hi Peter --
As one of the Swish-e developers, I can second Michael's
endorsement. ;)
I've used Swish-e many times over the years. Do you think it is
lacking in any way compared to Xapian or Plucene? What about compared
to commercial systems such as Verity or Autonomy?
I can't compare Swish-e to any commercial systems since I haven't used them.
As Tim just posted, I wouldn't even consider Plucene at this point in its life
cycle, unless you're talking about a small, fairly static site. Performance is
Not Good.
Compared to Xapian: Swish-e doesn't do UTF-8 or good incremental indexing.
Swish-e is very fast at both indexing and search; I did see a benchmark once
that showed Swish-e was faster than Xapian but that was some time ago.
Xapian svn has a UTF-8 version; it's due to be official with the 1.0 release,
whenever that is.
Xapian does have good incremental indexing. It's also a library (unlike Swish-e)
so it has some more flexibility.
Swish-e has a built-in HTML/XML parser, which is pretty good (especially if you
use libxml2). Xapian has Omega, an additional package that does some HTML
parsing iirc.
I find Swish-e "just works" a little more "out of the box" than Xapian, but the
two big missing features (UTF-8 and incremental indexing) are show-stoppers if
your project requires them.
See also my article here, comparing Lucene, Xapian and Swish-e:
http://dewey.library.nd.edu/mylibrary/manual/ch/ch17.html
I've felt that Swish-e was a bit "long in the tooth" owing to its
legacy. Do you disagree? If you were not deeply involved in the
Swish-e project, would you choose it over Xapian or Plucene (or any
other system)?
Swish-e is old, true. The 2.x versions added a lot of features, but those are
getting on 6 years old now too. Still, things that Work don't need to be New, do
they? ;)
Would I choose it over Plucene? Not a question. Xapian? Well, it would depend on
a couple things:
(1) data set. Am I indexing data that is fairly static or mostly dynamic?
Example: static HTML or PDF docs, vs providing fulltext search for a database.
Swish-e is fast enough and has merging and multi-index search features that let
you get around the lack of incremental indexing, so if your data doesn't change
much, I'd go with Swish-e. If it does change a lot (e.g., you need to update
your index every time you update your db), then I'd probably go with Xapian.
(2) i18n. Swish-e was first written back in the mid-90s so Unicode wasn't even a
consideration. There are lots of optimizations in the C code that assume 1 byte
= 1 character and so things like UTF-8 Just Don't Work.
You can get around that (as I do) with things like Search::Tools::Transliterate
(shameless plug) but if you need Real I18N Support, I'd be going with Xapian.
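
To make point (2) concrete, here is a rough sketch of why the "1 byte = 1
character" assumption breaks on UTF-8, and of the general transliterate-to-ASCII
workaround. It is in Python, not Perl, and it is not the algorithm that
Search::Tools::Transliterate uses; the to_ascii name is made up for this
example, and Python's standard unicodedata module stands in for the real thing.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Hypothetical helper: fold accented Latin text down to plain ASCII."""
    # NFKD splits "e with acute" into "e" plus a separate combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (e.g. CJK) becomes "?" in this sketch.
    return stripped.encode("ascii", "replace").decode("ascii")


word = "caf\u00e9"            # "cafe" with an acute accent on the e
utf8 = word.encode("utf-8")

# Four characters, but five bytes: code that counts bytes gets this wrong.
print(len(word), len(utf8))

# ASCII-only text is safe to hand to a byte-oriented indexer.
print(to_ascii("Cr\u00e8me br\u00fbl\u00e9e"))
```

The same folding has to be applied to the query string at search time,
otherwise the accented query won't match the transliterated index. It also
throws information away, which is why it is a workaround and not Real I18N
Support.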
That said, I would suggest Xapian over Plucene, hands down.
Why do you say? Any particular gripes?
See Tim's recent post.
And I would also check out KinoSearch, which is (along with
Xapian) going to be one of the optional backend IR libraries
for the next version of Swish-e.
I've heard of KinoSearch before, but I've not tried it out. It seemed
that Plucene and Xapian had more established Perl interfaces, but I
could be wrong.
Tim summarized it well. KinoSearch is new-ish, but Marvin is really cranking out
some quality stuff. And it's all C and Perl, so if those are your primary
languages, the barrier to hacking on it yourself is all the lower.
If I had to do a full UTF-8 compatible, robust, incremental, highly scalable
search application tomorrow, I'd be looking seriously at KS and Xapian. Which is
why Swish-e will offer those 2 (among others) as backends for version 3. :)
pek
--
Peter Karman . http://peknet.com/ . [EMAIL PROTECTED]