Chris Anderson wrote:
Sphinx is not the best contender for integration, because of its
limited support for incremental updates. It is, however, a good
boundary condition on how to design the Indexer API so that a wide
range of search engines can work with CouchDB.
Sphinx is going to support real-time updates in one of the next few
releases, so that won't be a problem much longer.
However, there's a different problem with using Sphinx to search CouchDB:
Sphinx is not designed to index documents with differing structures. All
documents in an index have to follow the same structure. You can still
use Sphinx with CouchDB very well if you only index views. You have to
know the exact structure of all view results; then you can tell
Sphinx about the structure and it will be able to index the results.
But if you want to search an arbitrary CouchDB database, it gets a
lot more complicated. Sphinx only supports a fixed number of fulltext
searchable text fields per document (32). That number is definitely high
enough for most documents, but it does not reflect CouchDB's
flexibility. In order to use Sphinx on a dynamic schema you would have
to go through all documents to create a mapping of the hierarchically
stored values into a one-dimensional associative array (two-dimensional
for the multi-value attributes) and then store this mapping with each
document. Then you can go through the documents and extend the static
schema for every document that requires an additional field. You can
either reuse fields, which makes grouping and sorting useless
because each field has a different meaning for each document, or you
can leave a lot of fields empty, which creates a huge overhead.
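
To make the flattening step above concrete, here is a minimal sketch in Python. The function name flatten and the dotted-path key format are my own assumptions for illustration; nothing here is part of Sphinx or CouchDB. Every path maps to a list of values, so multi-value fields fit in the same structure.

```python
def flatten(doc, prefix=""):
    """Flatten a nested JSON-like dict into {dotted_path: [values]}.

    Hypothetical helper for illustration only. Scalars become
    one-element lists; lists become multi-value entries.
    """
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        # Treat a single value as a one-element list to share one code path.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into nested objects and merge their paths.
                for sub_path, sub_values in flatten(item, path).items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


doc = {"title": "Hello", "author": {"name": "Nils"}, "tags": ["a", "b"]}
flat = flatten(doc)
```

Each distinct path in the result would need its own field in the static schema, which is where the limit of 32 fulltext fields and the empty-field overhead come in.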
An alternative would be to create a lot of indexes with different
schemas, as Sphinx supports searching multiple indexes at a time. But I
doubt this idea scales well if you have a different schema on every
document.
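
A rough way to estimate how many indexes that would take is to group documents by their set of field names. The sketch below is only an illustration: group_by_schema is a hypothetical helper, it looks at top-level keys only, and it skips CouchDB's underscore-prefixed fields such as _id and _rev.

```python
from collections import defaultdict


def group_by_schema(docs):
    """Group document ids by their sorted tuple of top-level field names.

    Hypothetical helper: one group corresponds to one index schema.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Ignore CouchDB metadata fields like _id and _rev.
        signature = tuple(sorted(k for k in doc if not k.startswith("_")))
        groups[signature].append(doc["_id"])
    return dict(groups)


docs = [
    {"_id": "a", "title": "x", "body": "y"},
    {"_id": "b", "title": "z", "body": "w"},
    {"_id": "c", "name": "n"},
]
groups = group_by_schema(docs)
```

If the number of groups grows roughly with the number of documents, the one-index-per-schema idea is not going to work.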
So my approach to integration was rather to allow Sphinx to use CouchDB
as a data source. You can then configure Sphinx to index a certain view,
and the view has to produce one-dimensional JSON results that work
for Sphinx. Searching then does not use CouchDB's REST API at all. This
method works fine for applications where many documents have the same
structure (like the demo forum or an article/comments site like a blog)
or for applications where the number of structures that documents can
have is limited (you can then create a mapping to one larger common
structure). However, this will not be useful to any application that really
makes use of CouchDB's flexible structure, so I certainly hope there'll
be other systems available for searching.
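
For the case with a limited number of structures, the mapping to one larger common structure could look roughly like the sketch below. The document types, field names, and the to_common_row helper are all made up for illustration. In practice this logic would live in the view itself; Python is used here only to show the idea.

```python
# Hypothetical common structure shared by all indexed rows.
COMMON_FIELDS = ("type", "title", "body", "author")

# Hypothetical per-type mapping: common field -> field name in the document.
FIELD_MAPS = {
    "post": {"title": "subject", "body": "text", "author": "user"},
    "comment": {"body": "comment", "author": "name"},
}


def to_common_row(doc):
    """Map a document of a known type to one flat row with fixed fields."""
    mapping = FIELD_MAPS[doc["type"]]
    # Start with every field empty so all rows have the same structure.
    row = {field: "" for field in COMMON_FIELDS}
    row["type"] = doc["type"]
    for target, source in mapping.items():
        row[target] = str(doc.get(source, ""))
    return row


post_row = to_common_row(
    {"type": "post", "subject": "Hi", "text": "First post", "user": "nils"}
)
comment_row = to_common_row(
    {"type": "comment", "comment": "Nice", "name": "chris"}
)
```

Every row comes out with the same four fields, so a single static schema covers both types. Fields a type does not have stay empty.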
Cheers!
Nils