On 6/19/13 2:59 AM, Moritz Lenz wrote:
Hi all,

I have about 10 million small records (less than 1kb each) that I want
to index with Lucy (through the Perl frontend). The primary data store
is a relational database.

So I create my search index, wait a day, and then want to index all the
new records/documents. For finding out which records are new, I have to
know which ones are already indexed. For 10 million records (and only a
few thousand new ones each day) it's not efficient to check each one, so
I'd like to store something like a "last indexed ID" or "last indexed
timestamp" along with the search index.

Is there any way to store such metadata along with the search index?

(I know I could store it inside the RDBMS, but that doesn't feel right
from an architectural point of view; the RDBMS shouldn't care about the
existence of the search index at all; nor do I want to lose information
about the search index when I overwrite the contents of my RDBMS' database
with a backup).

How do other people solve that problem?


Hi Moritz,

I have the exact same use case.

Lucy stores its index files in a simple directory, so you can put whatever files you want in that same directory.
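
To illustrate the idea of keeping a small state file next to the index, here is a minimal sketch. It is written in Python purely for illustration (the same approach works from Perl), it does not use any Lucy API, and the file name indexer_state.json and the key last_indexed_id are made-up names, not anything Lucy or swish3 defines.

```python
import json
import os
import tempfile

# Hypothetical file name; Lucy itself knows nothing about this file.
STATE_FILENAME = "indexer_state.json"


def load_state(index_dir):
    """Return the saved state dict, or an empty dict if none exists yet."""
    path = os.path.join(index_dir, STATE_FILENAME)
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(index_dir, state):
    """Write the state via a temp file in the same directory, then rename it
    into place, so a crash never leaves a half-written state file behind."""
    path = os.path.join(index_dir, STATE_FILENAME)
    fd, tmp_path = tempfile.mkstemp(dir=index_dir, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A run would call load_state() before querying the database for new records and save_state() only after the index commit has succeeded, so the marker never gets ahead of the index.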

In my case, I use swish3 (via SWISH::Prog::Lucy on CPAN) to crawl and index my documents (records), and search with Dezi.

The -N option to swish3 relies on the indexdir/swish_last_start file to keep track of incremental changes. Only records with a modtime newer than -N are indexed during the normal crawl, which in my case happens every three or four minutes.
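
The general pattern behind a "last start" marker can be sketched roughly as follows. This is not swish3's implementation, just an illustration in Python under some assumptions: the file name last_start, the function names, and the (record_id, modtime) record shape are all hypothetical.

```python
import os
import time

# Hypothetical marker file; not the file swish3 itself maintains.
LAST_START_FILENAME = "last_start"


def read_last_start(index_dir):
    """Return the start time of the previous run, or 0.0 on the first run."""
    path = os.path.join(index_dir, LAST_START_FILENAME)
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return float(fh.read().strip())
    except FileNotFoundError:
        return 0.0


def incremental_run(index_dir, records, index_one, now=None):
    """Index only records modified since the previous run started.

    records   -- iterable of (record_id, modtime) pairs
    index_one -- callable that indexes a single record id
    now       -- override for the current time (useful in tests)
    """
    started = time.time() if now is None else now
    last_start = read_last_start(index_dir)
    indexed = []
    for record_id, modtime in records:
        if modtime > last_start:
            index_one(record_id)
            indexed.append(record_id)
    # Advance the marker only after every record was indexed successfully.
    path = os.path.join(index_dir, LAST_START_FILENAME)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(repr(started))
    return indexed
```

Recording the time the run started, rather than the time it finished, means a record modified while the run is in progress is picked up again next time instead of being missed; that is harmless as long as re-indexing a record replaces the old document.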

swish3 also stores a swish.xml meta file in the index dir, which tracks metadata specific to swish3.

The $work project where I do all this is on github, including the wrapper script for swish3:

https://github.com/publicinsightnetwork/audience-insight-repository/blob/master/bin/indexer

--
Peter Karman  .  http://peknet.com/  .  [email protected]
