On 6/19/13 2:59 AM, Moritz Lenz wrote:
Hi all,
I have about 10 million small records (less than 1kb each) that I want
to index with Lucy (through the Perl frontend). The primary data store
is a relational database.
So I create my search index, wait a day, and then want to index all the
new records/documents. For finding out which records are new, I have to
know which ones are already indexed. With 10 million records (and only a few
thousand new ones each day) it's not efficient to check each one, so I'd like
to store something like a "last indexed ID" or "last indexed timestamp"
along with the search index.
Is there any way to store such metadata along with the search index?
(I know I could store it inside the RDBMS, but that doesn't feel right
from an architectural point of view; the RDBMS shouldn't care about the
existence of the search index at all; nor do I want to lose information
about the search index when I overwrite the contents of my RDBMS database
with a backup).
How do other people solve that problem?
Hi Moritz,
I have the exact same use case.
Lucy stores its index files in a simple directory, so you can put
whatever files you want in that same directory.
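A minimal sketch of that idea in Python (the same approach works from Perl): keep a small state file next to the index files. The file name `last_indexed.json` and both helper functions are made up for illustration and are not part of Lucy. The sketch assumes, as described above, that Lucy leaves files it does not recognize in the index directory alone.

```python
import json
import os
import tempfile

MARKER_NAME = "last_indexed.json"  # hypothetical name, not a Lucy file


def read_marker(index_dir, default=0):
    """Return the last indexed ID, or `default` if no marker exists yet."""
    path = os.path.join(index_dir, MARKER_NAME)
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)["last_indexed_id"]
    except FileNotFoundError:
        return default


def write_marker(index_dir, last_id):
    """Write the marker via a temp file plus rename, so readers never
    see a half-written marker."""
    fd, tmp_path = tempfile.mkstemp(dir=index_dir, prefix=".marker-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"last_indexed_id": last_id}, fh)
        os.replace(tmp_path, os.path.join(index_dir, MARKER_NAME))
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        os.unlink(tmp_path)
        raise
```

It is probably safest to call `write_marker` only after the indexer has committed successfully, so the marker never runs ahead of what is actually in the index.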
In my case, I use swish3 (via SWISH::Prog::Lucy on CPAN) to crawl and
index my documents (records), and search with Dezi.
The -N option to swish3 relies on the indexdir/swish_last_start file to
keep track of incremental changes. Only records with a modtime newer
than -N are indexed during the normal crawl, which in my case happens
every three or four minutes.
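The general pattern behind a "last start" file can be sketched as follows in Python. This is not swish3's actual implementation; the file name `last_start`, the `(record_id, modtime)` record shape, and the `index_record` callback are all stand-ins invented for the example. Recording the start time of the run rather than its finish time means that records modified while a crawl is in progress get picked up by the next run.

```python
import os
import time

STAMP_NAME = "last_start"  # hypothetical file name for this sketch


def incremental_run(index_dir, records, index_record, now=None):
    """Index records modified since the previous run's start time.

    `records` is an iterable of (record_id, modtime) pairs and
    `index_record` is a callback; both are placeholders for the real
    data source and indexer.
    """
    started = time.time() if now is None else now
    stamp_path = os.path.join(index_dir, STAMP_NAME)
    try:
        with open(stamp_path, encoding="utf-8") as fh:
            last_start = float(fh.read().strip())
    except FileNotFoundError:
        last_start = 0.0  # first run: index everything

    indexed = []
    for record_id, modtime in records:
        if modtime > last_start:
            index_record(record_id)
            indexed.append(record_id)

    # Advance the stamp only after the whole run has succeeded.
    with open(stamp_path, "w", encoding="utf-8") as fh:
        fh.write(repr(started))
    return indexed
```

With a relational database as the primary store, the loop would normally become a query on the modification-time column instead of a scan over every record.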
swish3 also stores a swish.xml meta file in the index dir, which tracks
metadata specific to swish3.
The $work project where I do all this is on GitHub, including the
wrapper script for swish3:
https://github.com/publicinsightnetwork/audience-insight-repository/blob/master/bin/indexer
--
Peter Karman . http://peknet.com/