Hi Peter,
On 06/19/2013 03:28 PM, Peter Karman wrote:
On 6/19/13 2:59 AM, Moritz Lenz wrote:
Hi all,
I have about 10 million small records (less than 1kb each) that I want
to index with Lucy (through the Perl frontend). The primary data store
is a relational database.
So I create my search index, wait a day, and then want to index all the
new records/documents. For finding out which records are new, I have to
know which ones are already indexed. For 10 million records (and only a
few thousand new ones each day) it's not efficient to check each one, so
I'd like to store something like a "last indexed ID" or "last indexed
timestamp" along with the search index.
Is there any way to store such metadata along with the search index?
(I know I could store it inside the RDBMS, but that doesn't feel right
from an architectural point of view; the RDBMS shouldn't care about the
existence of the search index at all; nor do I want to lose information
about the search index when I overwrite the contents of my RDBMS' database
with a backup).
How do other people solve that problem?
Hi Moritz,
I have the exact same use case.
Lucy stores its index files in a simple directory, so you can put
whatever files you want in that same directory.
In my case, I use swish3 (via SWISH::Prog::Lucy on CPAN) to crawl and
index my documents (records), and search with Dezi.
The -N option to swish3 relies on the indexdir/swish_last_start file to
keep track of incremental changes. Only records with a modtime newer
than -N are indexed during the normal crawl, which in my case happens
every three or four minutes.
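As a rough illustration of the timestamp idea (not swish3's actual implementation), the selection step might look like the sketch below. The record layout (hash refs with an epoch-seconds `mtime` key) and the sub name `records_to_index` are assumptions made up for this example.

```perl
use strict;
use warnings;

# Return the records whose modification time is newer than $last_start.
# Each record is assumed to be a hash ref with an epoch-seconds 'mtime' key.
sub records_to_index {
    my ($records, $last_start) = @_;
    return [ grep { $_->{mtime} > $last_start } @$records ];
}

my $last_start = 1_371_600_000;   # stands in for the value read from the timestamp file
my $this_start = time;            # captured before the crawl begins

my @records = (
    { id => 1, mtime => 1_371_500_000 },   # older than the last run: skipped
    { id => 2, mtime => 1_371_650_000 },   # modified since the last run: indexed
);

my $todo = records_to_index(\@records, $last_start);
print join(',', map { $_->{id} } @$todo), "\n";   # prints: 2
```

Storing the start time of the run (`$this_start`) rather than the finish time means that records modified while the crawl is in progress are picked up again by the next run instead of being missed.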
The disadvantage of that approach is that it forces the user to
reimplement the locking that Lucy surely already implements for writing
the index.
It would be awesome to be able to do
$indexer->set_meta_data($key, $value);
$indexer->commit;
and have the commit write the metadata along with the rest of the data.
swish3 also stores a swish.xml meta file in the index dir, which tracks
metadata specific to swish3.
The $work project where I do all this is on github, including the
wrapper script for swish3:
https://github.com/publicinsightnetwork/audience-insight-repository/blob/master/bin/indexer
Thanks for the pointer.
I think for now I'll just store the last indexed document ID in a file,
and tell the users not to run multiple indexing processes in parallel.
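For illustration, a minimal sketch of that plan in plain Perl, using only core modules. The file name `last_indexed_id`, its `.lock` companion, and the three helper subs are made up for this example; none of this is part of Lucy's API. The lock is an optional extra that makes a second indexing run fail fast instead of relying on users to remember the rule.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Spec;

# Hypothetical file name; Lucy itself knows nothing about this file.
my $STATE_FILE = 'last_indexed_id';

# Read the last indexed ID from the index directory, or 0 if none is stored yet.
sub read_last_id {
    my ($index_dir) = @_;
    my $path = File::Spec->catfile($index_dir, $STATE_FILE);
    open(my $fh, '<', $path) or return 0;
    my $id = <$fh>;
    close($fh);
    return 0 unless defined $id;
    chomp $id;
    return $id =~ /\A\d+\z/ ? $id + 0 : 0;
}

# Write to a temporary file and rename it into place, so a crash
# mid-write never leaves a truncated state file behind.
sub write_last_id {
    my ($index_dir, $id) = @_;
    my $path = File::Spec->catfile($index_dir, $STATE_FILE);
    my $tmp  = "$path.tmp.$$";
    open(my $fh, '>', $tmp) or die "Cannot write $tmp: $!";
    print {$fh} "$id\n";
    close($fh) or die "Cannot close $tmp: $!";
    rename($tmp, $path) or die "Cannot rename $tmp to $path: $!";
    return;
}

# Run $code while holding an exclusive, non-blocking lock.
# A second indexing process dies immediately instead of waiting.
sub with_index_lock {
    my ($index_dir, $code) = @_;
    my $lock_path = File::Spec->catfile($index_dir, "$STATE_FILE.lock");
    open(my $lock, '>>', $lock_path) or die "Cannot open $lock_path: $!";
    flock($lock, LOCK_EX | LOCK_NB)
        or die "Another indexing run holds $lock_path\n";
    my $result = $code->();
    close($lock);   # closing the handle releases the lock
    return $result;
}
```

An indexing run would then call `with_index_lock($index_dir, sub { ... })`, read the old ID with `read_last_id` inside the callback, index and commit the newer records, and call `write_last_id` only after the commit has succeeded, so a failed run is simply retried from the old ID.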
Cheers,
Moritz