Re: [lucy-dev] Storing meta data along with the index

Moritz Lenz Thu, 20 Jun 2013 23:43:51 -0700

Hi Peter,

On 06/19/2013 03:28 PM, Peter Karman wrote:

On 6/19/13 2:59 AM, Moritz Lenz wrote:

 Hi all,


 I have about 10 million small records (less than 1kb each) that I want
 to index with Lucy (through the Perl frontend). The primary data store
 is a relational database.

 So I create my search index, wait a day, and then want to index all the
 new records/documents. For finding out which records are new, I have to
 know which ones are already indexed. For 10 million records (and only a
 few thousand new ones each day) it's not efficient to check each one, so
 I'd like to store something like a "last indexed ID" or "last indexed
 timestamp" along with the search index.

 Is there any way to store such meta data along with the search index?

 (I know I could store it inside the RDBMS, but that doesn't feel right
 from an architectural point of view; the RDBMS shouldn't care about the
 existence of the search index at all; nor do I want to lose information
 about the search index when I overwrite the contents of my RDBMS' database
 with a backup).

 How do other people solve that problem?



Hi Moritz,

I have the exact same use case.

Lucy stores its index files in a simple directory, so you can put
whatever files you want to in that same directory.

In my case, I use swish3 (via SWISH::Prog::Lucy on CPAN) to crawl and
index my documents (records), and search with Dezi.

The -N option to swish3 relies on the indexdir/swish_last_start file to
keep track of incremental changes. Only records with a modtime newer
than -N are indexed during the normal crawl, which in my case happens
every three or four minutes.

The disadvantage of that approach is that it forces the user to
reimplement the locking that Lucy surely already implements for writing
the index.


It would be awesome to be able to do

$indexer->set_meta_data($key, $value);
$indexer->commit;

and have the commit write the meta data along with the rest of the data.

swish3 also stores a swish.xml meta file in the index dir, which tracks
metadata specific to swish3.

The $work project where I do all this is on github, including the
wrapper script for swish3:

https://github.com/publicinsightnetwork/audience-insight-repository/blob/master/bin/indexer


Thanks for the pointer.

I think for now I'll just store the last indexed document ID in a file,
and tell the users not to run multiple indexing processes in parallel.
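
A minimal sketch of that approach, using only core Perl modules and no
Lucy calls. The file name last_indexed_id, the lock file next to it, and
the three helper names are hypothetical and not part of Lucy; the sketch
assumes, as Peter describes above, that extra files in the index directory
are left alone. The non-blocking flock is only a cheap guard against an
accidental second run, not a replacement for Lucy's own write lock.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Spec;

# Hypothetical sidecar file name; Lucy itself knows nothing about it.
my $META_FILE = 'last_indexed_id';

# Return the last indexed ID stored next to the index, or 0 if there is
# no file yet or its content is not a plain integer.
sub read_last_id {
    my ($index_dir) = @_;
    my $path = File::Spec->catfile($index_dir, $META_FILE);
    open my $fh, '<', $path or return 0;
    my $id = <$fh>;
    close $fh;
    return 0 unless defined $id;
    chomp $id;
    return $id =~ /\A\d+\z/ ? $id + 0 : 0;
}

# Write the ID to a temporary file and rename it into place, so that a
# crash never leaves a half-written file behind.
sub write_last_id {
    my ($index_dir, $id) = @_;
    my $path = File::Spec->catfile($index_dir, $META_FILE);
    my $tmp  = "$path.tmp.$$";
    open my $fh, '>', $tmp or die "Cannot write $tmp: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $tmp: $!";
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return;
}

# Call $code->($last_id) while holding an exclusive lock. $code is
# expected to index everything newer than $last_id, commit, and return
# the new highest ID (or undef if nothing was indexed).
sub with_index_lock {
    my ($index_dir, $code) = @_;
    my $lock_path = File::Spec->catfile($index_dir, "$META_FILE.lock");
    open my $lock, '>>', $lock_path or die "Cannot open $lock_path: $!";
    flock($lock, LOCK_EX | LOCK_NB)
        or die "Another indexing run seems to be active\n";
    my $new_id = $code->(read_last_id($index_dir));
    write_last_id($index_dir, $new_id) if defined $new_id;
    close $lock;    # closing the handle releases the lock
    return $new_id;
}
```

The callback passed to with_index_lock would be the place to select the
records with an ID above $last_id from the database, add them to the
index, and commit. The ID file is only updated after the callback
returns, so a failed run is simply retried from the old ID next time.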


Cheers,
Moritz
