Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi all, I've been struggling to find a good way to synchronize Solr with a large number of records. We collect our data from a number of sources and each source produces around 50,000 docs. Each of these documents has a sourceId field indicating the source of the document. Now assuming we're

Re: Synchronize large number of records with Solr

2007-09-14 Thread Erik Hatcher
Cuong, I accomplished this (in Collex) by attaching a batch number to each document. When indexing a batch (or source), a GUID is generated and every document from that batch/source gets that same identifier attached to it. At the end of the indexing run, I delete everything with that
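
The batch-number scheme described above can be sketched as follows. This is only an illustration, not Erik's Collex code: the snippet is cut off before it says exactly what gets deleted, so the sketch assumes the delete targets documents from the same source that do not carry the current batch identifier. The index is a plain in-memory dict standing in for Solr, and `sync_source` and the `batchId` field name are hypothetical.

```python
import uuid


def sync_source(index, source_id, incoming_docs):
    """Reindex one source and drop its documents that were not seen this run.

    index         -- dict mapping document id -> document (stand-in for Solr)
    source_id     -- value stored in each document's sourceId field
    incoming_docs -- iterable of dicts, each with at least an "id" key
    """
    # One identifier per indexing run, shared by every document in the batch.
    batch_id = str(uuid.uuid4())

    # Add or overwrite every incoming document, stamped with the batch id.
    for doc in incoming_docs:
        stamped = dict(doc, sourceId=source_id, batchId=batch_id)
        index[stamped["id"]] = stamped

    # Assumption: anything from this source with a different batch id is stale.
    stale = [
        doc_id
        for doc_id, doc in index.items()
        if doc["sourceId"] == source_id and doc["batchId"] != batch_id
    ]
    for doc_id in stale:
        del index[doc_id]

    return batch_id, stale
```

Against a real Solr instance the final step would presumably be a single delete-by-query along the lines of `sourceId:SourceA AND NOT batchId:<guid>`, with field names adjusted to the actual schema.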

Re: Synchronize large number of records with Solr

2007-09-14 Thread climbingrose
Hi Erik, So in your case #1, documents are reindexed with this scheme - so if you truly need to skip a reindexing for some reason (why, though?) you'll need to come up with some other mechanism. [perhaps update could be enhanced to allow ignoring a duplicate id rather than reindexing?] It's

Re: Synchronize large number of records with Solr

2007-09-14 Thread Chris Hostetter
: number of records. We collect our data from a number of sources and each
: source produces around 50,000 docs. Each of these documents has a sourceId
: field indicating the source of the document. Now assuming we're indexing all
: documents from SourceA (sourceId=SourceA), majority of these docs

Re: Synchronize large number of records with Solr

2007-09-14 Thread Walter Underwood
You could MD4 the parts you care about, store that, fetch it and compare. If there is a reliable timestamp, you could use that. But that would be app-dependent. In general, you need to store some info about each source document and figure out whether it is new. This gets much hairier with a web
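
A rough sketch of the hash-and-compare idea, under some assumptions: it uses SHA-256 from Python's hashlib instead of MD4 (MD4 is not guaranteed to be available in hashlib), it keeps the stored digests in a plain dict rather than in Solr or a database, and `fingerprint`, `classify` and the field names are made up for illustration.

```python
import hashlib


def fingerprint(doc, fields):
    """Hash only the fields you care about, in a fixed order."""
    h = hashlib.sha256()
    for name in fields:
        value = str(doc.get(name, ""))
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(value.encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()


def classify(incoming_docs, stored_hashes, fields):
    """Split incoming document ids into new, changed and unchanged.

    stored_hashes maps document id -> digest from the previous run and is
    updated in place so it can be reused for the next run.
    """
    new, changed, unchanged = [], [], []
    for doc in incoming_docs:
        digest = fingerprint(doc, fields)
        previous = stored_hashes.get(doc["id"])
        if previous is None:
            new.append(doc["id"])
        elif previous != digest:
            changed.append(doc["id"])
        else:
            unchanged.append(doc["id"])
        stored_hashes[doc["id"]] = digest
    return new, changed, unchanged
```

Only the ids in the new and changed lists would need to be sent for reindexing; the unchanged ones can be skipped.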