Hi all,
I've been struggling to find a good way to synchronize Solr with a large
number of records. We collect our data from a number of sources and each
source produces around 50,000 docs. Each of these documents has a sourceId
field indicating the source of the document. Now assuming we're indexing all
documents from SourceA (sourceId=SourceA), the majority of these docs
Cuong,
I accomplished this (in Collex) by attaching a batch number to each
document. When indexing a batch (or source), a GUID is generated and
every document from that batch/source gets that same identifier
attached to it. At the end of the indexing run, I delete everything
from that source that does not carry the current batch identifier.
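A minimal sketch of that batch-tagging scheme (the field names `sourceId`/`batchId` and the helper names are illustrative, not taken from Collex):

```python
import uuid


def start_batch():
    """Generate a fresh identifier for this indexing run."""
    return str(uuid.uuid4())


def tag_document(doc, source_id, batch_id):
    """Return a copy of the doc with source and batch identifiers attached."""
    tagged = dict(doc)
    tagged["sourceId"] = source_id
    tagged["batchId"] = batch_id
    return tagged


def stale_docs_query(source_id, batch_id):
    """Delete-by-query string matching everything from this source that was
    NOT touched in the current run, i.e. the stale documents."""
    return f"sourceId:{source_id} AND -batchId:{batch_id}"
```

Once the run completes, you'd post the returned query string to Solr's delete-by-query operation; anything from that source still carrying an older batch GUID disappears.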
Hi Erik,
So in your case #1, documents are reindexed under this scheme; if you
truly need to skip reindexing for some reason (why, though?), you'll
need to come up with some other mechanism. [Perhaps update could be
enhanced to ignore a duplicate id rather than reindex?]
It's
: number of records. We collect our data from a number of sources and each
: source produces around 50,000 docs. Each of these document has a sourceId
: field indicating the source of the document. Now assuming we're indexing all
: documents from SourceA (sourceId=SourceA), majority of these docs
You could MD5 the parts you care about, store that digest, fetch it
later, and compare. If there is a reliable timestamp, you could use
that instead. But either way it would be app-dependent.
In general, you need to store some info about each source document
and figure out whether it is new. This gets much hairier with a web crawl.
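A sketch of that signature idea, using an MD5 digest over just the fields you care about (the function and field names here are hypothetical, not from any Solr schema):

```python
import hashlib


def signature(doc, fields):
    """MD5 over the selected fields, in a fixed order so the digest is stable."""
    h = hashlib.md5()
    for f in sorted(fields):
        h.update(f.encode("utf-8"))
        h.update(str(doc.get(f, "")).encode("utf-8"))
    return h.hexdigest()


def needs_reindex(doc, stored_sig, fields):
    """Compare a fresh signature against the one stored at last index time."""
    return signature(doc, fields) != stored_sig
```

You would store the digest alongside the document (or in a side table keyed by id), fetch it when the source is re-crawled, and skip any document whose signature is unchanged.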