Poor man's shard splitting

Erick Erickson Thu, 15 Nov 2012 08:25:44 -0800

Hmmm, I actually suspect this would fall down for the very data sets where
shard-splitting is most useful (i.e. a huge corpus that needs more shards
and re-indexing is a very costly operation), but I thought I'd toss it out
there.


Basically, if SolrCloud had a special delete mode or even a special add
mode that specified the _old_ number of shards, could it do an
under-the-covers delete to the right shard and then add normally? Something
like
<add>
  <doc oldShardCount="###">
    <field..../>
  </doc>
</add>

That would give SolrCloud the ability to know which shard to delete the old
doc from, and after that adding the new doc would proceed as normal. Even
specifying a list of old shard counts would be doable, but I _really_ doubt
that would be worth it.
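
To make the idea concrete, here is a rough sketch of the delete-then-add flow. It is a toy model, not SolrCloud code: the Cluster class, the shard_for helper, the old_shard_count parameter, and the "CRC32 of the id modulo shard count" routing are all assumptions made up for illustration. Real SolrCloud routing works differently (hash ranges over the unique key), so read this only as a picture of the bookkeeping involved.

```python
import zlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Hypothetical routing rule: stable hash of the id, modulo shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


class Cluster:
    """Toy stand-in for a collection; each shard is a dict of id -> doc."""

    def __init__(self, shard_count: int):
        self.shards = [{} for _ in range(shard_count)]

    def grow(self, new_count: int) -> None:
        # Append empty shards; existing docs stay where they are.
        self.shards.extend({} for _ in range(new_count - len(self.shards)))

    def add(self, doc: dict, old_shard_count=None) -> int:
        doc_id = doc["id"]
        if old_shard_count is not None:
            # Under-the-covers delete from the shard the doc lived on
            # when the collection had old_shard_count shards.
            old = shard_for(doc_id, old_shard_count)
            self.shards[old].pop(doc_id, None)
        # Normal add, routed using the current shard count.
        new = shard_for(doc_id, len(self.shards))
        self.shards[new][doc_id] = doc
        return new


cluster = Cluster(2)
for i in range(20):
    cluster.add({"id": f"doc-{i}"})

cluster.grow(3)
for i in range(20):
    # Reindex at leisure, telling the cluster the old shard count.
    cluster.add({"id": f"doc-{i}"}, old_shard_count=2)
```

After the second loop every document lives on exactly one shard, the one chosen by the three-shard routing. Without old_shard_count, any document whose target shard changed would be left behind as a stale duplicate on its old shard.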

This would allow the number of shards to be expanded and populated over
however long it took to reindex all the old data.

Not nearly as efficient as shard splitting, and there are obvious gotchas
around the term frequencies "for a while" on the new shards until they get
enough documents to bring the stats up to par with older shards. Not to
mention the fact that if someone did all this and messed up, their shards
would be all screwed up. And each add would generate a corresponding delete
query, so there would be a brief period during which the doc wouldn't be
available.

But maybe it's enough simpler to do that it would serve as a bridge. OTOH, if
the actual shard splitting happens in the near-enough future, this is not
worth the effort for sure.

Maybe some kind of config state instead that caused something like this to
happen automatically? The win here is not having to stage an entirely new
set of shards and repopulate that completely before switching over.

Mostly I'm throwing this out to see if it sparks any better ideas....
