Unfortunately, I cannot assume SolrCloud, because our software predates Solr.
So I would either need to switch to Solr or implement a workaround for the lack of index migration. I am reluctant to switch to Solr because it increases operational complexity.

I understand the argument: if the algorithm fₙ() used to derive the index data iₙ from the raw data rₙ changes [iₙ = fₙ(rₙ)], the index data iₙ₊₁ may not be derivable from iₙ [∃n ∄g : iₙ₊₁ = g(iₙ)].

At the application level, one could store the non-tokenized content (I guess that's why ElasticSearch has .raw fields), traverse the index (I already have index traversal code that I use for garbage collection of old entries), and use the non-tokenized content to build a new index. The progress of the conversion could then be recorded as the index into LeafReader.getLiveDocs().
https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--

Alternatively, since I do not have all the non-tokenized content in the index now, I could use the external document id to retrieve the original document text.

Is there a convenient place to store the getLiveDocs index across process interruptions? Or should I use something stupid like a file to store the counter?

That is still a lot of hassle, but I understand why it makes sense for Lucene to consider that index migration should be handled up the stack.

> On 21 Jun 2019, at 18:06, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Assuming SolrCloud, reindex from scratch into a new collection then use
> collection aliasing when you are ready to switch. You don’t need to stop
> your clients when you use CREATEALIAS.
>
> Prior to writing the marker, Lucene would appear to work with older indexes,
> but there would be subtle errors because the information needed to score docs
> just wasn’t there.
>
> Here are two quotes from people who know that crystalized the problem Lucene
> faces for me:
>
> From Robert Muir:
>
> “I think the key issue here is Lucene is an index not a database.
> Because it is a lossy index and does not retain all of the user's data, its
> not possible to safely migrate some things automagically. In the norms case
> IndexWriter needs to re-analyze the text ("re-index") and compute stats to
> get back the value, so it can be re-encoded. The function is y = f(x) and if
> x is not available its not possible, so lucene can't do it.”
>
> From Mike McCandless:
>
> “This really is the difference between an index and a database: we do not
> store, precisely, the original documents. We store an efficient
> derived/computed index from them. Yes, Solr/ES can add database-like
> behavior where they hold the true original source of the document and use
> that to rebuild Lucene indices over time. But Lucene really is just a
> "search index" and we need to be free to make important improvements with
> time.”
>
> Best,
> Erick
>
>> On Jun 21, 2019, at 7:10 AM, David Allouche <da...@allouche.net> wrote:
>>
>> Wow. That is annoying. What is the reason for this?
>>
>> I assumed there was a smooth upgrade path, but apparently, by design, one
>> has to rebuild the index at least once every two major releases.
>>
>> So, my question becomes, what is the recommended way of dealing with
>> reindex-from-scratch without service interruption?
>>
>> So I guess the upgrade path looks something like:
>> - Create Lucene6 index
>> - Update Lucene6 index
>> - Create Lucene7 index
>> - Separately keep track of which documents are indexed in Lucene7 and
>>   Lucene6 indexes
>> - Make updates to Lucene6 index, concurrently build Lucene7 index from
>>   scratch, use Lucene6 index for search.
>> - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7
>>   index for search.
>>
>> Rinse and repeat every major version.
>>
>> Really, isn't there something simpler already to handle Lucene major version
>> upgrades?
>>
>>
>>> On 17 Jun 2019, at 18:04, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> Let’s back up a bit.
>>> What version of Lucene are you using? Starting with Lucene 8, any index
>>> that’s ever been touched by Lucene 6 will not open. It does not matter if
>>> the index has been completely rewritten. It does not matter if it’s been
>>> run through IndexUpgraderTool, which just does a forceMerge to 1 segment.
>>> A marker is preserved when a segment is created, and the earliest one is
>>> preserved across merges. So say you have two segments, one created with 6
>>> and one with 7. The Lucene 6 marker is preserved when they are merged.
>>>
>>> Now, if any segment has the Lucene 6 marker, the index will not be opened
>>> by Lucene.
>>>
>>> If you’re using Lucene 7, then this error implies that one or more of your
>>> segments was created with Lucene 5 or earlier.
>>>
>>> So you probably need to re-index from scratch on whatever version of Lucene
>>> you want to use.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Jun 17, 2019, at 8:41 AM, David Allouche <da...@allouche.net> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I use Lucene with PyLucene on a public-facing web application. We have a
>>>> moderately large index (~24M documents, ~11GB index data), with a constant
>>>> stream of new documents.
>>>>
>>>> I recently upgraded to PyLucene 7.
>>>>
>>>> When trying to test the new release of PyLucene 8, I encountered an
>>>> IndexFormatTooOld error because my index conversion from Lucene6 to
>>>> Lucene7 was not complete.
>>>>
>>>> I found IndexUpgrader, and I had a look at its implementation. I would
>>>> very much like to avoid taking down the service during the index upgrade,
>>>> so I believe I cannot use IndexUpgrader because I need the write lock to
>>>> be held by the web application to index new documents.
>>>>
>>>> So I figure I could get the desired result with an
>>>> IndexWriter.forceMerge(1).
>>>> But the documentation says "This is a horribly
>>>> costly operation, especially when you pass a small maxNumSegments; usually
>>>> you should only call this if the index is static (will no longer be
>>>> changed)."
>>>> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-
>>>>
>>>> And indeed, forceMerge tends to be killed by the kernel OOM killer on my
>>>> development VM. I want to avoid this failure mode in production. I could
>>>> increase the VM's memory until it works, but I would rather have a less
>>>> brutal approach to upgrading a live index. Something that could run in the
>>>> background with reasonable amounts of anonymous memory.
>>>>
>>>> What is the recommended approach to upgrading a live index?
>>>>
>>>> How can I know from the code that the index needs upgrading at all? I
>>>> could add a manual knob to start an upgrade, but it would be better if it
>>>> occurred transparently when I upgrade PyLucene.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
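
On the question of where to keep the traversal position across process interruptions: a plain file is probably adequate, provided it is written atomically. Below is a minimal sketch in Python, assuming the position is a simple counter. The names save_checkpoint, load_checkpoint, migrate and the reindex callback are hypothetical and not part of Lucene or PyLucene, and the in-memory list of documents only stands in for the real index traversal.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it over
    # the target, so a crash never leaves a half-written checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_position": 0}


def migrate(documents, checkpoint_path, reindex, batch_size=1000):
    """Feed documents to `reindex`, resuming from the last saved position.

    `documents` is a placeholder sequence for the real traversal and
    `reindex` is a hypothetical callback that adds one document to the
    new index.
    """
    state = load_checkpoint(checkpoint_path)
    for position in range(state["next_position"], len(documents)):
        reindex(documents[position])
        # Persist progress every `batch_size` documents, not on each one.
        if (position + 1) % batch_size == 0:
            save_checkpoint(checkpoint_path, {"next_position": position + 1})
    save_checkpoint(checkpoint_path, {"next_position": len(documents)})
```

Because a crash can happen between two checkpoints, a few documents are reindexed twice after a restart, so the reindex step would need to be idempotent, for example by going through IndexWriter.updateDocument() keyed on the external document id. Also note that Lucene doc ids are only stable for a given open reader and can change after merges, so a position expressed as a doc id is only meaningful against the same commit point; checkpointing the last external document id instead may be more robust.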