Re: Best strategy migrate indexes

Trejkaz Mon, 07 Nov 2022 03:18:43 -0800

The process itself sounds like it should work (it's basically a
reindex so it should be safer than trying to migrate directly.)
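
One caveat about the "iterate over reader.numDocs()" step quoted below,
in case the source index contains deletions: doc IDs run from 0 to
maxDoc() - 1, and numDocs() only counts the live ones, so a loop bounded
by numDocs() can copy deleted documents and miss the tail of the index.
Here is a minimal sketch of the safer loop shape. It uses only the JDK;
the BitSet is my stand-in for Lucene's live-docs bits (which I believe
you get from MultiFields.getLiveDocs(reader) in Lucene 5), and the class
and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocIteration {

    /**
     * Returns the doc IDs that should be copied: every ID in [0, maxDoc)
     * whose live bit is set. A null liveDocs means "no deletions",
     * mirroring the convention Lucene uses for its live-docs bits.
     */
    public static List<Integer> liveDocIds(int maxDoc, BitSet liveDocs) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (liveDocs == null || liveDocs.get(docId)) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical index: 5 doc IDs, doc 2 deleted, so numDocs would be 4.
        int maxDoc = 5;
        BitSet liveDocs = new BitSet(maxDoc);
        liveDocs.set(0, maxDoc);
        liveDocs.clear(2);

        // A loop bounded by numDocs (4) would visit 0..3: it would include
        // the deleted doc 2 and never reach doc 4.
        System.out.println(liveDocIds(maxDoc, liveDocs)); // prints [0, 1, 3, 4]
    }
}
```

In the real tool the loop body would load and convert the document for
each returned ID instead of collecting the IDs into a list.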


I would check that the version of Lucene bundled with Luke exactly
matches the version of Lucene your tool writes with - if the two
match, it shouldn't be possible to get an error like this. Matching
the major version alone is not enough.
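
If you want to see what was actually written without going through
Luke, you can peek at the header of the segments_N file. The sketch
below uses only the JDK and rests on my understanding of the codec
header layout (big-endian magic 0x3fd76c17, then the codec name as a
vInt-prefixed UTF-8 string, then a big-endian format version); treat
that layout, and the class and method names, as assumptions to verify
against the Lucene version you are using. The "9 (needs to be between 6
and 7)" in your error would be this format version, which suggests the
Lucene inside Luke is an older 7.x than the one that wrote the index.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SegmentsHeaderPeek {

    // Assumed value of Lucene's codec header magic number.
    static final int CODEC_MAGIC = 0x3fd76c17;

    /** Reads a codec header from the stream and returns its format version. */
    public static int readFormatVersion(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int magic = in.readInt(); // DataInputStream reads big-endian ints
        if (magic != CODEC_MAGIC) {
            throw new IOException("Not a codec header, magic=0x" + Integer.toHexString(magic));
        }
        byte[] name = new byte[readVInt(in)];
        in.readFully(name);
        String codec = new String(name, StandardCharsets.UTF_8);
        if (!"segments".equals(codec)) {
            throw new IOException("Unexpected codec name: " + codec);
        }
        return in.readInt();
    }

    /** Decodes a variable-length int: 7 bits per byte, low-order group first. */
    static int readVInt(DataInputStream in) throws IOException {
        int value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("Malformed vInt");
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a segments_N file inside the index directory
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("segments format version: " + readFormatVersion(in));
        }
    }
}
```

Run it against the segments_1 file from your migrated index and compare
the number with the range Luke reports.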

TX


On Mon, 7 Nov 2022 at 22:09, Pablo Vázquez Blázquez <[email protected]> wrote:
>
> Hi!
>
> > I am trying to create a tool to read docs from a lucene5 index and
> > generate lucene9 documents from them (with docValues). That might work,
> > right? I am shading both lucene5 and lucene9 to avoid package conflicts.
>
> I am doing the following steps:
>
> - create IndexReader with lucene5 package over a lucene5 index
> - create IndexWriter with lucene7 package
> - iterate over reader.numDocs() to process each Document (lucene5)
>     - convert each Document (lucene5) to lucene7 Document
>         - for each IndexableField (lucene5) in the Document (lucene5),
>           convert it to an IndexableField (lucene7)
>             - create a SortedDocValuesField (lucene7) and add it to the
>               Document (lucene7)
>             - add the field to the Document (lucene7)
>     - add each converted Document to the writer
> - close IndexReader and IndexWriter
>
> When I open the resulting migrated lucene7 index with Luke, I get an error:
> org.apache.lucene.index.IndexFormatTooNewException: Format version is not
> supported (resource
> BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
> 9 (needs to be between 6 and 7)
>
> When I use the tool "luceneupgrader"
> <https://github.com/hakanai/luceneupgrader>, I get:
> java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info tests_small_index-7.x-migrator
> Lucene index version: 7
>
> What am I doing wrong, or what am I misunderstanding?
>
> Thanks!
>
> On Wed, 2 Nov 2022 at 21:13, Pablo Vázquez Blázquez (<[email protected]>)
> wrote:
>
> > Hi,
> >
> > Luckily we were already using lucenemigrator
> >
> >
> > What do you mean by "lucenemigrator"? Is it a public tool?
> >
> > I am trying to create a tool to read docs from a lucene5 index and
> > generate lucene9 documents from them (with docValues). That might work,
> > right? I am shading both lucene5 and lucene9 to avoid package conflicts.
> >
> > Thanks!
> >
> > On Tue, 1 Nov 2022 at 0:35, Trejkaz (<[email protected]>) wrote:
> >
> >> Well...
> >>
> >> There's a way, but I wouldn't necessarily recommend it.
> >>
> >> You can write custom migration code against some version of Lucene
> >> which supports doc values, to create doc values fields. It's going to
> >> involve writing a FilterCodecReader which wraps your real index and
> >> then pretends to also have doc values, which you'll build in a custom
> >> class which works similarly to UninvertingReader. Then you pass those
> >> CodecReaders to IndexWriter.addIndexes to create a new index which
> >> really has those doc values.
> >>
> >> We did that ourselves when we had the same issue. The only painful
> >> thing about it is having to keep around older versions of lucene to do
> >> that migration. Forever. Luckily we were already using lucenemigrator,
> >> which has the older versions baked into it with package prefixes. So
> >> that library will get fatter and fatter over time but at least our own
> >> code only gets fatter at the rate migrations are added.
> >>
> >> The same approach works for any other kind of ad-hoc migration you
> >> might want to perform. e.g., you might want to create points. Or
> >> remove an index for a field. Or add an index for a field.
> >>
> >> TX
> >>
> >>
> >> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez <[email protected]>
> >> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Thank you all for your responses.
> >> >
> >> > So, when updating to a newer (major) Lucene version that modifies its
> >> > codecs, there is no way to ensure everything keeps working properly
> >> > without re-indexing, right?
> >> >
> >> > Apart from not having some of the original sources that were indexed
> >> > (which I will try to solve by using the *IndexUpgrader* tool), I have
> >> > another problem: I was using the
> >> > org.apache.lucene.uninverting.UninvertingReader to perform queries
> >> > against the index, mainly using the grouping API. But it has since
> >> > been removed (since Lucene 7.0). So, again, do I have any alternative
> >> > to re-indexing in order to use docValues?
> >> >
> >> > To give you more context, I am a developer of a tool that multiple
> >> > customers can use to index their data (currently, with Lucene 5.5.5). We
> >> > are planning to upgrade to Lucene 9 (because of some vulnerabilities
> >> > affecting Lucene 5.5.5) and I think asking them to reindex will not go
> >> > down well :(
> >> >
> >> > Regards,
> >> >
> >> > On Sat, 29 Oct 2022 at 23:31, Matt Davis (<[email protected]>)
> >> > wrote:
> >> >
> >> > > Inside the Zulia search engine, the object being indexed is always a
> >> > > JSON/BSON object and we store the BSON as a stored byte field in the
> >> > > index.  This allows easy internal reindexing when the searchable
> >> > > fields change, and also allows us to update to the latest Lucene
> >> > > version.  Combined with lucene-backward-codecs, an index older than
> >> > > the current major version can be opened and reindexed.  If you have
> >> > > stored all the fields (or a JSON/BSON) in the index, it would be easy
> >> > > to reindex in the new format.  If you have not, maybe opening with
> >> > > lucene-backward-codecs will be enough for your use case.
> >> > >
> >> > > Thanks,
> >> > > Matt
> >> > >
> >> > > On Sat, Oct 29, 2022 at 2:30 PM Baris Kazar <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > It is always good practice to retain the original (non-indexed)
> >> > > > data. Whenever Lucene changes version, even a minor version,
> >> > > > I always reindex.
> >> > > >
> >> > > > Best regards
> >> > > > ________________________________
> >> > > > From: Gus Heck <[email protected]>
> >> > > > Sent: Saturday, October 29, 2022 2:17 PM
> >> > > > To: [email protected] <[email protected]>
> >> > > > Subject: Re: Best strategy migrate indexes
> >> > > >
> >> > > > Hi Pablo,
> >> > > >
> >> > > > The deafening silence is probably nobody wanting to give you the bad
> >> > > > news. You are on a mission that may not be feasible, and even if you
> >> > > > can get it to "work", the end result won't likely be equivalent to
> >> > > > indexing the original data with Lucene 9.x. The indexing process is
> >> > > > fundamentally lossy and information originally used to produce
> >> > > > non-stored fields will have been thrown out. A simple example is
> >> > > > things like stopwords or anything analyzed with subclasses of
> >> > > > FilteringTokenFilter. If the stop word list changed, or the details
> >> > > > of one of these filters changed (bugfix?), you will end up with a
> >> > > > different result than indexing with 9.x. This is just one example;
> >> > > > another would be stemming, where the index likely only contains the
> >> > > > stem, not the whole word. Other folks who are more interested in the
> >> > > > details of our codecs than I am can probably provide further examples
> >> > > > on a more fundamental level. Lucene is not a database, and the source
> >> > > > documents should always be retained in a form that can be reindexed.
> >> > > > If you have inherited a system where source material has not been
> >> > > > retained, you have a difficult project and may have some potentially
> >> > > > painful expectation setting to perform.
> >> > > >
> >> > > > Best,
> >> > > > Gus
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Fri, Oct 28, 2022 at 8:01 AM Pablo Vázquez Blázquez
> >> > > > <[email protected]> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I have some indices indexed with Lucene 5.5.0. I have updated my
> >> > > > > dependencies and code to Lucene 7 (but my final goal is to use
> >> > > > > Lucene 9) and when trying to work with them I get this exception:
> >> > > > > org.apache.lucene.index.IndexFormatTooOldException: Format version
> >> > > > > is not supported (resource
> >> > > > > BufferedChecksumIndexInput(MMapIndexInput(path=".......\tests\segments_b"))):
> >> > > > > this index is too old (version: 5.5.0). This version of Lucene only
> >> > > > > supports indexes created with release 6.0 and later.
> >> > > > >
> >> > > > > I want to migrate from Lucene 5.x to Lucene 9.x. Which is the best
> >> > > > > strategy? Is there any tool to migrate the indices? Is it mandatory
> >> > > > > to reindex? In this case, how can I deal with this when I do not
> >> > > > > have the sources of the documents that generated my current indices
> >> > > > > (I mean, I just have the indices themselves)?
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > --
> >> > > > > Pablo Vázquez
> >> > > > > ([email protected])
> >> > > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > >
> >> > > >
> >> > >
> >> https://urldefense.com/v3/__http://www.needhamsoftware.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuio4iIYARA$
> >> > > >  (work)
> >> > > >
> >> > > >
> >> > >
> >> https://urldefense.com/v3/__http://www.the111shift.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuirxfFWpEQ$
> >> > > >  (play)
> >> > > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Pablo Vázquez
> >> > ([email protected])
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
> > --
> > Pablo Vázquez
> > ([email protected])
> >
>
>
> --
> Pablo Vázquez
> ([email protected])

