Can you try caching the word segmentation results? That would be the easier approach.
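Lucene wiring aside, the caching idea can be sketched as a map from each distinct text to its token list: analyze once, replay on every repeat. This is a minimal illustration, not Lucene API; the class and method names are made up, and a real version would capture the analyzer's term, offset, and position-increment attributes into the cached records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a token cache: tokenize each distinct text once,
// store the resulting tokens, and replay them for repeated texts.
// In a real Lucene setup, analyze() would run the actual Analyzer and
// capture CharTermAttribute / OffsetAttribute values per token; the
// whitespace split here is only a stand-in.
public class TokenCache {
    // A cached token: term text plus start/end character offsets.
    public record Token(String term, int start, int end) {}

    private final Map<String, List<Token>> cache = new HashMap<>();

    // Stand-in for running the analyzer: a simple whitespace split.
    private List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        for (String part : text.split("\\s+")) {
            if (part.isEmpty()) continue;
            int start = text.indexOf(part, pos);
            tokens.add(new Token(part, start, start + part.length()));
            pos = start + part.length();
        }
        return tokens;
    }

    // Return the cached tokens if this exact text was seen before,
    // otherwise analyze it and remember the result.
    public List<Token> tokens(String text) {
        return cache.computeIfAbsent(text, this::analyze);
    }

    public int size() {
        return cache.size();
    }
}
```

On the Lucene side, if I remember correctly, a pre-built TokenStream can be handed straight to a field (e.g. the TextField(String, TokenStream) constructor), which sidesteps question 3: rather than initializing an analyzer with the stream, you index the replayed stream directly.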
At 2021-11-22 16:40:42, "Omri" <omri.sui...@clearmash.com> wrote:
>We are indexing a lot of similar texts using Lucene analyzers.
>From our performance tests we see that the analysis (converting the text
>into a TokenStream object) is taking more time than we want.
>Before digging into the analysis code, I was thinking about caching the
>analysis result, since we have many repeated texts that we index at
>different times.
>The basic idea is to serialize the TokenStream and store it in a DB. When we
>encounter the same text, we load it and initialize an analyzer with the
>loaded TokenStream.
>In this context:
>1 - Is it "safe" to serialize the TokenStream?
>2 - Is there existing code that already serializes a TokenStream?
>3 - How do we initialize an existing analyzer with a TokenStream?
>
>Thanks!
>
>Best,
>Omri