Re: [Libreoffice] Should the Thesaurus/mythes use a precomputed index (installer file size)

Michael Meeks Mon, 31 Jan 2011 07:17:55 -0800

Hi Steve,

On Sat, 2011-01-29 at 21:45 +1000, Steve Butler wrote:
> I haven't had a look at this yet as I thought getting a script to
> analyze the existing thesaurus files would be helpful to get those
> errors looked at.


        Nice work with that :-)

> I thought I would discuss your idea about not using the index at all
> to see what reception it gets, but I think you may also have been
> suggesting a similar thing: are the index files even useful on modern gear?

        I suspect the index files are mostly useless (personally).

> I can populate the en_US index in memory from the .dat file with the
> C++ code in 0.287 s after dropping all cache, and 0.188s when the
> cache is hot.

        Sure. Since this only happens in response to user input, I suspect we
can afford to take a second to parse the thesaurus. We have around 20MB of
text to load for en_US; perhaps 32MB is a reasonable upper bound. It does
seem a lot to parse so quickly.

> I do admit that my desktop is pretty quick though, with 4 cores, SATA
> II drives etc.

        Sure - but it will only use one of these ;-)

> If the thesaurus is only loaded when the user pops it up, then
> couldn't mythes be taught to generate its own in-memory index
> from the dictionary and not bother with an index file at all?

        Right. I think we could easily serialize a small skip-list to disk too:
if we simply store ~8 or ~32 or so indexes into the data and pop that list in
our home directory, we only need to parse a fraction of the data on each
lookup. We could also drop the MyThes code as a dependency to manage.
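
        A rough sketch of the sparse-index idea, not a definitive
implementation. It assumes the usual MyThes .dat layout as I understand it
(an encoding line, then "word|count" entries each followed by count meaning
lines) and that entries are sorted byte-wise by headword. SamplePoint,
buildSparseIndex and lookup are hypothetical names, not part of MyThes or
lingucomponent, and writing the sample list to disk is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One sample point: a headword and the byte offset of its "word|count" line.
struct SamplePoint {
    std::string word;
    std::streamoff offset;
};

// Reads one "word|count" header line; false at end of input or if malformed.
static bool readHeader(std::istream& in, std::string& word, std::size_t& count) {
    std::string line;
    if (!std::getline(in, line)) return false;
    const std::size_t bar = line.find('|');
    if (bar == std::string::npos) return false;
    word = line.substr(0, bar);
    try { count = std::stoul(line.substr(bar + 1)); } catch (...) { return false; }
    return true;
}

// One full pass over the data, remembering only every stride-th entry.
std::vector<SamplePoint> buildSparseIndex(std::istream& in, std::size_t stride) {
    std::vector<SamplePoint> samples;
    if (stride == 0) stride = 1;
    std::string line;
    std::getline(in, line);  // skip the encoding line
    for (std::size_t entry = 0;; ++entry) {
        const std::streamoff pos = in.tellg();
        std::string word;
        std::size_t count = 0;
        if (!readHeader(in, word, count)) break;
        if (entry % stride == 0) samples.push_back({word, pos});
        for (std::size_t i = 0; i < count && std::getline(in, line); ++i) {}
    }
    return samples;
}

// Seeks to the last sample at or before target, then scans forward.
// Returns the raw meaning lines, e.g. "(noun)|syn1|syn2"; empty if not found.
std::vector<std::string> lookup(std::istream& in,
                                const std::vector<SamplePoint>& samples,
                                const std::string& target) {
    std::vector<std::string> meanings;
    auto it = std::upper_bound(
        samples.begin(), samples.end(), target,
        [](const std::string& t, const SamplePoint& s) { return t < s.word; });
    if (it == samples.begin()) return meanings;  // sorts before the first sample
    --it;
    in.clear();  // reset any end-of-file state left by an earlier scan
    in.seekg(it->offset, std::ios::beg);
    std::string word, line;
    std::size_t count = 0;
    while (readHeader(in, word, count)) {
        if (word > target) break;  // sorted data: target is not present
        for (std::size_t i = 0; i < count && std::getline(in, line); ++i)
            if (word == target) meanings.push_back(line);
        if (word == target) break;
    }
    return meanings;
}
```

With a stride chosen so that only a few dozen samples are kept, each lookup
parses at most one stride's worth of entries rather than the whole file.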

        The code using it is in:

        lingucomponent/source/thesaurus/libnth/nthesimp.cxx

> BTW, if I did that I'd probably do some major surgery on mythes and
> just use STL because it basically is doing C style memory management
> and processing and I think I would screw it up if I started messing
> with it.  The only problem with simplifying it with STL constructs is
> that I would want to change the interface (string vs char *), maybe
> use STL vectors for the list of synonyms, etc.

        Heh; sure.
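
        For illustration only, an STL-flavoured in-memory index along those
lines might look roughly like this. Meaning, Thesaurus and loadThesaurus are
hypothetical names, not the MyThes interface, and the sketch assumes the same
.dat layout (encoding line, then "word|count" entries followed by
"(pos)|syn1|syn2|..." lines). Character-set conversion is ignored.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One meaning of a headword: a part-of-speech tag plus its synonyms.
struct Meaning {
    std::string partOfSpeech;           // e.g. "(noun)"
    std::vector<std::string> synonyms;
};

// Headword -> list of meanings, owned entirely by STL containers.
using Thesaurus = std::map<std::string, std::vector<Meaning>>;

// Splits a line on '|'; always returns at least one field.
static std::vector<std::string> splitOnBar(const std::string& line) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t bar = line.find('|', start);
        if (bar == std::string::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, bar - start));
        start = bar + 1;
    }
    return fields;
}

// Parses the whole .dat stream into memory; stops at the first malformed header.
Thesaurus loadThesaurus(std::istream& in) {
    Thesaurus thes;
    std::string line;
    std::getline(in, line);  // skip the encoding line
    while (std::getline(in, line)) {
        const std::vector<std::string> header = splitOnBar(line);
        if (header.size() != 2) break;
        std::size_t count = 0;
        try { count = std::stoul(header[1]); } catch (...) { break; }
        std::vector<Meaning>& meanings = thes[header[0]];
        for (std::size_t i = 0; i < count && std::getline(in, line); ++i) {
            const std::vector<std::string> fields = splitOnBar(line);
            Meaning m;
            m.partOfSpeech = fields.front();
            m.synonyms.assign(fields.begin() + 1, fields.end());
            meanings.push_back(std::move(m));
        }
    }
    return thes;
}
```

A caller would then do something like thes.find(word) and iterate over the
returned vector, with no manual memory management on either side.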

> By this stage it's not looking much like mythes anymore ...

        I guess we could re-write it inside lingucomponent then (?) but we
should probably get a better understanding of how frequently this code is
called first: is it hooked into the spell checking code, or is it
really just Tools->Language->Thesaurus ?

        Thanks !

                Michael.

-- 
 michael.me...@novell.com  <><, Pseudo Engineer, itinerant idiot

_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
