Hi Alan,
If you did place the count on the top line (to create a properly
sized hash table), then perhaps the only remaining speedup is to
change hunspell to mmap a file containing the previously built
hash table, similar to what ispell does.
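A minimal sketch of what that could look like, using POSIX mmap. The file name and the idea of an on-disk prebuilt table are hypothetical here; hunspell has no such format today, this is just the shape of the technique:

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an already-built hash table file into memory in one step,
// instead of rebuilding the table word by word at startup.
// Returns a read-only pointer to the file contents, or nullptr.
const void* map_table(const char* path, size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return nullptr;
    *len_out = (size_t)st.st_size;
    return p;
}
```

The kernel pages the table in lazily, so startup cost becomes nearly independent of dictionary size; the catch, as noted above, is that the on-disk layout is architecture-dependent.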
The only real problem is that binary formats like that have
endianness issues across architectures, which makes things quite
difficult. That is why, with MySpell, I decided to build the
hash table on the fly, so to speak. There are no binary
compatibility issues that way.
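To illustrate the endianness problem: a 32-bit field written on a little-endian machine reads back byte-swapped on a big-endian one, so any reader of such a binary table would need a magic word to detect the file's byte order and swap every field. A sketch (the magic value is made up):

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value.
uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

const uint32_t kMagic = 0x48554e53u;  // "HUNS" -- a made-up magic word

// Interpret a count field from the file, given the magic word as it
// was read. If the magic arrives byte-swapped, the file was written
// on a machine of the opposite endianness, so every field must be
// swapped too.
uint32_t read_count(uint32_t magic_as_read, uint32_t raw_count) {
    if (magic_as_read == kMagic) return raw_count;  // same byte order
    if (swap32(magic_as_read) == kMagic) return swap32(raw_count);
    return 0;  // not our format
}
```

Every integer field needs this treatment, and in-file offsets have to replace pointers entirely, which is exactly the complication that makes an on-the-fly build attractive.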
Another source of delay when starting the spell-checker is when
the user has enabled the "check word in all languages" option but
doesn't realize that they have a large number of dictionaries that
must all be loaded when the first misspelled word is checked.
Obviously, available memory is an issue when creating hash tables
from large .dic files. How much memory do you have available on
your machine?
Kevin
On May 1, 2007, at 1:08 PM, Alan Yaniger wrote:
Eleonora,
Yes, I used a different dictionary than yours. The hu_HU.dic I used
has 96,461 lines. Apparently the Hungarian dictionary available
through DicOO isn't the latest.
Perhaps your hardware is faster than mine. On my slower(?)
hardware, I see a significant difference between building the hash
table for large dictionaries and for smaller ones. Many users have
complained about OOo "getting stuck" while the dictionaries load,
so I think it would be useful if the Hunspell developers could
improve performance here.
Alan
ge wrote:
Alan,
The size of the second Hungarian dictionary is:
   lines    words  characters
   22068   124931      622546  hu_HU.aff
  873355   873348    26481165  hu_HU.dic
  895423   998279    27103711  total
The .dic contains 873,378 words; it is 8 times larger than the
Hebrew one. The .aff is roughly twice as big as the Hebrew one.
I assume you used the first Hungarian one, with the small word
count, for your test. I use the second all the time, and it loads
in less than 1 second for me. Therefore I do not understand the
effect you describe.
-eleonora
Hi Marcin, Janis, Eleanora,
I did some debugging in the hunspell code, and found that the
size of the Hebrew dictionaries was the cause of the delay,
similar to Janis's problem in Latvian. The files are read line by
line, and he_IL.dic has 329,326 entries, which is far more than
the other dictionaries I tried.
The main bottleneck was not in reading the files from the disk,
but in building the hash tables in add_word() in hashmgr.cxx.
When I shortened he_IL.dic to the size of the Hungarian
dictionary, it took the same amount of time to load Hebrew and
Hungarian. Same with Hebrew and English (US).
To Hunspell developers out there: is there any way to make the
building
of the hash tables more efficient?
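One concrete lever, echoing Kevin's point above about the count on the top line: the first line of a .dic file lets the loader size the table once, up front, so inserting hundreds of thousands of entries never triggers a rehash. A sketch using std::unordered_map in place of hunspell's own table (the helper name is mine, not hunspell's):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <unordered_map>

// Load a .dic-style file whose first line is the entry count.
// reserve() makes one big allocation instead of letting the table
// grow and rehash repeatedly as words are added.
size_t load_dic(const std::string& path,
                std::unordered_map<std::string, int>& table) {
    std::ifstream in(path);
    std::string line;
    if (!std::getline(in, line)) return 0;
    size_t declared = std::strtoul(line.c_str(), nullptr, 10);
    table.reserve(declared);  // pre-size: no rehashing during the loop
    size_t loaded = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        // the real parser also handles the affix flags after '/'
        table.emplace(line.substr(0, line.find('/')), 1);
        ++loaded;
    }
    return loaded;
}
```

If the count line is missing or wrong the code still works; it just falls back to the usual grow-and-rehash behaviour, which is the slow path being discussed.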
Alan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: dev-[EMAIL PROTECTED]
---------------------------------------------------------------------