Hi Alan,
If you did place the count on the top line (to create a properly
sized hash table), then perhaps the only remaining speedup is to
change hunspell to mmap a file containing the previously built
hash table, similar to what ispell does.
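A minimal sketch of what that could look like, using POSIX mmap. The file name and the idea of an on-disk prebuilt table are hypothetical here; hunspell has no such format today, this is just the shape of the technique:

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an already-built hash table file into memory in one step,
// instead of rebuilding the table word by word at startup.
// Returns a read-only pointer to the file contents, or nullptr.
const void* map_table(const char* path, size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return nullptr;
    *len_out = (size_t)st.st_size;
    return p;
}
```

The kernel pages the table in lazily, so startup cost becomes nearly independent of dictionary size; the catch, as noted above, is that the on-disk layout is architecture-dependent.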
The only real problem is that binary formats like that have
endianness issues across architectures, which makes things quite
difficult. That is why, with MySpell, I decided to build the
hash table on the fly, so to speak. There are no binary
compatibility issues that way.
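To illustrate the endianness problem: a 32-bit field written on a little-endian machine reads back byte-swapped on a big-endian one, so any reader of such a binary table would need a magic word to detect the file's byte order and swap every field. A sketch (the magic value is made up):

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value.
uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

const uint32_t kMagic = 0x48554e53u;  // "HUNS" -- a made-up magic word

// Interpret a count field from the file, given the magic word as it
// was read. If the magic arrives byte-swapped, the file was written
// on a machine of the opposite endianness, so every field must be
// swapped too.
uint32_t read_count(uint32_t magic_as_read, uint32_t raw_count) {
    if (magic_as_read == kMagic) return raw_count;  // same byte order
    if (swap32(magic_as_read) == kMagic) return swap32(raw_count);
    return 0;  // not our format
}
```

Every integer field needs this treatment, and in-file offsets have to replace pointers entirely, which is exactly the complication that makes an on-the-fly build attractive.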
Another source of delay when starting the spell-checker is when
the user has enabled the "check word in all languages" option but
doesn't realize that they have a large number of dictionaries that
must all be loaded when the first misspelled word is checked.
Obviously, available memory is an issue when creating hash tables
from large .dic files. How much memory do you have available on
your machine?
Kevin
On May 1, 2007, at 1:08 PM, Alan Yaniger wrote:
Eleonora,
Yes, I used a different dictionary than yours. The hu_HU.dic I used
has 96,461 lines. Apparently the Hungarian dictionary available
through DicOO isn't the latest.
Perhaps your hardware is faster than mine. On my slower(?)
hardware, I see a significant difference between building the hash
table for large dictionaries and for smaller ones. Many users have
complained about OOo "getting stuck" while the dictionaries load,
so I think it would be useful if the Hunspell developers could
improve performance here.
Alan
ge wrote:
Alan,
The size of the second Hungarian dictionary is:
   lines    words  characters
   22068   124931      622546  hu_HU.aff
  873355   873348    26481165  hu_HU.dic
  895423   998279    27103711  total
The .dic contains 873,378 words; it is 8 times larger than the
Hebrew one. The .aff is roughly twice as big as the Hebrew one.
I assume you used the first Hungarian one, with the small word
count, for your test. I use the second all the time, and it loads
in less than 1 second for me. Therefore I do not understand the
effect you describe.
-eleonora
Hi Marcin, Janis, Eleanora,
I did some debugging in the hunspell code, and found that the
size of the Hebrew dictionaries was the cause of the delay,
similar to Janis's problem in Latvian. The files are read line by
line, and he_IL.dic has 329,326 entries, which is far more than
the other dictionaries I tried.
The main bottleneck was not in reading the files from the disk,
but in building the hash tables in add_word() in hashmgr.cxx.
When I shortened he_IL.dic to the size of the Hungarian
dictionary, it took the same amount of time to load Hebrew and
Hungarian. Same with Hebrew and English (US).
To Hunspell developers out there: is there any way to make the
building
of the hash tables more efficient?
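One concrete lever, echoing Kevin's point above about the count on the top line: the first line of a .dic file lets the loader size the table once, up front, so inserting hundreds of thousands of entries never triggers a rehash. A sketch using std::unordered_map in place of hunspell's own table (the helper name is mine, not hunspell's):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>
#include <unordered_map>

// Load a .dic-style file whose first line is the entry count.
// reserve() makes one big allocation instead of letting the table
// grow and rehash repeatedly as words are added.
size_t load_dic(const std::string& path,
                std::unordered_map<std::string, int>& table) {
    std::ifstream in(path);
    std::string line;
    if (!std::getline(in, line)) return 0;
    size_t declared = std::strtoul(line.c_str(), nullptr, 10);
    table.reserve(declared);  // pre-size: no rehashing during the loop
    size_t loaded = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        // the real parser also handles the affix flags after '/'
        table.emplace(line.substr(0, line.find('/')), 1);
        ++loaded;
    }
    return loaded;
}
```

If the count line is missing or wrong the code still works; it just falls back to the usual grow-and-rehash behaviour, which is the slow path being discussed.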
Alan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: dev-[EMAIL PROTECTED]
---------------------------------------------------------------------