Hi!

On Tue, Dec 18, 2018 at 10:10:51AM +0100, Serge Ballesta via Python-ideas <[email protected]> wrote:
> In a project of mine, I have used the gettext module from the Python
> Standard Library. I have found that several tools could be used to
> generate the Machine Object (mo) file from the source Portable Object
> (po): pybabel ( http://babel.pocoo.org/en/latest/ ), msgfmt.py from
> Python tools or the original msgfmt from GNU gettext.

I use gettext quite extensively. I use Python's msgfmt to generate
.mo files. I also use Django's compilemessages; I don't know what it
uses internally, it could be an independent implementation or Python's
msgfmt.

> I could find that only the original msgfmt was able to generate a
> hashtable, and that anyway the Python gettext module loaded everything
> in memory and did not use it. But I also found a TODO note saying
>
> # TODO:
> # - Lazy loading of .mo files. Currently the entire catalog is loaded into
> #   memory, but that's probably bad for large translated programs. Instead,
> #   the lexical sort of original strings in GNU .mo files should be exploited
> #   to do binary searches and lazy initializations. Or you might want to use
> #   the undocumented double-hash algorithm for .mo files with hash tables, but
> #   you'll need to study the GNU gettext code to do this.
>
> I have studied GNU gettext code and found that implementing the hashing
> algorithm in Python would not be that hard.

That's interesting!

> The undocumented features required for implementation are:
> - the version number can safely stay at 0 when processing Python code
> - the size of the hash table is the first odd prime greater than or
>   equal to 4 * n / 3 where n is the number of strings
> - the first hashing function uses a variant of the PJW hash function
>   described in https://en.wikipedia.org/wiki/PJW_hash_function, where
>   the line h = h & ~high is replaced with h = h ^ high, and using
>   32-bit integers.
>   The index in the table is the result of the function modulo the size
>   of the hash table
> - when there is a conflict (the slot given by the first hashing function
>   is already used by another string) the following is used:
>   - let h be the result of the PJW variant hash function and size be the
>     size of the hash table, an increment value is set to
>     1 + (h % (size - 2))
>   - that increment is repeatedly added to the index in the hash table
>     (modulo the table size) until an empty slot is found (or the correct
>     original string is found)
>
> For now, my (alpha) code is able to generate in pure Python the same mo
> file that GNU msgfmt generates, and use the hashtable to access the
> strings.
>
> Remaining problems:
> - I had to read GPL copyrighted code to find the undocumented features.
>   I have of course written my own code from scratch, but may I use an
>   Apache Free License 2.1 on it?

You should ask a lawyer, and I am not one. But my understanding is
that you can borrow ideas from GPL-protected code without contaminating
your code with the GPL. You cannot copy code; that makes your code GPL'd.

> - the current code for gettext loads everything from the mo file and
>   immediately closes it. My own code keeps the file opened to be able to
>   access it with the mmap module. There could be use cases where the
>   first option is better

There is a third option: open and close the file on each access. I'd
prefer that option, as file descriptors are precious resources limited
in supply. There is a twist though. The file could be replaced while
closed, so you have to find a way to verify whether the file was
replaced and, if so, reread the hash table from it. Perhaps checking the
timestamp of the file (date/time of the last modification) is enough.

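A minimal sketch of that timestamp check, under a few assumptions: the class name ReloadingCatalog and the parse callback are hypothetical and not part of the gettext module, and for simplicity the whole file is reread rather than just the hash table. It compares the file size as well as the modification time, because the timestamp alone may be too coarse on some filesystems.

```python
import os


class ReloadingCatalog:
    """Hypothetical helper: keeps no file descriptor open between accesses."""

    def __init__(self, path, parse=bytes):
        self._path = path
        self._parse = parse      # assumption: turns raw bytes into a catalog
        self._signature = None   # (mtime in ns, size) seen at the last load
        self._data = None

    def _current_signature(self):
        st = os.stat(self._path)
        return (st.st_mtime_ns, st.st_size)

    def get(self):
        """Return the parsed catalog, rereading the file if it has changed."""
        signature = self._current_signature()
        if signature != self._signature:
            # Open, read and close again right away.
            with open(self._path, "rb") as f:
                self._data = self._parse(f.read())
            self._signature = signature
        return self._data
```

The cost is one stat() call per access, which is the price of not holding the descriptor. A lazy variant would reread only the header and the hash table at this point instead of the whole file.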
> - I should either rely on the current way (load everything in memory) or
>   implement a binary search algo for the case where the hash table is
>   not present (it is of course optional)
> - it would be an important change, and I think that an option should
>   allow choosing between eager and lazy access
>
> Before going further, I would like to know whether implementing lazy
> access through the hash table that way seems to be an interesting
> improvement or a dead end.

Well, I must admit my .po/.mo files aren't that big. The biggest .po
is 60k, its corresponding .mo is only 30k bytes. I don't know if using
the hash table would give me any improvement.

Oleg.
--
    Oleg Broytman            https://phdru.name/            [email protected]
           Programmers don't die, they just GOSUB without RETURN.
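
For reference, the hashing and probing scheme quoted earlier in this message could be sketched in Python roughly as follows. It is written only from Serge's description and has not been checked against GNU gettext. All function names are hypothetical. Integer division for 4 * n / 3, a minimum table size of 3, and None as the empty-slot marker are assumptions of this sketch.

```python
def pjw_variant_hash(data: bytes) -> int:
    """PJW variant from the quoted description: XOR instead of AND-NOT."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF   # keep to 32 bits
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
            h ^= high                        # PJW proper would do h &= ~high
    return h


def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def table_size(n_strings: int) -> int:
    """First odd prime >= 4 * n / 3 (integer division assumed)."""
    size = max(3, 4 * n_strings // 3)
    while size % 2 == 0 or not is_prime(size):
        size += 1
    return size


def probe(key: bytes, size: int):
    """Yield the slots to try for key: first index, then fixed increments."""
    h = pjw_variant_hash(key)
    index = h % size
    increment = 1 + h % (size - 2)
    while True:
        yield index
        index = (index + increment) % size


def build_table(keys):
    """Map each slot to the position of a key in keys, or None if empty."""
    size = table_size(len(keys))
    table = [None] * size
    for number, key in enumerate(keys):
        for slot in probe(key, size):
            if table[slot] is None:
                table[slot] = number
                break
    return table


def lookup(table, keys, key):
    """Return the position of key in keys, or None if it is absent."""
    for slot in probe(key, len(table)):
        number = table[slot]
        if number is None:
            return None
        if keys[number] == key:
            return number
```

Because the table size is prime and the increment is never a multiple of it, the probe sequence visits every slot, and because the table is larger than the number of strings, both insertion and lookup always reach an empty slot or the matching string.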
