Hi!

On Tue, Dec 18, 2018 at 10:10:51AM +0100, Serge Ballesta via Python-ideas 
<[email protected]> wrote:
> In a project of mine, I have used the gettext module from Python Standard 
> Library. I have found that several tools could be used to generate the 
> Machine Object (mo) file from the source Portable Object (po): pybabel ( 
> http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the 
> original msgfmt from GNU gettext. 

   I use gettext quite extensively. I use Python's msgfmt to generate
.mo files. I also use Django's compilemessages; I don't know what it uses
internally. It could be an independent implementation or Python's msgfmt.

> I could find that only the original msgfmt was able to generate a hashtable, 
> and that anyway the Python gettext module loaded everything in memory and did 
> not use it. But I also find a TODO note saying 
> 
> # TODO: 
> # - Lazy loading of .mo files. Currently the entire catalog is loaded into 
> # memory, but that's probably bad for large translated programs. Instead, 
> # the lexical sort of original strings in GNU .mo files should be exploited 
> # to do binary searches and lazy initializations. Or you might want to use 
> # the undocumented double-hash algorithm for .mo files with hash tables, but 
> # you'll need to study the GNU gettext code to do this. 
> 
> I have studied GNU gettext code and found that implementing the hashing 
> algorithm in Python would not be that hard. 

   That's interesting!

> The undocumented features required for implementation are: 
> - the version number can safely stay to 0 when processing Python code 
> - the size of the hash table is the first odd prime greater than or equal to 
> 4 * n / 3 where n is the number of strings 
> - the first hashing function uses a variant of PJW hash function described in 
> https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~ 
> high is replaced with h = h ^ high, and using 32-bit integers. The index in 
> the table is the result of the function modulo the size of the hash table 
> - when there is a conflict (the slot given by the first hashing function is 
> already used by another string) the following is used: 
> - let h be the result of the PJW variant hash function and size be the size 
> of the hash table, an increment value is set to 1 +( h % (size -2)) 
> - that increment is repeatedly added to the index in the hash table (modulo 
> the table size) until an empty slot is found (or the correct original string 
> is found) 
> 
> For now, my (alpha) code is able to generate in pure Python the same mo file 
> that GNU msgfmt generates, and use the hashtable to access the strings. 
> 
> Remaining problems: 
> - I had to read GPL copyrighted code to find the undocumented features. I 
> have of course written my own code from scratch, but may I use an Apache Free 
> License 2.1 on it? 

   You should ask a lawyer, and I am not one. But my understanding is
that you can borrow ideas from GPL-licensed code without contaminating
your code with GPL. You cannot copy code -- that makes your code GPL'd.
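
   For reference, the hashing and probing scheme described above could
be sketched in Python roughly as follows. This is written only from the
prose description in this thread, not from any existing implementation.
The function names, the integer division in 4 * n / 3, the minimum table
size of 3 and the convention that a slot holds "string number + 1" with
0 meaning "empty" are all assumptions of the sketch, so it may not
produce a table identical to the one GNU msgfmt writes.

```python
def pjw_hash(data: bytes) -> int:
    """PJW variant from the description: 'h ^= high' instead of 'h &= ~high'."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF  # keep 32 bits (assumption)
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
            h ^= high
    return h


def _is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def table_size(n_strings: int) -> int:
    """First odd prime >= 4 * n / 3 (integer division, minimum 3: assumptions)."""
    size = max(3, (4 * n_strings) // 3) | 1  # force the candidate to be odd
    while not _is_prime(size):
        size += 2
    return size


def _probe(h: int, size: int):
    """Yield the slot sequence: h % size, then steps of 1 + h % (size - 2)."""
    idx = h % size
    incr = 1 + h % (size - 2)
    while True:
        yield idx
        idx = (idx + incr) % size


def build_table(msgids: list) -> list:
    """Return a list of slots; 0 is empty, k > 0 refers to msgids[k - 1]."""
    size = table_size(len(msgids))
    table = [0] * size
    for number, msgid in enumerate(msgids, start=1):
        for idx in _probe(pjw_hash(msgid), size):
            if table[idx] == 0:
                table[idx] = number
                break
    return table


def lookup(table: list, msgids: list, key: bytes):
    """Return the index of key in msgids, or None if it is not present."""
    for idx in _probe(pjw_hash(key), len(table)):
        slot = table[idx]
        if slot == 0:
            return None
        if msgids[slot - 1] == key:
            return slot - 1
```

   Because the table size is prime and the increment is always smaller
than it, the probe sequence visits every slot, and because the table has
more slots than strings an empty slot always ends an unsuccessful
search.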

> - the current code for gettext loads everything from the mo file and 
> immediately closes it. My own code keeps the file opened to be able to access 
> it with the mmap module. There could be use cases where the first option is better 

   There is a third option -- open and close the file on each access.
I'd prefer that option, as file descriptors are a precious resource in
limited supply.
   There is a twist though. The file could be replaced while closed, so
you have to find a way to verify whether it was replaced and, if so,
reread the hash table from it. Perhaps checking the timestamp of the
file (date/time of the last modification) is enough.
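
   A minimal sketch of that timestamp check, assuming a hypothetical
helper class ReloadOnChange and a caller-supplied loader function that
parses whatever needs to be cached (for example the hash table). It
compares the file size as well as the modification time, which is an
extra precaution of the sketch and not something gettext does today.

```python
import os


class ReloadOnChange:
    """Cache data parsed from a file; reparse when the file looks replaced."""

    def __init__(self, path, loader):
        self.path = path
        self.loader = loader      # callable taking an open binary file
        self._stamp = None        # (mtime in ns, size) of the cached version
        self._value = None

    def get(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            # The file is opened only while (re)loading and closed right after.
            with open(self.path, "rb") as fp:
                self._value = self.loader(fp)
            self._stamp = stamp
        return self._value
```

   Usage would be something like ReloadOnChange("messages.mo",
parse_hash_table).get(), where both names are placeholders. The check is
not atomic: the file can still be replaced between os.stat() and open(),
and a replacement with the same size within the timestamp resolution of
the filesystem would go unnoticed.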

> - I should either rely on the current way (load everything in memory) or 
> implement a binary search algo for the case where the hash table is not 
> present (it is of course optional) 
> - it would be an important change, and I think that an option should allow 
> choosing between eager and lazy access 
> 
> Before going further, I would like to know whether implementing lazy access 
> through the hash table that way seems to be an interesting improvement or a 
> dead end. 

   Well, I must admit my .po/.mo files aren't that big. The biggest .po
is 60k; its corresponding .mo is only 30k bytes. I don't know if using
the hash table would give me any improvement.
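
   For a .mo file without a hash table, the binary search mentioned in
the TODO could look roughly like the sketch below. It opens, maps and
closes the file on every lookup and relies on the original strings being
sorted bytewise, as the TODO note says they are in GNU .mo files. The
function name find_translation is made up, and the sketch handles only
little-endian files and exact key matches; plural forms and contexts,
which are stored as part of the key, are not treated specially.

```python
import mmap
import struct

LE_MAGIC = 0x950412DE  # magic number of a little-endian .mo file


def find_translation(path, msgid: bytes):
    """Return the translation of msgid as bytes, or None if it is absent."""
    with open(path, "rb") as fp, \
            mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        # Header: magic, revision, string count, offsets of the two tables.
        magic, _rev, count, orig_off, trans_off = struct.unpack_from("<5I", buf, 0)
        if magic != LE_MAGIC:
            raise ValueError("not a little-endian .mo file")

        def string_at(table_off, i):
            # Each table entry is a (length, file offset) pair of 32-bit ints.
            length, offset = struct.unpack_from("<2I", buf, table_off + 8 * i)
            return buf[offset:offset + length]

        lo, hi = 0, count
        while lo < hi:
            mid = (lo + hi) // 2
            candidate = string_at(orig_off, mid)
            if candidate < msgid:
                lo = mid + 1
            elif candidate > msgid:
                hi = mid
            else:
                return string_at(trans_off, mid)
    return None
```

   Only the header and about log2(n) table entries and strings are
touched per lookup, so nothing has to stay in memory between calls.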

Oleg.
-- 
    Oleg Broytman            https://phdru.name/            [email protected]
           Programmers don't die, they just GOSUB without RETURN.