You wrote:

 > I want to use this encoding
 > <https://github.com/vanangamudi/tace16-utf8-converter/blob/master/tace16.py>
 > for Tamil language text

As written, it sounds like you just want help.  If so, this list is
for proposals to change Python itself (including the standard
library), and this should have been posted to python-list or to
StackExchange.

If you do mean to propose this for the stdlib, it is highly unlikely
to get in as proposed since the encoding commandeers private space in
the BMP, which is a scarce resource.  We *can* do that, but it's very
likely that the general sentiment will be "do it in a PyPI module,
then it *can't* cause anybody else any trouble."  In principle it's
not our job to "fix" Unicode.  That's the work of the relevant
national standards body for Tamil and the Unicode Consortium.  (I am
not authoritative, so if that's what you want, don't take my word for
it.  I just want you to be prepared for what I expect to be strong
pushback, and what the argument will be.)

About the proposal:

If you are planning to use TACE16 as an interchange format, you don't
need a codec; you just treat it as normal UTF-8 (or any other UTF, for
that matter).  Python does not care whether a character is standard or
private; it just adds it to the str the codec is building.
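
A minimal sketch of that point, assuming U+E200 as an arbitrary
private-use code point (it is not meant to be an actual TACE16
assignment): the stock UTF-8 codec round-trips it exactly like a
standard Tamil letter.

```python
import unicodedata

# U+0B95 is TAMIL LETTER KA from the standard Tamil block.
# U+E200 is an arbitrary BMP private-use code point chosen for
# illustration; it is NOT claimed to be a real TACE16 assignment.
text = "\u0b95 \ue200"

data = text.encode("utf-8")        # the ordinary UTF-8 codec, nothing custom
roundtrip = data.decode("utf-8")

# General categories: "Lo" = other letter, "Zs" = space, "Co" = private use.
categories = [unicodedata.category(ch) for ch in roundtrip]
print(roundtrip == text, categories)
```

The private-use character survives unchanged; Python only reports a
different general category for it.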

If you propose to use the codec to translate standard Unicode to
TACE16 as the internal format, the obvious (rough) idea would be to
just plug the converter you have written into the stdlib's Unicode
codecs as a post-processor when there is a Unicode character in the
(standard) Tamil block.  This would then handle both the standard
Unicode encoding for Tamil, as well as TACE16 (because it would just
pass through the UTF-8 part, and the converter would ignore it).

You may want two separate codecs for output: one which produces TACE16
for you, and another which produces standard Unicode for anyone who
doesn't have TACE16 capability.

Exactly how to do that is above my pay grade; it depends on how the
postprocessor works, which depends on Tamil language knowledge that I
don't have.  Whether to rewrite the converter in C is up to you; it's
possible to call Python from C.

 > Two basic questions,
 > 
 >    1. How do I approach writing a new text encoding codec for
 >       python and register it with the codec module.

Start here (paths are relative to the root of the CPython source tree):
Doc/library/codecs.rst
Doc/c-api/codec.rst

To write them in C, follow the code in the files below.
Likely needed (I forgot where the Unicode codecs live; try codecs.[ch] first):
Python/codecs.c
Include/codecs.h
Objects/stringlib/codecs.h
Objects/unicodectype.c
Lib/codecs.py
Modules/_codecsmodule.c
Probably not needed:
Modules/cjkcodecs
Modules/clinic/_codecsmodule.c.h
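
For a pure-Python starting point, here is a rough sketch of registering
a codec with codecs.register().  Everything specific in it is an
assumption for illustration: the two-entry mapping is made up (the real
table would come from your converter), the helper names are
hypothetical, and the codec name simply reuses "utf-8-tace16".  It
decodes UTF-8 and then post-processes standard Tamil sequences into
private-use characters; text that is already private-use passes through
untouched.

```python
import codecs

# HYPOTHETICAL mapping: standard Tamil sequence -> private-use code point.
# These values are invented for the example; they are not real TACE16.
_TO_PRIVATE = {"\u0b95\u0bbe": "\ue201", "\u0b95": "\ue200"}
_TO_STANDARD = {v: k for k, v in _TO_PRIVATE.items()}

def _to_private(text):
    # Replace longer sequences first so KA + vowel sign wins over bare KA.
    for src in sorted(_TO_PRIVATE, key=len, reverse=True):
        text = text.replace(src, _TO_PRIVATE[src])
    return text

def _to_standard(text):
    for src, dst in _TO_STANDARD.items():
        text = text.replace(src, dst)
    return text

def _decode(data, errors="strict"):
    # Let the stock UTF-8 decoder do the byte-level work, then post-process.
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return _to_private(text), consumed

def _encode(text, errors="strict"):
    # This variant emits standard Unicode for readers without TACE16 support.
    data, _ = codecs.utf_8_encode(_to_standard(text), errors)
    return data, len(text)

def _search(name):
    # Python 3.9+ passes the name with hyphens turned into underscores;
    # normalizing here keeps the check working on older versions too.
    if name.replace("-", "_") == "utf_8_tace16":
        return codecs.CodecInfo(encode=_encode, decode=_decode,
                                name="utf-8-tace16")
    return None

codecs.register(_search)

raw = "\u0b95\u0bbe\u0b95".encode("utf-8")   # standard Unicode on the wire
internal = raw.decode("utf-8-tace16")        # private-use form inside Python
back = internal.encode("utf-8-tace16")       # standard Unicode again
```

A second search entry with an encoder that skips _to_standard() would
give you the other output codec, the one that writes the private-use
form as-is.  This sketch has no incremental or stream support, so it
only covers bytes.decode() and str.encode() on whole strings.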

 >    2. How would I convert utf-8 encoded pattern for regex into the
 >       custom codec so that the pattern and input string for
 >       re.match/search is consistent.

You don't.  That's the point of the codec: you convert all text
(including source program text) into an internal "abstract text" type
(i.e., str), and then it "just works".  Instead, you would read program
text as utf-8-tace16 by placing a PEP 263 coding cookie in one of the
first two lines of your program, like this:

    # -*- encoding: utf-8-tace16 -*-

If you think that's ugly, read the PEP for alternative forms.  If you
want to avoid it entirely, I'm not sure it's possible, but python-list
or StackExchange are better places to ask.
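
To illustrate the "just works" part: once the pattern and the input are
both str objects in the same internal form, re treats private-use
characters like any others.  The code points below are made up for the
example and are not real TACE16 values.

```python
import re

# HYPOTHETICAL private-use code points standing in for two Tamil
# syllables; the actual values would come from the TACE16 tables.
KA = "\ue200"
KAA = "\ue201"

text = KA + KAA + KA                  # what a decoder would hand you as str
pattern = re.compile(f"[{KA}{KAA}]+")

match = pattern.fullmatch(text)

# In this form each syllable is one code point, so "." matches a whole
# syllable; the standard encoding of the same text needs four code points.
units_private = re.findall(".", text)
units_standard = re.findall(".", "\u0b95\u0b95\u0bbe\u0b95")
print(match is not None, len(units_private), len(units_standard))
```

Neither the pattern nor the input needed any conversion step of its
own; both were simply str in the same form.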

Regards,
Steve

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6KE3I2GU2YP4YXW2NZOGF7WY3E77TIYJ/
Code of Conduct: http://python.org/psf/codeofconduct/
