Re: [gentoo-dev] Locale check in python_pkg_setup()

Brian Harring Thu, 29 Jul 2010 19:38:47 -0700

On Fri, Jul 30, 2010 at 01:16:42AM +0200, Arfrever Frehtes Taifersar Arahesis 
wrote:
> --- python.eclass
> +++ python.eclass
> @@ -355,6 +355,8 @@
>       # Check if phase is pkg_setup().
>       [[ "${EBUILD_PHASE}" != "setup" ]] && die "${FUNCNAME}() can be used only in pkg_setup() phase"
>  
> +     local locale
> +
>       if [[ "$#" -ne 0 ]]; then
>               die "${FUNCNAME}() does not accept arguments"
>       fi
> @@ -407,6 +409,16 @@
>               unset -f python_pkg_setup_check_USE_flags
>       fi
>  
> +     locale="$(python -c 'import os; print(os.environ.get("LC_ALL", os.environ.get("LC_CTYPE", os.environ.get("LANG", "POSIX"))))')"


You're using python to get the exported env.  Don't.  Use bash (you're 
invoking python from freaking bash after all)...

> +     if [[ "${locale}" != *.UTF-8 ]]; then
> +             eerror
> +             eerror "Currently used locale '${locale}' is unsupported and can cause build-time or run-time"
> +             eerror "problems (usually UnicodeDecodeErrors or UnicodeEncodeErrors). Bugs caused by this locale"
> +             eerror "will be closed as invalid. It is recommended to use a UTF-8 locale to avoid problems."
> +             eerror "See http://www.gentoo.org/doc/en/utf-8.xml for information on how to fix locale."
> +             eerror

For cases such as this, ewarn, not eerror.  It's not an actual error, 
it's a potential source of problems people may see.

The more I look into this issue, the more I'm convinced it's not user 
settings that are the problem; the problem is in the code, not in the 
user env.  You've stated in a couple of places that "C/Posix locales are 
not supported", which frankly is very whacked. That's not really a 
proclamation you can make on your own for python, and you're actually 
ignoring that this problem would just as easily rear its head with a 
latin-1 encoded file.


Take a look at bug 302425; the traceback in that is a classic example of 
where they *should* be using bytes mode (they don't need to interpret 
the data, just write the script across, thus bytes).
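
To make the bytes-mode point concrete, here is a minimal sketch. It is not 
the code from that bug; copy_script_bytes and the file names are made up 
for illustration. The script is copied verbatim and never decoded, so the 
locale of the environment cannot matter:

```python
import os
import tempfile


def copy_script_bytes(src_path, dst_path):
    """Copy a script verbatim; the contents are never decoded."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


# Usage: a script containing a latin-1 byte (0xe9) survives untouched,
# whatever LC_ALL/LANG happen to be.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.py")
dst = os.path.join(tmpdir, "out.py")
payload = b"#!/usr/bin/python\n# caf\xe9\nprint('hi')\n"
with open(src, "wb") as f:
    f.write(payload)
copy_script_bytes(src, dst)
```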

Bug 328047 is induced by a patch we add (it's not in upstream python).  
The code in question also is invoking fricking ldd a few steps prior, 
which is questionable in multiple ways; either way, the relevant chunk is
+            os.system("ldd %s > %s" % (do_readline, tmpfile))
+            fp = open(tmpfile)
+            for ln in fp:

So... roughly, it invokes os.system, which will pass the environment 
straight through to ldd, meaning locale gets passed down.

Then it opens the file.  Note it specifies *NO ENCODING*, nor is there 
actually an enforced locale best I can tell, thus ascii being the 
default.  The screwup here is in our patches; said patches should be 
forcing the posix locale for the ldd call (resulting in ascii).  If you 
think through this bug, we've seen this multiple times in grep/sed 
calls; this is literally no different.
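
A rough sketch of what forcing the locale for the child could look like. 
This is not the actual patch: run_with_posix_locale is a hypothetical 
helper, and the demo runs a small python child instead of ldd so it works 
anywhere. The idea is to set LC_ALL=C for the child only and decode its 
output explicitly:

```python
import os
import subprocess
import sys


def run_with_posix_locale(cmd):
    """Run cmd with LC_ALL=C and return its stdout decoded as ASCII."""
    env = dict(os.environ)
    env["LC_ALL"] = "C"  # takes precedence over LANG and other LC_* vars
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True)
    # Decode explicitly rather than relying on the caller's locale.
    return result.stdout.decode("ascii", errors="replace")


# Demo: the child reports the LC_ALL it was handed. A real caller would
# pass something like ["ldd", path_to_module] instead.
child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
output = run_with_posix_locale(child)
```

Passing an argument list also avoids the shell redirect and temp file 
that the os.system call needs.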

Bug 287439 is a screw up in the program's source... should've been 
using bytes (non arguable).  Matter of fact, while generally I think 
Tarek knows what the hell he's doing, the skip they added to the 
tests ignored an actual valid bug in setuptools/distribute: shebangs 
from the standpoint of the kernel need to be consistent.  Thus reading 
the shebang line itself should be done in bytes, then converted to 
ascii and interpreted. They tried opening the file (in whole) in 
bytes, meaning they tried enforcing ascii across the whole buffer, 
not just the first line.  Program bug.
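
Roughly what I mean for the shebang case, as a sketch only (read_shebang 
is a made-up name, this is not the setuptools/distribute code): read the 
first line as bytes, decode just that line as ascii, and leave the rest 
of the file alone.

```python
import os
import tempfile


def read_shebang(path):
    """Return the interpreter line of a script as ASCII text, or None."""
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        return None
    # Only the first line is decoded; the body is never interpreted.
    return first_line[2:].strip().decode("ascii")


# Usage: the body holds a latin-1 byte (0xe9) that would break an
# ascii decode of the whole buffer, yet the shebang reads fine.
fd, script = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"#!/usr/bin/python3 -E\n# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n")
shebang = read_shebang(script)
os.remove(script)
```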

These bugs I got via searching for 'ALL python locale', and 
identifying the ones that were actually locale related.  At this 
point I've looked into the source of 3 bugs, meaning literally 3 bugs 
checked, 3 instances where the code was wrong.

I'll leave it as an exercise for others to keep digging, but the point 
here is that the programs themselves screw up their locale handling; 
trying to force all systems to use a utf-8 locale for the env is just 
a hack instead of fixing the actual issue.  A pretty bad hack 
considering I've spent all of 30 minutes digging into this and rooting 
out the actual flaws in the src, I might add.

For shits and giggles, let's add one more bug in, one that has the 
potential of rearing its head in random consuming pkgs: bug 322425 
(docutils's build_html being flawed). Their encoding handling is 
intrinsically flawed.  The encoding of a file they're 
installing/parsing should be determined by the file itself, not by 
attempting to arbitrarily force it to whatever locale the user happens 
to be running (which is exactly the first thing buildhtml.py attempts, 
literally `locale.setlocale(locale.LC_ALL, '')` at line 20).  The 
issue is not people using ascii locales, the issue is that these tools 
do not handle encoding correctly.
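
As an illustration of "determined by the file itself": python source 
declares its own encoding via the PEP 263 coding cookie, and the stdlib's 
tokenize.detect_encoding reads it. This is only an analogy, not what 
docutils does or should do for its inputs; decode_python_source is a 
hypothetical helper.

```python
import io
import tokenize


def decode_python_source(raw):
    """Decode Python source bytes using the encoding the file declares."""
    # detect_encoding honours a BOM or a PEP 263 cookie and falls back
    # to utf-8; the user's locale is never consulted.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return raw.decode(encoding)


# Usage: latin-1 bytes decode correctly even under a C/POSIX locale.
raw = "# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n".encode("latin-1")
text = decode_python_source(raw)
```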

Recall, one of the purposes of py3k going bytes vs text (aka unicode) 
was to make clear that textual data's encoding needs to be known.  None 
of this code is actually forcing/handling the encoding for the data 
it deals in, meaning these are literal bugs, exposed purely due to 
py3k actually enforcing encoding in normal file opens.

So... this is a big -1 on adding such a warning (especially 
considering it doesn't actually resolve the raw issues, it just 
sidesteps a couple of cases).

Fix the actual problem instead...

Finally, cc'ing QA since this is a class of bugs they should be aware 
of with py3k.  This is a bit of a sign that a lot of source isn't 
really py3k ready yet either imo, but so it goes...

~harring
