On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> >     I try to print my fileContent array after I read it and it looks
> >     like this :
> >
> >     ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> >       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> >       "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>


> you have the bytes representation and a duplicate slash in it.
>

the duplicate slash confuses me, but I'm not running py3 to test, so...


> np.loadtxt(file, dtype=bytes).astype(str)
>
> for non ascii I guess you should use python directly as numpy would also
> require a python loop with explicit decoding.
>
> Currently handling strings in python3 with numpy is even worse than
> before, you always have to go over bytes and do explicit decodes to get
> python strings out of ascii data.
>

There is a MASSIVE set of threads on Python-dev about better support for
ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have
two issues here that could be addressed:

1) loadtxt behavior -- it's a really, really common case for data files
suitable for loadtxt to be ascii, but they could also be in another encoding
-- so loadtxt should have the option to specify the encoding (default to
ascii? or ascii-compatible?)

The trick here is handling both these cases correctly -- clearly loadtxt is
broken on py3 now. This example works fine under py2.

It seems to be reading the file as bytes, then passing those bytes off to a
unicode string (str in py3) without specifying an encoding, which I think
is how that b'...' junk gets in there.
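
To illustrate what I think is going on, here is a minimal py3 sketch. The
byte string is just an assumed stand-in for one line of the file, and I have
not checked that loadtxt does literally this internally:

```python
# Assumed file content: one Windows path, as raw bytes read from disk.
raw = b'C:\\Users\\Documents\\Project\\mytextfile1.txt'

# str() on a bytes object with no encoding returns its repr():
# the b'...' wrapper is kept and every backslash is escaped (doubled).
wrong = str(raw)

# Decoding explicitly yields the actual text.
right = raw.decode('ascii')

print(wrong)
print(right)
```

That would also account for the duplicate slash: the backslashes get escaped
once by the bytes repr, and then again when the array itself is printed.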

note that: np.loadtxt('pathlist.txt', dtype=unicode) works fine on py2 as
well:

In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
Out[7]:
array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
       u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
       u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
      dtype='<U42')

which is what should happen in py3. So the internal loadtxt code must be
confusing bytes and unicode objects...

Anyway, this should work, and there should be an obvious way to spell it.
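
In the meantime, one way to spell it by hand is to decode explicitly before
building the array. A rough sketch, assuming a file with one string per line;
read_text_column and its encoding argument are hypothetical names of mine,
not numpy API, and the temp file only stands in for pathlist.txt:

```python
import os
import tempfile

import numpy as np


def read_text_column(path, encoding='ascii'):
    """Hypothetical helper: read one string per line, decoding explicitly."""
    with open(path, 'rb') as f:
        # Decode each raw line with the caller's encoding; strip the newline.
        lines = [line.decode(encoding).rstrip('\r\n') for line in f]
    # Drop blank lines before building the unicode array.
    return np.array([line for line in lines if line])


# Demo with a throwaway file standing in for pathlist.txt.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'C:\\Users\\Documents\\Project\\mytextfile1.txt\n'
              b'C:\\Users\\Documents\\Project\\mytextfile2.txt\n')

paths = read_text_column(tmp.name)
os.remove(tmp.name)
print(paths)
```

The default of 'ascii' here is only one answer to the "which default?"
question above; anything non-ascii then fails loudly instead of being
silently mangled.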

2) numpy string types -- it seems numpy already has both a string type
and a unicode type -- perhaps some renaming or better documentation is in
order:
   the string type 'S10', for example, should be clearly defined as 1 byte
per character, ascii-compatible.

I'm not sure how many bytes the unicode type has, but it may make sense to
be able to choose UCS-2 or UCS-4 -- though memory is cheap, so I'd probably go
with UCS-4 and be done with it.
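
The per-character size is easy to check from the dtype itself. A small
sketch; on the numpy builds I know of, the unicode type comes out at 4 bytes
per character, i.e. already UCS-4, but treat that as something to verify on
your own install rather than a guarantee:

```python
import numpy as np

# 'S10': a byte-string type holding up to ten 1-byte characters.
s = np.dtype('S10')

# 'U10': a unicode type holding up to ten characters.
u = np.dtype('U10')

# itemsize is the storage per array element, in bytes.
print(s.itemsize)
print(u.itemsize)
print(u.itemsize // 10)  # bytes per unicode character
```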



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
