On Wed, Jan 15, 2014 at 4:38 AM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:

> > I try to print my fileContent array after I read it and it looks
> > like this :
> >
> > ["b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile1.txt'"
> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile2.txt'"
> > "b'C:\\\\Users\\\\Documents\\\\Project\\\\mytextfile3.txt'"]
>
> you have the bytes representation and a duplicate slash in it.
>

the duplicate slash confuses me, but I'm not running py3 to test, so...

> np.loadtxt(file, dtype=bytes).astype(str)
>
> for non ascii I guess you should use python directly as numpy would also
> require a python loop with explicit decoding.
>
> Currently handling strings in python3 with numpy is even worse than
> before, you always have to go over bytes and do explicit decodes to get
> python strings out of ascii data.
>

There is a MASSIVE set of threads on Python-dev about better support for ASCII and ASCII+binary data in py3 -- but in the meantime, I think we have two issues here that could be addressed:

1) loadtxt behavior -- it's a really, really common case for data files suitable for loadtxt to be ascii, but they could also be in another encoding -- so loadtxt should have an option to specify the encoding (default to ascii? or ascii-compatible?)

The trick here is handling both of these cases correctly -- clearly loadtxt is broken on py3 now. This example works fine under py2. It seems to be reading the file as bytes, then passing those bytes off to a unicode string (str in py3) without specifying an encoding (which I think is how that b' ...' junk gets in there).

Note that:

    np.loadtxt('pathlist.txt', dtype=unicode)

works fine on py2 as well:

    In [7]: np.loadtxt('pathlist.txt', dtype=unicode)
    Out[7]:
    array([u'C:\\Users\\Documents\\Project\\mytextfile1.txt',
           u'C:\\Users\\Documents\\Project\\mytextfile2.txt',
           u'C:\\Users\\Documents\\Project\\mytextfile3.txt'],
          dtype='<U42')

which is what should happen in py3. So the internal loadtxt code must be confusing bytes and unicode objects...
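
To illustrate the guess above, here is a minimal pure-Python 3 sketch (no numpy, so it says nothing definite about loadtxt's internals). It shows that calling str() on a bytes object without an encoding yields the b'...' wrapper and doubled backslashes, which looks consistent with the output reported in this thread, while an explicit decode gives the intended text. The helper read_paths is hypothetical, not a numpy function; it just reads one path per line with an explicit encoding, along the lines of the "use python directly" suggestion.

```python
import os
import tempfile

# A path as it would arrive if the file were read in binary mode.
raw = b"C:\\Users\\Documents\\Project\\mytextfile1.txt"

# str() on bytes with no encoding returns the repr of the bytes object:
# the b'...' wrapper is kept and every backslash is escaped (doubled).
mangled = str(raw)

# Decoding with an explicit encoding gives the intended text.
decoded = raw.decode("ascii")


def read_paths(filename, encoding="ascii"):
    """Hypothetical helper: read one path per line, decoding explicitly."""
    with open(filename, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Round-trip through a temporary file to show the helper in use.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="ascii"
) as tmp:
    tmp.write(decoded + "\n")
paths = read_paths(tmp.name)
os.remove(tmp.name)

print(mangled)   # starts with b' and has doubled backslashes
print(decoded)   # the plain path
print(paths)     # list holding the plain path
```

The resulting list of str objects can then be handed to np.array if an array is needed.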

Anyway, this should work, and there should be an obvious way to spell it.

2) numpy string types -- it seems numpy already has both a string type and a unicode type -- perhaps some renaming or better documentation is in order: the string type 'S10', for example, should be clearly defined as 1-byte-per-character, ascii-compatible. I'm not sure how many bytes per character the unicode type uses, but it may make sense to be able to choose UCS-2 or UCS-4 -- though memory is cheap, so I'd probably go with UCS-4 and be done with it.

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov