Re: [Numpy-discussion] String type again.

Aldcroft, Thomas Fri, 18 Jul 2014 07:37:29 -0700

On Thu, Jul 17, 2014 at 11:52 AM, Nathaniel Smith <n...@pobox.com> wrote:


> On Tue, Jul 15, 2014 at 7:40 PM, Aldcroft, Thomas
> <aldcr...@head.cfa.harvard.edu> wrote:
> >
> > On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
> >>
> >> OTOH, fixed length nul padded latin1 would be useful for various flat
> >> file reading tasks.
> >
> > As one of the original agitators for this, let me re-iterate that what
> > the astronomical community *really* wants is the original proposal as
> > described by Chris Barker [1] and essentially what Charles said.  We have
> > large data archives that have ASCII string data in binary formats like
> > FITS and HDF5. The current readers for those datasets present users with
> > numpy S data types, which in Python 3 cannot be compared to str (unicode)
> > literals. In many cases those datasets are large, and in my case I
> > regularly deal with multi-Gb sized bytestring arrays.  Converting those
> > to a U dtype is not practical.
>
> This feedback is *super* useful, thanks. Can you elaborate a bit
> more on your requirements?
>
> I get that:
> - You have data that is treated as text, so it is convenient to be
> able to use Python strings for things like equality tests, np.sum(arr
> == "green") etc.
> - Your data uses only ASCII characters, and you don't want to spend
> more than 1 byte of memory per character.
>
> Do you ever have 8 bit characters, and if so, what encoding do you use?
>

No.

>
> Does it matter to you that the memory layout for these 1-byte-per-char
> strings remain fixed-width nul-padded concatenated strings (e.g.,
> because you are mmap'ing files that have this format)? Or do FITS/HDF5
> handle layout details internally and you don't care so long as the
> above requirements are met?
>

Yes, memory layout matters since mmap'ing files is a key feature in FITS.
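
As a rough illustration of why the layout matters, here is a minimal sketch
of memory-mapping fixed-width nul-padded strings with np.memmap. The file
contents, the 8-byte width, and the variable names are assumptions made up
for this example; a real FITS file has headers and table structure that a
reader such as astropy.io.fits handles, and none of that is modelled here.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: three strings stored as 8-byte, nul-padded records.
records = [b"green", b"red", b"blue"]
width = 8
raw = b"".join(r.ljust(width, b"\x00") for r in records)

# Write the raw bytes to a temporary file to stand in for a data file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(raw)

# Map the file directly as an array of fixed-width byte strings.
# No copy or decoding step is needed because the on-disk layout
# already matches the in-memory layout of the "S8" dtype.
mapped = np.memmap(path, dtype="S8", mode="r")
values = mapped.tolist()
n_green = int(np.sum(mapped == b"green"))

# Release the mapping before deleting the temporary file.
del mapped
os.remove(path)
```

The comparison above has to use a bytes literal (b"green"); that is the
Python 3 inconvenience this thread is about.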


>
> Does the fixed-width nature of numpy strings cause problems in the
> above setting?
>

No.  In particular FITS is ubiquitous as the binary data transport format
in astronomy, and it specifies fixed width strings, so fixed width in numpy
is a good thing in this case.  More generally legacy (or even modern
high-performance) Fortran / C will commonly handle string arrays as arrays
of fixed width characters.  In the majority of cases these codes (that I'm
aware of) know nothing about unicode.
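
To make the "arrays of fixed width characters" point concrete, the sketch
below treats a plain byte buffer the way a C char[2][8] array would be laid
out and reads it with numpy without copying. The buffer contents and the
width of 8 are invented for the example.

```python
import numpy as np

# Two 8-byte, nul-padded records back to back, as a C char[2][8] would be.
buf = b"green\x00\x00\x00" + b"red\x00\x00\x00\x00\x00"

# Interpret the buffer as fixed-width byte strings (no copy is made).
arr = np.frombuffer(buf, dtype="S8")
strings = arr.tolist()

# The same memory can be viewed as a 2-D grid of single characters.
chars = arr.view("S1").reshape(len(arr), 8)
first_letters = chars[:, 0].tolist()

# The array's bytes are exactly the original buffer.
roundtrip_ok = arr.tobytes() == buf
```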

This all works transparently with Python 2 + Numpy, so the goal is to have
that same "it just works" capability in Python 3 with minimal code changes.
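
For readers following along, a small sketch of the trade-off under
discussion, using made-up data. In Python 3 an S array compares cleanly
against a bytes literal, while getting str comparisons today means
converting to a U dtype, which stores 4 bytes per character.

```python
import numpy as np

# Hypothetical column of ASCII labels as a reader might return it.
arr = np.array([b"green", b"red", b"green"], dtype="S5")

# Works in Python 3, but requires a bytes literal.
n_bytes_match = int(np.sum(arr == b"green"))

# Converting to unicode allows str literals, at 4x the memory.
as_unicode = arr.astype("U5")
n_str_match = int(np.sum(as_unicode == "green"))

bytes_per_item_s = arr.dtype.itemsize         # 1 byte per character
bytes_per_item_u = as_unicode.dtype.itemsize  # 4 bytes per character
```

For a multi-gigabyte S array, that factor of four is what makes the U
conversion impractical.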

Thanks,
Tom


>
> -n
>
> --
> Nathaniel J. Smith
> Postdoctoral researcher - Informatics - University of Edinburgh
> http://vorpus.org
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
