Hi Uche,
[I'm cc'ing the pytables users list just in case others want to
collaborate in the effort of implementing Unicode attributes]
On Tuesday 31 August 2010 18:36:02, you wrote:
> Ok, I have been digging into the code a bit and I see some possible
> ways of achieving Unicode support for HDF5 attributes. First of all, the
> character set of the string datatype has to be changed; the HDF5 API
> already provides this. But the question is whether I should create a new
> "unicode" datatype (just by copying the native string datatype) or
> directly change the character set of the native string datatype. The
> latter would imply that every string in the HDF5 file has to be stored
> UTF-8 encoded, and I don't know the full impact of that modification yet.
> If people have strictly used the 7-bit ASCII charset so far, there won't
> be any conflicts with UTF-8, but I don't think this is the case,
> especially in Europe.
> Then, when it comes to the implementation of the attribute set, I
> noticed that the attribute data is preferably collected into a NumPy
> array and then passed to the function that does the actual interfacing
> with the HDF5 API (_g_setAttr). I see that this approach has the
> advantage of avoiding too many special cases for the different possible
> Python data types. However, it means that a Unicode string has to be
> encoded (for NumPy), decoded and then encoded again (for HDF5, which
> uses UTF-8). Since setting an attribute is not (and should not be)
> performance-critical, I would leave it like that, so only minimal code
> changes are necessary.
Yes, this is the approach that I'd take. NumPy uses UCS-4 internally for
Unicode data, while HDF5 uses UTF-8. So it's only a matter of encoding
into UTF-8 when writing and decoding from UTF-8 when reading.
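As for the datatype side, the relevant HDF5 call is H5Tset_cset(). Just
to sketch the idea (shown here through h5py's low-level bindings, purely
for illustration and assuming an h5py built against HDF5 1.8; in PyTables
the equivalent H5Tcopy()/H5Tset_cset() calls would live in
hdf5Extension.pyx):

import h5py   # illustration only; PyTables does this at the C level

# Copy the native 1-byte string type and switch its character set
# from the default ASCII to UTF-8.
tid = h5py.h5t.C_S1.copy()
tid.set_size(64)                    # room for the UTF-8 encoded bytes
tid.set_cset(h5py.h5t.CSET_UTF8)    # H5Tset_cset(tid, H5T_CSET_UTF8)

Whether to do that on a copied "unicode" type or on the native string
type is exactly the decision you mention above.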
You can use the codecs module in Python to facilitate this task (I suppose
that you will need a unicode-aware mail user agent to read this):
In [1]: import codecs
In [2]: import numpy as np
Let's create a Unicode array:
In [3]: arr = np.array([u"açò i allò"])
Now, get the codec object for UTF-8:
In [4]: c = codecs.lookup('utf-8')
In [5]: c.encode(arr[0])
Out[5]: ('a\xc3\xa7\xc3\xb2 i all\xc3\xb2', 10)
The value to pass into HDF5 is the first element of this tuple:
In [6]: value_to_write = c.encode(arr[0])[0]
In [7]: value_to_write
Out[7]: 'a\xc3\xa7\xc3\xb2 i all\xc3\xb2'
And for reading, it is just a matter of using .decode() for UTF-8:
In [8]: c.decode(value_to_write)
Out[8]: (u'a\xe7\xf2 i all\xf2', 13)
In [9]: read_value = c.decode(value_to_write)[0]
In [10]: read_value
Out[10]: u'a\xe7\xf2 i all\xf2'
which can be converted back into a NumPy array:
In [11]: np.array([read_value])
Out[11]:
array([u'a\xe7\xf2 i all\xf2'],
      dtype='<U10')
We can check that, after the roundtrip, this is the same as the original
array:
In [12]: arr
Out[12]:
array([u'a\xe7\xf2 i all\xf2'],
      dtype='<U10')
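For an explicit sanity check, comparing the arrays directly also works:
In [13]: (np.array([read_value]) == arr).all()
Out[13]: True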
That might seem a bit complicated, but thanks to the codecs module it is not.
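If you want to keep the changes to AttributeSet minimal, one possibility
(only a sketch; the helper names below are made up) is to concentrate the
conversion in two small functions:

import codecs

_utf8 = codecs.lookup('utf-8')

def _encode_attr_value(value):
    # Unicode attribute values become UTF-8 bytes for HDF5;
    # anything else passes through untouched.
    if isinstance(value, unicode):
        return _utf8.encode(value)[0]
    return value

def _decode_attr_value(raw):
    # UTF-8 bytes read back from HDF5 become unicode again.
    return _utf8.decode(raw)[0]

Then the writing path (_g_setAttr) would call _encode_attr_value() just
before handing the value to HDF5, and the reading path would apply
_decode_attr_value() to the raw bytes.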
Luck!
> On 12.08.2010 13:57, Francesc Alted wrote:
> > 2010/8/12, Uche Mennel <[email protected]>:
> >> Ok, I agree that the resulting variable-length representation of
> >> UTF-8 encoded strings is not optimal for storing large blocks of data
> >> (btw. I didn't even know that NumPy supported Unicode at all).
> >> Actually, I was talking about supporting UTF-8 strings in metadata
> >> such as attribute names and values. I don't see any reason why UTF-8
> >> encoded strings should be problematic there. And, since it is natively
> >> supported by HDF5, it should be easy to implement, I guess.
> >
> > Ahh, definitely, supporting UTF-8 in attributes would be relatively easy.
> >
> >> As I already explained, that is not my intention. I would rather be
> >> interested in a patch for storing metadata strings as Unicode strings.
> >
> > Ok then. So have a look at how the different datatypes are implemented
> > in the AttributeSet extension in hdf5Extension.pyx to get an idea of
> > what you should touch to implement this. Tell me if you run into
> > difficulties.
> >
> >> Btw., I think that creating a VLUTF8Atom would entail renaming
> >> VLUnicodeAtom to VL<the current encoding>Atom, and that would imply
> >> a major API change.
> >
> > True. I should think a bit more about this (in case we want to really
> > implement it).
> >
> > Best,
>
--
Francesc Alted