Hi Uche,

[I'm CCing the PyTables users list just in case others want to
collaborate in the effort of implementing unicode attributes]

On Tuesday 31 August 2010 at 18:36:02, you wrote:
>   Ok, I have been digging into the code a bit and I see some possible
> ways of achieving unicode support for HDF5 attributes. First of all, the
> character set of the string datatype has to be changed. The HDF5 API
> already provides this. But the question is whether I should create a new
> "unicode" datatype (just by copying the native string datatype) or
> directly change the character set of the native string datatype. The
> latter would imply that every string in the HDF5 file has to be stored
> UTF-8 encoded. I don't know the full impact of this modification yet. If
> people had strictly used the 7-bit ASCII charset so far, there wouldn't
> be any conflicts with UTF-8. But I don't think this is the case,
> especially in Europe.
> Then, when it comes to the implementation of the attribute set, I
> noticed that the attribute data is preferably collected into a NumPy
> array and then passed to the function which does the actual interfacing
> with the HDF5 API (_g_setAttr). I see that this approach has the
> advantage of avoiding handling too many special cases for the different
> possible Python data types. However, it means that a unicode string has
> to be encoded (for NumPy), decoded and then encoded again (for HDF5,
> which is UTF-8). Since setting an attribute is not (and should not be)
> performance critical, I would leave it like that, so that only minimal
> changes to the code are necessary.

Yes, this is the approach that I'd take.  NumPy uses UCS-4 for encoding 
unicode data internally, while HDF5 uses UTF-8.  So it's only a matter of 
encoding into UTF-8 for writing and decoding from UTF-8 for reading.
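
You can actually see the UCS-4 storage from NumPy itself; as a quick check 
in a fresh session, the itemsize is 4 bytes per character:

In [1]: import numpy as np

In [2]: np.array([u"abc"]).dtype
Out[2]: dtype('<U3')

In [3]: np.array([u"abc"]).dtype.itemsize   # 3 characters * 4 bytes (UCS-4)
Out[3]: 12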

You can use the codecs module in Python to facilitate this task (I suppose 
that you will need a unicode-aware mail user agent to read this):

In [1]: import codecs

In [2]: import numpy as np

Let's create a Unicode array:

In [3]: arr = np.array([u"açò i allò"])

Now, get the codec object for UTF-8:

In [4]: c = codecs.lookup('utf-8')

In [5]: c.encode(arr[0])
Out[5]: ('a\xc3\xa7\xc3\xb2 i all\xc3\xb2', 10)

The value to pass into HDF5 is the first element of this tuple:

In [6]: value_to_write = c.encode(arr[0])[0]

In [7]: value_to_write
Out[7]: 'a\xc3\xa7\xc3\xb2 i all\xc3\xb2'

And for reading, it is just a matter of using .decode() from UTF-8:

In [8]: c.decode(value_to_write)
Out[8]: (u'a\xe7\xf2 i all\xf2', 13)

In [9]: read_value = c.decode(value_to_write)[0]

In [10]: read_value
Out[10]: u'a\xe7\xf2 i all\xf2'

which can be converted back into a NumPy array:

In [11]: np.array([read_value])
Out[11]:
array([u'a\xe7\xf2 i all\xf2'],
      dtype='<U10')

We can check that, after the roundtrip, this is the same as the original 
array:

In [12]: arr
Out[12]:
array([u'a\xe7\xf2 i all\xf2'],
      dtype='<U10')
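
In practice you would wrap these two steps into a pair of tiny helpers 
near the attribute get/set code.  Here is a minimal sketch (encode_attr 
and decode_attr are hypothetical names, not part of the current PyTables 
API):

import codecs

_utf8 = codecs.lookup('utf-8')

def encode_attr(uvalue):
    # Encoders from the codecs module return a (bytes, chars_consumed)
    # tuple; HDF5 only needs the UTF-8 bytes in the first slot.
    return _utf8.encode(uvalue)[0]

def decode_attr(raw):
    # Symmetrically, decoders return a (unicode, bytes_consumed) tuple.
    return _utf8.decode(raw)[0]

With these, the whole roundtrip is a one-liner:

In [13]: decode_attr(encode_attr(arr[0])) == arr[0]
Out[13]: True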


In [1]: import numpy as np

In [2]: import codecs

In [7]: arr = np.array([u"Açó i allò"])

In [8]: arr[0]
Out[8]: u'A\xc3\xa7\xc3\xb3 i all\xc3\xb2'

In [9]: c = codecs.lookup('utf-8')

and use it for encoding the first element of the array:

In [10]: c.encode(arr[0])
Out[10]: ('A\xc3\x83\xc2\xa7\xc3\x83\xc2\xb3 i all\xc3\x83\xc2\xb2', 13)

so, what you have to pass into HDF5 is the first element of the above tuple:

In [11]: c.encode(arr[0])[0]
Out[11]: 'A\xc3\x83\xc2\xa7\xc3\x83\xc2\xb3 i all\xc3\x83\xc2\xb2'


In [13]: c.decode(np.array([c.encode(arr[0])[0]])[0])
Out[13]: (u'A\xc3\xa7\xc3\xb3 i all\xc3\xb2', 19)

In [14]: c.decode(np.array([c.encode(arr[0])[0]])[0])[0]
Out[14]: u'A\xc3\xa7\xc3\xb3 i all\xc3\xb2'

In [15]: u"Açó i allò"
Out[15]: u'A\xc3\xa7\xc3\xb3 i all\xc3\xb2'

In [21]: np.array([c.decode(np.array([c.encode(arr[0])[0]])[0])[0]])
Out[21]:
array([u'A\xc3\xa7\xc3\xb3 i all\xc3\xb2'],
      dtype='<U13')

That might seem a bit complicated, but thanks to the codecs module it is not.
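
By the way, for one-off values the plain unicode.encode()/str.decode() 
methods produce exactly the same bytes; a quick check:

In [14]: arr[0].encode('utf-8') == c.encode(arr[0])[0]
Out[14]: True

Looking the codec up once with codecs.lookup() just saves repeating the 
codec name on every call.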

Good luck!

> On 12.08.2010 13:57, Francesc Alted wrote:
> > 2010/8/12, Uche Mennel <men...@even-ag.ch>:
> >> Ok, I agree that the resulting variable-length representation of UTF-8
> >> encoded strings is not optimal for storing large blocks of data (btw. I
> >> didn't even know that NumPy supports unicode at all). Actually, I was
> >> talking about supporting UTF-8 strings in metadata, e.g. attribute
> >> names and values. I don't see any reason why UTF-8 encoded strings
> >> should be problematic there. And, since it is natively supported by
> >> HDF5, it should be easy to implement, I guess.
> >
> > Ahh, definitely, supporting UTF-8 in attributes would be relatively easy.
> >
> >> As I already explained, that is not my intention. I would rather be
> >> interested in a patch for storing metadata strings as unicode strings.
> >
> > Ok then.  So have a look at how the different datatypes are implemented
> > in the AttributeSet extension in hdf5Extension.pyx to get an idea of
> > what you should touch to implement this.  Tell me if you run into
> > difficulties.
> >
> >> Btw. I think that creating a VLUTF8Atom would call for renaming
> >> VLUnicodeAtom to VL<the current encoding>Atom, and that would imply a
> >> major change of the API.
> >
> > True.  I should think a bit more about this (in case we want to really
> > implement it).
> >
> > Best,
> 

-- 
Francesc Alted
