Hi Uche,

[I'm ccing the pytables users' list just in case others want to collaborate in the effort of implementing unicode attributes]
On Tuesday 31 August 2010 18:36:02, you wrote:
> Ok, I have been digging into the code a bit and I see some possible
> ways of achieving unicode support for hdf5 attributes. First of all, the
> character set of the string data type has to be changed. The HDF5 API
> already provides this. But the question is whether I should create a new
> "unicode" datatype (just by copying the native string datatype) or
> directly change the character set of the native string datatype. This
> would imply that every string in the HDF5 file has to be stored UTF-8
> encoded. I don't know all the impact of this modification yet. If people
> have strictly used the 7-bit ASCII charset so far, there won't be any
> conflicts with UTF-8. But I don't think this is the case, especially in
> Europe.
>
> Then, when it comes to the implementation of the attribute set, I
> noticed that the attribute data is preferably collected into a numpy
> array and then passed to the function which does the actual interfacing
> with the hdf5 api (_g_setAttr). I see that this approach has the
> advantage of avoiding handling too many special cases for the different
> possible python data types. However, it means that a unicode string has
> to be encoded (for numpy), decoded and then encoded again (for HDF5,
> which is UTF-8). Since setting an attribute is not (and should not be)
> performance critical, I would leave it like that, and only minimal
> changes to the code are necessary.

Yes, this is the approach that I'd take. NumPy uses UCS-4 to encode
unicode data internally, while HDF5 uses UTF-8. So it's only a matter of
encoding into UTF-8 for writing and decoding for reading. You can use the
codecs module in Python to facilitate this task (I suppose that you will
need a unicode-aware mail user agent to read this):

In [1]: import codecs

In [2]: import numpy as np

Let's create a Unicode array:

In [3]: arr = np.array([u"açò i allò"])

Now, get the codec object for UTF-8:

In [4]: c = codecs.lookup('utf-8')

In [8]: c.encode(arr[0])
Out[8]: ('a\xc3\xa7\xc3\xb2 i all\xc3\xb2', 10)

The value to pass into HDF5 is the first element of this tuple:

In [9]: value_to_write = c.encode(arr[0])[0]

In [10]: value_to_write
Out[10]: 'a\xc3\xa7\xc3\xb2 i all\xc3\xb2'

And for reading, it is just a matter of using .decode() for UTF-8:

In [11]: c.decode(value_to_write)
Out[11]: (u'a\xe7\xf2 i all\xf2', 13)

In [12]: read_value = c.decode(value_to_write)[0]

In [13]: read_value
Out[13]: u'a\xe7\xf2 i all\xf2'

which can be converted back into a NumPy array:

In [14]: np.array([read_value])
Out[14]: array([u'a\xe7\xf2 i all\xf2'], dtype='<U10')

We can check that, after the roundtrip, this is the same as the original
array:

In [15]: arr
Out[15]: array([u'a\xe7\xf2 i all\xf2'], dtype='<U10')
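In the attribute-handling code this could boil down to a pair of tiny
helpers around the write and read paths. Here is a minimal sketch in
plain Python; the helper names _encode_attr and _decode_attr are only
illustrative assumptions on my part, not the actual PyTables internals:

import codecs
import numpy as np

_utf8 = codecs.lookup('utf-8')

def _encode_attr(value):
    # Illustrative helper (not the real PyTables code): turn a Python
    # unicode attribute value into the UTF-8 byte string that would be
    # handed over to the HDF5 layer.
    if isinstance(value, unicode):
        return _utf8.encode(value)[0]
    return value

def _decode_attr(raw):
    # Illustrative helper: decode the UTF-8 bytes read back from HDF5
    # into a Python unicode object again.
    return _utf8.decode(raw)[0]

# Round trip, mirroring the session above:
original = np.array([u"açò i allò"])
stored = _encode_attr(original[0])        # what would be passed to HDF5
restored = np.array([_decode_attr(stored)])
assert (restored == original).all()

With something like this in place, only the spots that already
special-case string attributes on write and read would need to change.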
That might seem a bit complicated, but thanks to the codecs module it is
not.

Luck!

> On 12.08.2010 13:57, Francesc Alted wrote:
> > 2010/8/12, Uche Mennel <men...@even-ag.ch>:
> >> Ok, I agree that the resulting variable length representation of UTF-8
> >> encoded strings is not optimal for storing large blocks of data (btw. I
> >> didn't even know that numpy supports unicode at all). Actually, I was
> >> talking about supporting UTF-8 strings in meta data, e.g. attribute
> >> names and values. I don't see any reason why UTF-8 encoded strings should
> >> be problematic there. And, since it is natively supported by HDF5, it
> >> should be easy to implement, I guess.
> >
> > Ahh, definitely, supporting UTF-8 in attributes would be relatively easy.
> >
> >> As I already explained, that is not my intention. I would rather be
> >> interested in a patch for storing meta data strings as unicode strings.
> >
> > Ok then. So have a look at how different datatypes are implemented in
> > the AttributeSet extension in hdf5Extension.pyx to get an idea on what
> > you should touch to implement this. Tell me if you get into
> > difficulties.
> >
> >> Btw. I think that creating VLUTF8Atom would give rise to renaming
> >> VLUnicodeAtom to VL<the current encoding>Atom and that would imply a
> >> major change of the API.
> >
> > True. I should think a bit more about this (in case we want to really
> > implement it).
> >
> > Best,

-- 
Francesc Alted