Hi Francesc,

thank you for the very prompt reply; your explanation makes sense and  
I can see where this is coming from.

Most likely, I will modify our code to strip off any white space in  
the relevant places: this is not particularly nice but will guarantee  
backwards compatibility with older files. This is only relevant for  
small sets of metadata so it is not performance critical.
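
As a rough sketch of the stripping approach described above: the helper names `strip_padding` and `clean_row` are made up for illustration and are not part of PyTables. The sketch assumes that padding only ever appears at the end of a value and consists of spaces (as returned for the older files) or NUL characters. Note that it would also remove trailing spaces that were genuinely part of the stored data.

```python
def strip_padding(value):
    """Remove trailing fixed-width padding (spaces or NULs) from a string."""
    if isinstance(value, bytes):
        # Byte strings need a bytes argument to rstrip().
        return value.rstrip(b" \x00")
    return value.rstrip(" \x00")


def clean_row(row):
    """Return a copy of a row tuple with padding stripped from string fields."""
    return tuple(
        strip_padding(field) if isinstance(field, (str, bytes)) else field
        for field in row
    )
```

Applied to a row such as `(668, 'A               ')`, `clean_row` leaves the integer field alone and strips only the string field.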

In this context: is there any way to find out from a PyTables .h5 file
whether it was written with PyTables 1.x or PyTables 2.x? If I had
that information, I could use tidy and lean code for the newer files.
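
One possible route, offered as an assumption rather than a confirmed answer: PyTables is understood to record a format version in each file, exposed as the `format_version` attribute of an open file object. Whether files from the 1.x series really report a value starting with "1" should be checked on a real file before relying on it. The helper names below are hypothetical, and the sketch only covers the decision logic on the version string.

```python
def format_major(format_version):
    """Return the major number of a format-version string such as '1.6'."""
    if isinstance(format_version, bytes):
        format_version = format_version.decode("ascii")
    return int(str(format_version).split(".")[0])


def needs_padding_workaround(format_version):
    """True if the recorded format predates the 2.x series (assumption)."""
    return format_major(format_version) < 2


# Hypothetical usage, assuming the attribute exists on the open file:
#   h5f = tables.openFile('tbl1.h5')
#   if needs_padding_workaround(h5f.format_version):
#       ...strip whitespace from string fields...
```

With such a check, the whitespace stripping could be limited to the older files.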

Many thanks,

Hans


On 2 Jul 2008, at 13:29, Francesc Alted wrote:

> On Wednesday 02 July 2008, Hans Fangohr wrote:
>> Dear pytables developers,
>>
>> I have come across an oddity when using pytables 2.0 to read files
>> that were written with pytables 1.0, and would like your opinion
>> on this.
>>
>> What I consider a problem for my application is outlined below, and I
>> wanted to make sure that you are aware of this behaviour (it could be
>> a bug).
>
> Could be, but unfortunately, this is the intended behaviour.  It is not
> easy to explain, but I'll try below.
>
>>
>> I create two h5 files using the following piece of code (once run
>> with pytables 1.3.2 and once with pytables 2.0.3)::
>>
>>
>>    import numpy
>>    import tables
>>
>>    print "Running with tables version %s" % tables.__version__
>>
>>    master_version = int(tables.__version__[0])
>>
>>    if master_version == 1:
>>        fileName = 'tbl1.h5'
>>    elif master_version == 2:
>>        fileName = 'tbl2.h5'
>>    else:
>>        raise RuntimeError("Impossible")
>>
>>    h5f = tables.openFile(fileName, 'w')
>>
>>    if master_version == 1:
>>        class Particle(tables.IsDescription):
>>            name      = tables.StringAtom(length=16)
>>            idnumber  = tables.Int64Atom()
>>    else:
>>        class Particle(tables.IsDescription):
>>            name      = tables.StringCol(16)
>>            idnumber  = tables.Int64Col()
>>
>>    my_table = h5f.createTable(h5f.root, 'test', Particle,
>>                          "watch the white space")
>>
>>    particle = my_table.row
>>    particle['name']="A"
>>    particle['idnumber']=01234  # octal literal in Python 2: decimal 668
>>    particle.append()
>>
>>    particle['name']="BCD"
>>    particle['idnumber']=56789
>>    particle.append()
>>
>>    my_table.flush()
>>
>>    h5f.close()
>>
>>
>> Subsequently, I run the following piece of code with pytables 2.0::
>>
>>    import tables
>>
>>    print "Running with tables version %s" % tables.__version__
>>
>>    for fileName, version in [('tbl1.h5', 'tables1.3.2'),
>>                              ('tbl2.h5', 'tables2.0.3')]:
>>        print "Reading the file written with %s" % version
>>        h5f = tables.openFile(fileName)
>>        for row in h5f.root.test.iterrows():
>>            print row
>>        h5f.close()
>>
>>
>> I get the following output::
>>
>>
>>    Running with tables version 2.0.3
>>    Reading the file written with tables1.3.2
>>    (668, 'A               ')
>>    (56789, 'BCD             ')
>>    Reading the file written with tables2.0.3
>>    (668, 'A')
>>    (56789, 'BCD')
>>
>>
>> Note that the strings 'A' and 'BCD' are returned with whitespace
>> (filling up to the maximum length of the string) when reading the
>> file written with pytables 1.0 but not when reading the file written
>> with pytables 2.0.
>>
>> I believe that the desired behaviour is not to return the white
>> space.
> [snip]
>
> Yeah, but this is not easy to do.  The root of the problem is the
> different padding conventions used by numarray (the array package at
> the core of the PyTables 1.x series) and NumPy (the one at the core of
> the PyTables 2.x series).
>
> As you may already know, both numarray and NumPy implement string
> arrays as *fixed* length datatypes (mainly for performance reasons).
> So, if you have defined an array of strings with a length of up to,
> say, 4 chars per element, and you want to save, say, an empty string,
> you have to define how to fill in the remaining 4 chars.  And this is
> where numarray and NumPy critically diverged: numarray chose the
> Fortran convention and filled the unused space with *white spaces*,
> while NumPy chose the C convention and filled the unused space with
> NULL chars.
>
> Here is an example of the above:
>
> In [43]: nas = numarray.strings.array("", itemsize=4)
>
> In [44]: str(nas._data)
> Out[44]: '    '
>
> In [45]: nps = numpy.array("", dtype="S4")
>
> In [46]: str(nps.data)
> Out[46]: '\x00\x00\x00\x00'
>
> After realizing that this could lead to problems when files created
> with PyTables 1.x are read with PyTables 2.x, I tried to convince the
> numarray crew to change their default padding to the NULL char, but I
> had no success (they had to keep the Fortran convention for
> compatibility with a lot of code they had already written).
> However, they allowed me to introduce a new parameter, called 'padc',
> in the string array factory so that you can choose the padding char.
> Here is how it works:
>
> In [47]: padded_nas = numarray.strings.array("", itemsize=4, padc="\x00")
>
> In [48]: str(padded_nas._data)
> Out[48]: '\x00\x00\x00\x00'
>
> [See the thread:
> http://projects.scipy.org/pipermail/numpy-discussion/2005-January/003781.html
> for more info about the patch]
>
> So, perhaps, if you need to use PyTables 1.x in some places to create
> files that are to be read by PyTables 2.x in other places, instead of:
>
> particle['name'] = "A"
>
> you can write:
>
> particle['name'] = numarray.strings.array("A", itemsize=16, padc="\x00")
>
> this way, your files written with PyTables 1.x would be read correctly
> with PyTables 2.x (i.e. without the padding issue).
>
> However, if what you want is to be able to read existing PyTables 1.x
> files with PyTables 2.x without padding issues, then I'm afraid that
> you are out of luck.  In that case, your best bet would be to write a
> conversion tool yourself.  Incidentally, the 'correct' way to do this
> would be to hack the ptrepack utility, which comes with PyTables, so
> that it does the conversion automatically.
>
> Hope that helps,
>
> --
> Francesc Alted
> Freelance developer
> Tel +34-964-282-249
>
> -------------------------------------------------------------------------
> Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
> Studies have shown that voting for your favorite open source project,
> along with a healthy diet, reduces your potential for chronic lameness
> and boredom. Vote Now at http://www.sourceforge.net/community/cca08
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>



