Hello Edward,

I'd like to respond point by point:

On Tue, Dec 6, 2011 at 2:54 PM, PyTables Org <pytab...@googlemail.com> wrote:

> 1. There seems to be an unpythonic design choice with the start, stop, step
>   convention for PyTables.  Anything that is unnatural to a Python
>   programmer should be heavily documented.
>
> Agreed in general.  Do you have a specific example we could address?


> 2. There may be a bug in itersorted.
>
>
Yes, this looks like a bug.  This deserves an issue on GitHub...


>
> Here is code for (1) and (2):
> ----
> #! /usr/bin/env python
>
> import random, tables
>
> h5file = tables.openFile('mytable.h5', mode='w')
>
> class silly_class(tables.IsDescription):
>    num = tables.Int32Col(pos=0)
>
> mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
>         'a few ints', expectedrows=4)
>
> row = mytable.row
> for i in range(10):
>    row['num'] = random.randint(0, 99)
>    row.append()
> mytable.flush()
> mytable.cols.num.createCSIndex()
>
> # Python's idiom for start, stop, step:
> print 'Python:', range(9, -1, -1)
>
> output = mytable.readSorted('num', start=0, stop=10, step=-1)
> print 'readSorted:', 0, 10, -1, output
>
> # copy supports a negative step.  It seems that start and stop are applied
> # _after_ the sort is done.  Very unlike Python.  Please document
> # thoroughly.
>
We could certainly add some text to the docstring of Table.copy().  Still,
I guess I am missing how this is 'wrong.'  To the best of my knowledge,
Python itself has no single function that both sorts and slices.  (Please
correct me if I am wrong ~_~.)  When performing both operations, one needs
to be done first.  However, you are correct that this could be better
documented.


> print h5file.root.mytable[:]
> h5file.root.mytable.copy(h5file.root, 'mytable2', sortby='num',
>             start=0, stop=5, step=-1)
> print h5file.root.mytable2[:]
>
> # The following raises an OverflowError.  The documentation (2.3.1) says
> # negative steps are supported for itersorted.  Documentation error or bug
> # in itersorted?
> output = [x['num'] for x in mytable.itersorted('num',
>             start=0, stop=10, step=-1)]
> print 'itersorted:', 0, 10, -1, output
> ----
> 3. Null bytes are stripped from the end of strings when they are stored in a
>   table.  Since a Python programmer does not expect this, it needs to be
>   explicitly documented in all the relevant places.  Here is some code:
>
This is a function of the underlying HDF5 storage mechanism, not of
PyTables explicitly.  When storing fixed-length strings of length N, the
character array that is written *must* be exactly N bytes long.  When
serializing a string of length M, HDF5 does the following:

1. M > N:  truncate the string at N bytes (chop off the end).
2. M == N: do nothing.
3. M < N:  pad the character array with N - M null characters to reach
length N.

Because of this technique, all trailing null characters are dropped when
deserializing.  This supports the much more common use case of storing
shorter strings in a longer buffer while only wanting to recover the
shorter version.

If you wanted to preserve trailing null bytes, you could always store the
Python length (M) in another column.
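The padding/truncation rules above can be sketched in plain Python (this
models the HDF5 behavior; it is not PyTables code):

```python
def serialize_fixed(s, n):
    """Model HDF5 fixed-length string storage: truncate or null-pad to n bytes."""
    if len(s) > n:
        return s[:n]                   # M > N: truncate at n bytes
    return s + '\x00' * (n - len(s))   # M <= N: null-pad (no-op when M == N)

def deserialize_fixed(buf):
    """Model deserialization: all trailing null characters are dropped."""
    return buf.rstrip('\x00')

stored = serialize_fixed('abcdef\x00\x00', 16)
print(repr(deserialize_fixed(stored)))  # 'abcdef' -- the trailing nulls are gone
```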

>  ----
> #! /usr/bin/env python
>
> import tables
>
> def hash2hex(stringin):
>    out = list()
>    for c in stringin:
>        s = hex(ord(c))[2:]
>        if len(s) == 1:
>            s = '0' + s
>        out.append(s)
>    return ''.join(out)
>
> h5file = tables.openFile('mytable.h5', mode='w')
>
> class silly_class(tables.IsDescription):
>    astring = tables.StringCol(16, pos=0)
>
> mytable = h5file.createTable(h5file.root, 'mytable', silly_class,
>         'a few strings', expectedrows=4)
>
> # Problem when string ends with null bytes:
> nasty = 'abdcef' + '\x00\x00'
> print repr(nasty)
> print hash2hex(nasty)
>
> row = mytable.row
> row['astring'] = nasty
> row.append()
> mytable.flush()
> print repr(mytable[0][0])
> print hash2hex(mytable[0][0])
> h5file.close()
> ----
> 4. Has the 64K limit for attributes been lifted?
>
>
No, unfortunately.  Once again, this is a compile-time parameter of HDF5.
You could change this value and recompile HDF5, but then any h5 file you
create would not be portable to other versions of HDF5.  Trust me, you are
not the only one who wishes this were a run-time variable.  (Still, there
are good reasons for it being static, i.e. speed and size.)


>
> 5. The reference manual for numpy contains _many_ small examples.  They
>   partially compensate for any lack of precision or excessive precision in
>   the documents.  Also many people learn best from examples.
>
>
If you would like to write up some additional examples or contribute to
the docs in any way, *please* let me know.  We would be ecstatic for your
help!


>
> 6. Suppose that the records (key, data1) and (key, data2) are two rows in a
>   table with (key, data1) being a earlier row than (key, data2).  Both
>   records have the same value in the first column.  If a CSIndex is created
>   using the first column, will (key, data1) still be before (key, data2) in
>   the index?  This property is called "stability".  Some sorting algorithms
>   guarantee this and others don't.  Are the sorts in PyTables stable?
>
> I am unsure about the stability of the sorts.  I defer to Francesc here.
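For reference, Python's own built-in sort is guaranteed stable, so equal
keys keep their original relative order.  (Whether the CSIndex sort in
PyTables shares that guarantee is the separate question deferred above.)
A plain-Python illustration:

```python
# Python's sorted() is guaranteed stable: rows with equal keys keep
# their original relative order.
rows = [('key', 'data1'), ('key', 'data2'), ('aaa', 'data3')]
by_key = sorted(rows, key=lambda r: r[0])
print(by_key)  # ('key', 'data1') still precedes ('key', 'data2')
```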


>
> 7. The table.append in PyTables behaves like extend in Python. Why?
>
I am assuming you mean Python's list.extend() method.  This is likely
because the use case of appending a single row is uncommon, so having
another method for it is unnecessary.  Also, I think you can get
list.append()-like behavior out of the row interface.
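The analogy can be sketched with plain Python lists (a sketch, not
PyTables code): Table.append() takes a *sequence* of rows, just as
list.extend() takes a sequence of items, so a single row goes in as a
one-element sequence:

```python
table = []  # stands in for a Table's rows

def table_append(rows):
    """Model Table.append(): accepts a sequence of rows, like list.extend()."""
    table.extend(rows)

table_append([(1,), (2,)])   # appending several rows at once
table_append([(3,)])         # a single row is still wrapped in a sequence
print(table)  # [(1,), (2,), (3,)]
```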

>
> 8. I get a mysterious PerformanceWarning from the PyTables file "table.py",
>   line 2742. This message needs to be split into two messages.  In my case,
>   after I appended to a table, "'row' in self.__dict__" was True and
>   "self.row._getUnsavedNrows()" was 1.  To resolve the problem, I added a
>   line that flushes the table after every append.  Does
>   h5file.mytable.flush() do something that h5file.flush() doesn't?  Do I
>   need to flush every table after every append or are there only certain
>   situations when this is needed?  What does "preempted from alive nodes"
>   mean?
>
Flushing the file will flush the whole file and all of its nodes.
Flushing a node will flush just that node.

How often you need to flush depends on how much memory you have, how
expensive communication with the processor is, etc.


>
> 9. Does the following code contain a bug in PyTables?
>
>
Yes, there is a blosc error we are aware of.

Thanks a lot for your constructive feedback.  Seriously, if there are
things you would like to see changed in PyTables and want to help out, we
try to be very receptive.  Pull requests on GitHub are a great way to get
our attention!

Be Well
Anthony


> ----
> #! /usr/bin/env python
>
> import sys
> import numpy, tables
>
> # No failure if projections and winsize are small enough.  In the original
> # program, gauss.shape is (2000, 196, 196).
> projections = 105
> winsize = 2500
> h5 = tables.openFile('mess.h5', mode='w')
> shape = (projections, winsize)
>
> # If  "complib='blosc'" and "fletcher32=True" failure occurs. Other
> # combinations work.
> filters = tables.Filters(complevel=1, complib='blosc', fletcher32=True)
> h5.createCArray(h5.root, 'gauss', tables.Float64Atom(),
>           shape, title='gauss', filters=filters)
> print 'chunkshape:', h5.root.gauss.chunkshape  # (3, 2500)
> sys.stdout.flush()
> #h5.root.gauss.flush()  # Flushes make no difference.
>
> zeros = numpy.zeros( (winsize,), numpy.float64)
> for i in range(projections):
>    h5.root.gauss[i, :] = zeros
> #h5.root.gauss.flush()
>
> xx = h5.root.gauss[0, :]
> print 'xx.shape:', xx.shape, 'gauss.shape', h5.root.gauss.shape
> ----
------------------------------------------------------------------------------
Cloud Services Checklist: Pricing and Packaging Optimization
This white paper is intended to serve as a reference, checklist and point of 
discussion for anyone considering optimizing the pricing and packaging model 
of a cloud services business. Read Now!
http://www.accelacomm.com/jaw/sfnl/114/51491232/
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
