2011/12/6 Anthony Scopatz <scop...@gmail.com>

> This is a function of the underlying HDF5 storage mechanism and not
> explicitly PyTables.
> When storing fixed length strings, the array of characters it is converted
> to *must* be exactly
> length-N.  When serializing a string of length-M, HDF5 does the following:
>
> 1.  M > N: truncate the string at N bytes (chop off the end).
> 2. M == N: do nothing.
> 3.  M < N:  pad the character array with N - M null characters to achieve
> length N.
>
> Because of this technique, all trailing null characters are dropped when
> deserializing.  This supports the much more common use case of storing a
> shorter string in a longer buffer but wanting to recover only the shorter
> version.
>

What you are saying is correct, except that the culprit for dropping the
trailing null characters is NumPy, not HDF5.  Look at this:

In [27]: import numpy as np

In [28]: np.array(["aaa"])
Out[28]:
array(['aaa'],
      dtype='|S3')

In [29]: np.array(["aaa\x00\x00"])
Out[29]:
array(['aaa'],
      dtype='|S5')

Of course, this behaviour of NumPy was discussed a long time ago, during its
introduction (around NumPy 1.0 or so, back in 2006), and people (especially
Travis) found it to be the most convenient for the majority of use cases.
If you are interested in recovering the trailing bytes, you can always do:

In [53]: a = np.array(["aaa\x00\x00"])

In [54]: a[0]
Out[54]: 'aaa'

In [55]: "".join([chr(i) for i in a.view('b')])
Out[55]: 'aaa\x00\x00'
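
Or, since the array buffer is contiguous, you can dump it directly with
tostring() (note that this returns the buffer of the *whole* array, so with
more than one element you would get all the strings concatenated):

In [56]: a.tostring()
Out[56]: 'aaa\x00\x00'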



> If you wanted to append null bytes to the end of the string, you could
> always store the
> python length (M) in another column.
>

Sure, this is another option.
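
Something along these lines should do it (an untested sketch; the column
names and sizes here are just for illustration):

import tables

class Record(tables.IsDescription):
    name = tables.StringCol(16, pos=0)    # fixed-length buffer (N = 16)
    namelen = tables.Int32Col(pos=1)      # original Python length (M)

h5f = tables.openFile('strings.h5', mode='w')
tbl = h5f.createTable('/', 'strings', Record)

row = tbl.row
s = 'aaa\x00\x00'
row['name'] = s
row['namelen'] = len(s)
row.append()
tbl.flush()

# On read, pad the (null-stripped) value back to its original length.
r = tbl[0]
original = r['name'].ljust(r['namelen'], '\x00')

h5f.close()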

>
>> 6. Suppose that the records (key, data1) and (key, data2) are two rows in a
>>   table with (key, data1) being an earlier row than (key, data2).  Both
>>   records have the same value in the first column.  If a CSIndex is created
>>   using the first column, will (key, data1) still be before (key, data2) in
>>   the index?  This property is called "stability".  Some sorting algorithms
>>   guarantee this and others don't.  Are the sorts in PyTables stable?
>>
> I am unsure about the stability of the sorts.  I defer to Francesc here.
>

Nope, the internal sorts in PyTables use the 'quicksort' algorithm found in
NumPy (the sort() method), and that algorithm is not stable.  To achieve a
'stable' sort one should use 'mergesort' instead, but 1) it is slower and
2) it uses considerably more memory.
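
If you really need a stable ordering and the key column fits in memory, you
can always compute it with NumPy itself and work with the resulting row
numbers.  A small sketch:

import numpy as np

keys = np.array([3, 1, 3, 1, 2])

# 'mergesort' is stable: rows with equal keys keep their original order.
stable_order = np.argsort(keys, kind='mergesort')

# 'quicksort' (the default, and what the internal sorts use) makes no such
# guarantee for equal keys.
fast_order = np.argsort(keys, kind='quicksort')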


>
>> 7. The table.append in PyTables behaves like extend in Python. Why?
>>
> I am assuming you mean Python's list.extend() method.  This is likely
> true because the use case of appending a single row is uncommon and so
> having another method for it is unnecessary.  Also I think you can get
> list.append() behavior out of the row interface.
>

Yep.  I considered having two different 'append' and 'extend' methods to be
a bit redundant, so I decided to stick with just one.
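
For the record, the row interface already gives you the single-record
'append' for free.  A quick sketch (column names made up):

import tables

class Record(tables.IsDescription):
    key = tables.StringCol(8, pos=0)
    value = tables.Int32Col(pos=1)

h5f = tables.openFile('example.h5', mode='w')
tbl = h5f.createTable('/', 'example', Record)

# Single record, via the row interface (the list.append() analogue).
row = tbl.row
row['key'] = 'spam'
row['value'] = 1
row.append()

# Many records at once, via Table.append() (the list.extend() analogue).
tbl.append([('eggs', 2), ('ham', 3)])

tbl.flush()
h5f.close()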


>
>> 8. I get a mysterious PerformanceWarning from the PyTables file
>> "table.py",
>>   line 2742. This message needs to be split into two messages.  In my
>> case,
>>   after I appended to a table, "'row' in self.__dict__" was True and
>>   "self.row._getUnsavedNrows()" was 1.  To resolve the problem, I added a
>>   line that flushes the table after every append.  Does
>>   h5file.mytable.flush() do something that h5file.flush() doesn't?  Do I
>>   need to flush every table after every append or are there only certain
>>   situations when this is needed?  What does "preempted from alive nodes"
>>   mean?
>>
> Flushing the file will flush the whole file and all nodes.  Flushing a
> node will simply flush that node.
>
> The rules for how often you need to flush depend on how much memory you
> have, how expensive
> communication to the processor is, etc.
>

Exactly.  What this warning is saying is "Hey, you are trying to write too
much data without flushing, please use flush() from time to time."  You
can, however, enlarge your cache limits (see
http://pytables.github.com/usersguide/parameter_files.html#cache-limits) if
you don't like these messages (but do that at your own risk!).
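
For example, a pattern like this keeps the warning away (a sketch; the
table layout, chunk size and flushing frequency are made up):

import numpy as np
import tables

h5f = tables.openFile('big.h5', mode='w')
tbl = h5f.createTable('/', 'data', {'x': tables.Float64Col()})

chunk = np.zeros(10000, dtype=[('x', 'f8')])
for i in range(1000):
    tbl.append(chunk)
    if i % 100 == 0:
        tbl.flush()    # flush only this table, not the whole file

tbl.flush()
h5f.close()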


>
>> 9. Does the following code contain a bug in PyTables?
>>
>>
> Yes, there is a blosc error we are aware of.
>

Fortunately, it is not an error in Blosc itself, but in the Blosc HDF5
driver, which does not interact well with the 'fletcher32' filter.  See:

https://github.com/PyTables/PyTables/issues/21

Until this is fixed, the workaround is to not use fletcher32 (do you need it
for something special?).
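
In the meantime, something like this should work just fine (sketch):

import tables

# Use Blosc for compression, but leave fletcher32 disabled until the issue
# above is fixed.
filters = tables.Filters(complevel=5, complib='blosc', fletcher32=False)
h5f = tables.openFile('data.h5', mode='w', filters=filters)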


>
> Thanks a lot for your constructive feedback.  Seriously, if there are
> things which you would like to
> see changed in PyTables and want to help out, we try to be very receptive.
>  Pull requests on github
> are a great way to get our attention!
>

Also, please take some time to subscribe to the list if you want to avoid
Josh having to resend your posts every time you write.  Thanks.

-- 
Francesc Alted