2012/1/2 Gael Varoquaux <gael.varoqu...@normalesup.org>:
> On Mon, Jan 02, 2012 at 06:44:54PM +0100, Francesc Alted wrote:
>> Perhaps you may get a bit more performance if you use the
>> `[read,write]_vl_blosc2_hdf` functions that I have sent in my earlier
>> post, but that adds the python-blosc dependency (available at
>> http://pypi.python.org/pypi/blosc/1.0.3), so yeah, that might be a bit
>> 'exotic'.
>
> Yes, it is exotic, but it makes sense that it should be faster. Your
> code definitely looks well suited to what I want to do. I can see from
> your benchmarks that it manages to squeeze an extra 10 to 30% out of the
> Blosc-compressed CArrays that I tried. I didn't benchmark this strategy,
> because I didn't want to add a dependency on a package that does not
> already have a large installation base. Maybe I'll add the benchmarks to
> the blog post, if I find time (I am already way over time on this), but
> in any case, I'll point to your email on this mailing list.
>
> I don't think that you expose the blosc bindings in pytables. Thus to use
> this strategy, I would have to have blosc installed on all my computers.

That is correct.  python-blosc is an external library whose API
somewhat mirrors the zlib module in the Python standard library.
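For instance, a minimal sketch of the parallel (the array and
parameters here are just illustrative; blosc's typesize argument tells
its shuffle filter how wide each element is):

    import zlib
    import numpy as np
    import blosc  # http://pypi.python.org/pypi/blosc

    # Raw bytes to compress (float64 -> 8 bytes per element).
    data = np.random.rand(1000, 1000).tobytes()

    # The stdlib zlib call shape...
    packed_z = zlib.compress(data, 5)
    unpacked_z = zlib.decompress(packed_z)

    # ...and the blosc equivalent.
    packed_b = blosc.compress(data, typesize=8)
    unpacked_b = blosc.decompress(packed_b)

    assert unpacked_z == unpacked_b == data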

> For my own interest, and hopefully later inclusion of such a strategy in
> my codebase, is there a way of getting closer to such performance without
> requiring Python access to blosc?

Truth be told, Blosc is mainly useful for achieving speed-ups when
doing I/O to memory, not to disk (which is your main goal, right?).
So, if your interest is really disk I/O, I would not pay too much
attention to Blosc and would concentrate on zlib, which is available
everywhere and more than enough for these purposes (unless you have
disks with extraordinary throughput, like large disk cabinets or
RAIDs of SSDs, which are becoming quite affordable lately).
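For reference, picking the compressor in PyTables is just a matter of
the Filters object.  A minimal sketch (the file name and data are made
up, and I am using the PyTables 3.x method names; 2.x spells them
openFile/createCArray):

    import numpy as np
    import tables

    data = np.random.rand(1000, 1000)

    # zlib ships with every HDF5 installation, so the resulting file
    # stays readable on machines without Blosc installed.
    filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)

    with tables.open_file('example.h5', mode='w') as f:
        f.create_carray(f.root, 'data', obj=data, filters=filters)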

>> Also, as your datasets are pretty small, you may want to add a warning
>> about the fact that these benchmarks are mainly doing I/O against
>> memory, not disk.
>
> Right, I should point out that I am benchmarking only datasets that fit
> in memory, whereas pytables is especially interesting for datasets that
> do not fit in memory, isn't it?

Not exactly.  Datasets that fit in memory are also an important
target for PyTables; after all, Blosc was made for a reason :)

> But I am curious about your statement that I
> am mostly benchmarking memory, not disk. I have indeed observed that
> using a USB disk didn't affect performance much, and I was quite
> surprised. Do you have an explanation of why the disk bandwidth doesn't
> seem relevant?

Well, when you are writing, the OS puts all your data into memory
buffers before the disk writes actually complete.  If the data fits in
memory, the OS will report that it has finished the task while in
reality the data is still waiting to be written.  You can force the OS
to effectively complete all pending writes by issuing a 'sync' command.
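For instance, to make a write benchmark measure actual disk throughput,
you could do something along these lines (a rough sketch, Unix-only
because of the external 'sync' call; the file name and data are made
up):

    import subprocess
    import time

    import numpy as np
    import tables

    data = np.random.rand(1000, 1000)

    start = time.time()
    with tables.open_file('bench.h5', mode='w') as f:
        f.create_carray(f.root, 'data', obj=data,
                        filters=tables.Filters(complevel=5,
                                               complib='zlib'))
        f.flush()              # push PyTables/HDF5 buffers to the OS
    subprocess.call(['sync'])  # commit pending OS writes to disk
    print('write + sync: %.3f s' % (time.time() - start))

Without the final sync, the timing mostly reflects how fast the OS can
fill its memory buffers, which is why a USB disk barely changes the
numbers.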

For reading, see my next reply.

-- 
Francesc Alted
