Hi Jeff,

Definitely, I'd agree that Samba could perform better here, but I'm not 
very knowledgeable about Samba, sorry.  I'm CC'ing the PyTables 
users' list, just in case somebody else wants to chime in.

Not sure about the string limitation in numexpr.  Could you please be a 
bit more explicit?  Perhaps that should be fixed.
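
For anyone following along on the list, the batching workaround Jeff describes below might be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the column name, and the 256-character budget are hypothetical (the budget is simply the figure Jeff quotes), and whether numexpr's limit really is on total string length has not been confirmed. The values are passed through `condvars` rather than spliced into the condition text.

```python
def _build(column, batch):
    # Bind each value to a variable name (v0, v1, ...) instead of
    # embedding string literals in the condition text.
    condvars = {f"v{i}": value for i, value in enumerate(batch)}
    condition = " | ".join(f"({column} == {name})" for name in condvars)
    return condition, condvars


def batched_conditions(column, values, max_total_len=256):
    """Yield (condition, condvars) pairs, one per batch of values.

    max_total_len is a hypothetical budget on the summed length of the
    values in one batch. A single value longer than the budget still
    gets a batch of its own.
    """
    batch, used = [], 0
    for value in values:
        if batch and used + len(value) > max_total_len:
            yield _build(column, batch)
            batch, used = [], 0
        batch.append(value)
        used += len(value)
    if batch:
        yield _build(column, batch)


def read_matching(table, column, values, max_total_len=256):
    """Run one in-kernel query per batch and return the result arrays.

    `table` is assumed to be an open tables.Table. Duplicated values
    are dropped first so that no row is selected twice.
    """
    unique = sorted(set(values))
    return [
        table.read_where(condition, condvars=condvars)
        for condition, condvars in batched_conditions(
            column, unique, max_total_len
        )
    ]
```

The arrays returned by `read_matching` come back grouped by batch rather than in table order, so they would still need concatenating (and sorting, if order matters) on the Python side. `read_where` is the PyTables 3.x spelling; the 2.x releases name the same method `readWhere`.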

Francesc

On Sunday 17 April 2011 05:35:23 you wrote:
> Francesc,
> 
> I have a particular use case with pytables that is not covered in the
> optimization section of the user manual; thought you might want to
> take a look and potentially include in your benchmarks
> 
> I have a number of files in which each row holds a datetime, a
> string, and a float array. I generally write/append the data on a
> central server (64-bit Linux) onto local disk; reading is done over a
> LAN by Linux machines via NFS and by Windows machines via Samba.
> 
> I do in-kernel searches to select on the datetimes. (Unfortunately I
> cannot seem to do in-kernel selections on my strings if I have a lot
> of strings to sub-select on. I limit the in-kernel search to, say,
> 256 characters of total string length, and otherwise bring the data
> into Python space and filter there. This avoids a limitation in
> numexpr, which I believe is on the total length of the strings, but
> this is a side issue.)
> 
> My sample file with no compression is 6M rows, about 450 MB.
> 
> Reading across the Samba link took about 25.5 s (clock time,
> including some object creation at the end, since I use pandas to hold
> the resulting dataset) for about 1/3 of the dataset with an in-kernel
> search.
> 
> Across the NFS link it is faster, about 16 s.
> Locally it is even faster of course, about 15 s.
> 
> I first tried zlib (level 9) and then blosc (level 9).
> 
> What is interesting is that my Samba link read is now about 23 s for
> blosc (and of course the file size is about 1/3).
> 
> (NFS is slightly less and local read is slightly more, a result of
> the decompression overhead.)
> 
> So a pretty good improvement in read speed across the Samba share
> (note that the OS must be doing some caching here: the first read
> across Samba is way slower; the times I have given are for the
> second and subsequent reads).
> 
> These speeds are fine for me as I usually read this data on my
> program startup
> 
> This is an even more extreme example of CPU vs disk speed (in this
> case the disk is much slower than a 'real' disk, as it is really a
> network), and so compression really wins here!
> 
> So this is a weird result: different clients of the same dataset,
> served by the same machine, want the dataset stored in different
> ways in order to optimize read speed!
> 
> Any thoughts on how to optimize a Samba share for this?
> 
> 
> Thanks
> 
> Jeff

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
