[Pytables-users] how to store data? advice needed

Gerrit Holl Mon, 18 Oct 2010 07:39:37 -0700

Hi,

This is not really a PyTables question, but I think there is expertise
here and I will probably be using PyTables, so I ask anyway.


I am a PhD student in atmospheric remote sensing. My work is mostly
data analysis on quite large amounts of data. Until now, I have been
using Matlab, but I'm getting more and more fed up with it, so I want
to go back to my childhood love, Python.

My work consists of collocating different satellite datasets and then
using the collocations for various purposes. For example, I collocate
level 2B data from the CloudSat Cloud Profiling Radar with radiances
from the Atmospheric Microwave Sounding Unit and the Microwave
Humidity Sounder; those interested can find more information in: Holl,
G., Buehler, S. A., Rydberg, B., and Jiménez, C.: Collocating
satellite-based radar and radiometer measurements – methodology and
usage examples, Atmos. Meas. Tech., 3, 693-708,
doi:10.5194/amt-3-693-2010, 2010. Available online at
http://www.atmos-meas-tech.net/3/693/2010/amt-3-693-2010.html

As a side effect of deciding to look into Python, I came across
PyTables and started thinking about strategies for data storage. In
Earth observation, measurements are usually stored per orbit. I store
my collocations per day, in gzipped NetCDF files. If I want to get all
collocations that are nadir-looking, equatorial and have brightness
temperatures in a specified range, for one year of data, I need to
uncompress a .nc.gz file for every day of the year, read the data,
select the data that I need, and go on. The files are stored on a
server, and it may take 10 minutes to select ten thousand measurements
out of ten million in total. Most of the time is spent reading files
from the network and uncompressing them.

Reading about PyTables gives me the impression that it can be done
better. Am I right? How? If I were to use PyTables, how would I best
store my data? How large should the chunks be? One file per day? One
file for everything? Something in between? If I have one
file for everything, it's less easy to take a few files "home" to
experiment with. With one file per day, on the other hand, if I want
data only for a certain part of the year, I don't even need to look
into the directories containing data for other periods. But to select
equatorial measurements throughout the year, I need to uncompress every
single file. Which way to go? What are the pros and cons of storing
larger or smaller chunks in one file?
If it's best to store it in one big file, why does nobody in the
remote sensing community appear to be doing that? It's always one file
per orbit for level-1 and level-2 data, one per day or one per month
for level-3 (gridded data)... are we all being inefficient, or am I
missing something? Do all answers depend strongly on the use case?
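
To make the "one file for everything" option concrete, here is a
minimal sketch of what I have in mind. It assumes the PyTables 3.x
function names (open_file, create_table, read_where) and NumPy. The
column layout (time, lat, scan_angle, bt), the thresholds for
"equatorial" and "nadir-looking", and the randomly generated data are
all made up for illustration; real collocations carry many more fields.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical column layout, chosen only for this illustration.
COLLOCATION_DTYPE = np.dtype([
    ("time", "f8"),        # seconds since the start of the year
    ("lat", "f4"),         # latitude in degrees north
    ("scan_angle", "f4"),  # degrees away from nadir
    ("bt", "f4"),          # brightness temperature in K
])


def make_fake_collocations(n, seed=0):
    """Generate n random rows standing in for one year of collocations."""
    rng = np.random.default_rng(seed)
    data = np.empty(n, dtype=COLLOCATION_DTYPE)
    data["time"] = np.sort(rng.uniform(0, 365 * 86400, n))
    data["lat"] = rng.uniform(-90, 90, n)
    data["scan_angle"] = rng.uniform(-50, 50, n)
    data["bt"] = rng.uniform(150, 300, n)
    return data


def write_table(path, data):
    """Store all rows in one compressed table and index the latitude column."""
    filters = tables.Filters(complevel=5, complib="zlib")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table(
            "/", "collocations",
            description=COLLOCATION_DTYPE,
            filters=filters,
            expectedrows=len(data),
        )
        table.append(data)
        table.flush()
        table.cols.lat.create_index()


def select(path, bt_min, bt_max):
    """Return rows that are near-equatorial, near-nadir and within a BT range."""
    # Assumed thresholds: within 10 degrees of the equator, 1 degree of nadir.
    condition = (
        "(lat > -10) & (lat < 10) & "
        "(scan_angle > -1) & (scan_angle < 1) & "
        "(bt >= bt_min) & (bt <= bt_max)"
    )
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.collocations
        return table.read_where(
            condition, condvars={"bt_min": bt_min, "bt_max": bt_max}
        )


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "collocations.h5")
    write_table(path, make_fake_collocations(100000))
    hits = select(path, 200.0, 250.0)
    print(len(hits), "matching rows")
```

The selection is evaluated inside PyTables on the compressed table, so
only the matching rows come back as a NumPy array. Whether this beats
per-day files over a network share probably depends on the access
pattern; the sketch only shows what the query side could look like.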

Discussion or pointers to tutorials on the subject very welcome :)

Gerrit.
