Hi,

This is not really a PyTables question, but I think there is expertise here and I will probably be using PyTables, so I am asking anyway.
I am a PhD student in atmospheric remote sensing. My work is data analysis on quite large amounts of data. Until now, I have been using Matlab, but I'm getting more and more fed up with it, so I want to go back to my childhood love, Python.

My work consists of collocating different satellite datasets and then using the collocations for various purposes. For example, I collocate level 2B data from the CloudSat Cloud Profiling Radar with radiances from the Atmospheric Microwave Sounding Unit and the Microwave Humidity Sounder. Those interested can find more information in:

Holl, G., Buehler, S. A., Rydberg, B., and Jiménez, C.: Collocating satellite-based radar and radiometer measurements – methodology and usage examples, Atmos. Meas. Tech., 3, 693-708, doi:10.5194/amt-3-693-2010, 2010. Available online at http://www.atmos-meas-tech.net/3/693/2010/amt-3-693-2010.html

As a side effect of deciding to look into Python, I came across PyTables and started thinking about strategies for data storage. In earth observation, measurements are usually stored per orbit. I store my collocations per day, in gzipped NetCDF files. If I want to get all collocations that are nadir-looking, equatorial and have brightness temperatures in a specified range, for one year of data, I need to uncompress a .nc.gz file for every day of the year, read the data, select the data that I need, and go on to the next file. Because the files are stored on a server, it may take 10 minutes to select ten thousand measurements out of ten million in total. Most of the time is spent reading files from the network and uncompressing them.

Reading about PyTables gives me the impression that this can be done better. Am I right? How?

Suppose I were to use PyTables, how would I best store my data? In how large chunks should I store it? One file per day? One file for everything? Something in between? If I have one file for everything, it's less easy to take a few files "home" to experiment with.
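To make the question concrete, here is a minimal sketch of how I imagine the selection might look with PyTables. It is a sketch under assumptions, not a tested setup: the column names (day, lat, scan_angle, bt), the thresholds for "equatorial" and "nadir", the file name and the random stand-in data are all made up for illustration, and my real collocations have many more fields.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical row layout, invented for this sketch.
ROW_DTYPE = np.dtype([
    ("day", "i4"),         # day of year, 1..365
    ("lat", "f4"),         # latitude in degrees north
    ("scan_angle", "f4"),  # viewing angle in degrees, 0 = nadir
    ("bt", "f4"),          # brightness temperature in K
])


def write_demo_file(path, n_rows=100000, seed=0):
    """Write one zlib-compressed HDF5 table filled with random stand-in data."""
    rng = np.random.default_rng(seed)
    rows = np.zeros(n_rows, dtype=ROW_DTYPE)
    rows["day"] = rng.integers(1, 366, size=n_rows)
    rows["lat"] = rng.uniform(-90.0, 90.0, size=n_rows)
    rows["scan_angle"] = rng.uniform(-50.0, 50.0, size=n_rows)
    rows["bt"] = rng.uniform(150.0, 300.0, size=n_rows)

    filters = tables.Filters(complevel=5, complib="zlib")
    with tables.open_file(path, mode="w", filters=filters) as h5:
        table = h5.create_table("/", "collocations", description=ROW_DTYPE,
                                expectedrows=n_rows)
        table.append(rows)


def select_equatorial_nadir(path, bt_min, bt_max):
    """Return rows near the equator, near nadir, with bt_min < bt < bt_max.

    The +/-10 degree and +/-1 degree limits are arbitrary placeholders.
    """
    condition = ("(lat > -10) & (lat < 10) & "
                 "(scan_angle > -1) & (scan_angle < 1) & "
                 "(bt > bt_min) & (bt < bt_max)")
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.collocations
        # read_where evaluates the condition while scanning the table and
        # returns only the matching rows as a NumPy structured array.
        return table.read_where(
            condition, condvars={"bt_min": bt_min, "bt_max": bt_max})


demo_path = os.path.join(tempfile.mkdtemp(), "collocations_demo.h5")
write_demo_file(demo_path)
hits = select_equatorial_nadir(demo_path, bt_min=200.0, bt_max=250.0)
print(len(hits), "rows selected")
```

If something like this works, the condition is evaluated while the table is read piece by piece, so the full year never has to be held in memory at once. Whether this actually beats one gzipped file per day over the network is exactly what I am unsure about.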
With per-day files, if I want data only for a certain part of the year, I don't even need to look into the directories containing data for other periods. But to select equatorial measurements throughout the year, I need to uncompress every single file. Which way to go? What are the pros and cons of storing larger or smaller chunks in one file?

If it's best to store everything in one big file, why does nobody in the earth observation community appear to be doing that? It's always one file per orbit for level-1 and level-2, and one per day or one per month for level-3 (gridded data)... Are we all being inefficient, or am I missing something? Do all answers depend strongly on the use case?

Discussion or pointers to tutorials on the subject are very welcome :)

Gerrit.

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users