Hi Gerrit,

You are asking a lot of questions that require a lot of experience, not only to explain, but also to understand correctly. My first piece of advice: if you really want to grasp what PyTables can do for you, start getting used to it with some code (for example, do the tutorials in the User's Manual).
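Just to show how little code it takes to get started, here is a minimal, untested sketch (the file name, node name and sample array are made up; I am using the PyTables 2.x spelling of the API, i.e. openFile/createArray):

import numpy as np
import tables

# Create a small sample array and store it in an HDF5 file.
a = np.arange(1000, dtype='float64')
f = tables.openFile('test.h5', mode='w')
f.createArray(f.root, 'my_array', a, title='A sample array')
f.close()

# Reading it back is just as short.
f = tables.openFile('test.h5', mode='r')
print f.root.my_array[:10]
f.close()

That is really all there is to saving and loading plain arrays.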
Regarding performance and compression, one of the strengths of PyTables/HDF5 is that you can use chunked datasets. Chunking makes it easy to extend potentially multidimensional datasets and, more importantly, to apply arbitrary filters to them. Filters are one of the most powerful weapons in HDF5's arsenal --for example, they let you use compression seamlessly on big datasets, without having to decompress a whole file before reading from it. And compression is not the only filter you can apply to your data (see the shuffle filter, for example).

But as I said, if you want to learn, you will have to do your homework first. Using PyTables to save some NumPy arrays is very easy. Getting the most out of PyTables, as with everything in life, is not that easy --although reading the "Optimization tips" chapter of the User's Guide may help to put you on the right track more quickly.
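To make the filters part concrete, here is another untested sketch in the same vein (the column names, file name and figures are all invented, and again in the 2.x API spelling). It creates a compressed, chunked table and then runs an in-kernel query against it, which is roughly the shape of your "equatorial measurements in a brightness temperature range" selection:

import numpy as np
import tables

# A made-up record layout; a real collocation dataset would have
# more fields.
class Collocation(tables.IsDescription):
    lat = tables.Float64Col()   # latitude [deg]
    lon = tables.Float64Col()   # longitude [deg]
    bt = tables.Float64Col()    # brightness temperature [K]

f = tables.openFile('collocations.h5', mode='w')

# zlib compression plus the shuffle filter; every chunk is compressed
# transparently on write and decompressed on demand on read.
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
table = f.createTable(f.root, 'collocations', Collocation,
                      filters=filters, expectedrows=10*1000*1000)

# Fill the table with random rows, just so there is something to query.
row = table.row
for i in xrange(100*1000):
    row['lat'] = np.random.uniform(-90, 90)
    row['lon'] = np.random.uniform(-180, 180)
    row['bt'] = np.random.uniform(150, 300)
    row.append()
table.flush()

# An in-kernel query: chunks are decompressed one at a time as they
# are scanned, so the whole table is never decompressed at once.
hits = table.readWhere('(lat > -5.0) & (lat < 5.0) & '
                       '(bt > 200.0) & (bt < 250.0)')
print '%d equatorial rows in range' % len(hits)
f.close()

Compare that query with gunzipping 365 files just to evaluate the same condition: here the selection happens inside the compressed file, chunk by chunk.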
One last piece of advice: start doing some basics with PyTables and then ask for more advice, with more focused questions. I'm sure you will find people here who can teach you nice things about this technology; you just have to help them to help you ;-)

Good luck,

Francesc

On Monday 18 October 2010 16:38:39, Gerrit Holl wrote:
> Hi,
>
> This is not really a PyTables question, but I think there is
> expertise here and I will probably be using PyTables, so I ask
> anyway.
>
> I am a PhD student in atmospheric remote sensing. I do data
> analysis and work with quite large amounts of data. Until now I
> have been using Matlab, but I'm getting more and more fed up with
> it, so I want to look back at my childhood love, Python.
>
> My work consists of collocating different satellite datasets and
> then using the collocations for various purposes. For example, I
> collocate level 2B data from the CloudSat Cloud Profiling Radar
> with radiances from the Advanced Microwave Sounding Unit and the
> Microwave Humidity Sounder; those interested can find more
> information in:
>
> Holl, G., Buehler, S. A., Rydberg, B., and Jiménez, C.: Collocating
> satellite-based radar and radiometer measurements – methodology and
> usage examples, Atmos. Meas. Tech., 3, 693-708,
> doi:10.5194/amt-3-693-2010, 2010. Available online at
> http://www.atmos-meas-tech.net/3/693/2010/amt-3-693-2010.html
>
> As a side effect of deciding to look into Python, I came across
> PyTables and started thinking about data storage strategies. In
> earth observation, measurements are usually stored per orbit. I
> store my collocations per day, in gzipped NetCDF files. If I want
> to get all collocations that are nadir-looking, equatorial and
> have brightness temperatures in a specified range, for one year
> of data, I need to uncompress a .nc.gz file for every day of the
> year, read the data, select the data that I need, and go on. The
> files are stored on a server; it may take 10 minutes to select
> ten thousand measurements out of ten million in total. Most of
> the time is spent reading files from the network and
> uncompressing them.
>
> Reading about PyTables gives me the impression that it can be done
> better. Am I right? How? Suppose I were to use PyTables, how would
> I best store my data? In how large chunks should I store it? One
> file per day? One file for everything? Something in between? If I
> have one file for everything, it's less easy to take a few files
> "home" to experiment with. And if I want data only for a certain
> part of the year, I don't even need to look into the directories
> containing data for other periods. But to select equatorial
> measurements throughout the year, I need to uncompress every
> single file. Which way should I go? What are the pros and cons of
> storing larger or smaller chunks in one file? If it's best to
> store it in one big file, why does nobody in this community
> appear to be doing that? It's always one file per orbit for
> level-1 and level-2 data, one per day or one per month for
> level-3 (gridded data)... are we all being inefficient, or am I
> missing something? Do all the answers depend strongly on the use
> case?
>
> Discussion or pointers to tutorials on the subject very welcome :)
>
> Gerrit.

--
Francesc Alted