I prototyped an approach last year that worked out well. I don't really know what to call it; maybe something like "property-based persistence." It is kind of strange, and I am not sure how broadly applicable it is: I have only used it for financial time-series data.
I'll try to explain how the idea works. I start with a Python object that has a number of properties and an associated large data set (in my case, financial instruments and their time series in the form of numpy arrays). I then created infrastructure that allowed me to define a simple "mapper" function that used a subset of the object's properties to define a "path" (expressible either as a file-system path or as a path to a table in HDF). The bulky data set (again, time series in my case) is then persisted at that location.

This little piece of infrastructure is very lightweight and cuts the client-side persistence code down to only the small mapper functions. The mapper functions don't actually build up paths; they just specify which properties, and in what order, to use to build the paths. It also makes querying very simple and fast, because you don't really query at all: the properties associated with the query directly express the path at which the data is located.

The drawback of this simplistic approach is that you need to add a second level of path addressing if you deal with datasets so large that you cannot really persist them under a single path. If you have single multi-GB or multi-TB arrays, you probably want to chunk things up a bit more in the style of GFS and its open-source counterparts.

I still have the Python code for this property-based time-series database. It is a very small and simple piece of code, but I am happy to give it a quick polish and open source it if anyone is interested in taking a look. I am also about to try this model using F# and db4o for a .NET project.

On Wed, Dec 24, 2008 at 2:21 PM, Gael Varoquaux <[email protected]> wrote:

> On Tue, Dec 23, 2008 at 02:10:50AM +0100, Olivier Grisel wrote:
> > Interesting topic indeed. I think I have been hit with similar problems
> > on toy experimental scripts.
> > So far the solution was always ad hoc FS caches of numpy arrays with
> > manual filename management. Maybe the first step for designing a
> > generic solution would be to list some representative yet simple enough
> > use cases with real sample Python code, so as to focus on concrete
> > matters and avoid over-engineering a general solution for philosophical
> > problems.
>
> Yes, that's clearly a first step: list the use cases and the way we would
> like them solved: think about the API.
>
> My internet connection is quite random currently, and I'll probably lose
> it for a week any time soon. Do you want to start such a page on the
> wiki? Mark it as a scratch page, and we'll delete it later.
>
> I should point out that joblib (on PyPI and Launchpad) was a first
> attempt to solve this problem, so you could have a look at it. I have
> already identified things that are wrong with joblib (more on the API
> side than actual bugs), so I know it is not a final solution. Figuring
> out what was wrong only came from using it heavily in my work. I think
> the only way forward is to start something, use it, figure out what's
> wrong, and start again...
>
> Looking forward to your input,
>
> Gaël
> _______________________________________________
> Numpy-discussion mailing list
> [email protected]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
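P.S. Here is a minimal sketch of the property-based persistence scheme I described above, in case it helps make the idea concrete. All names (`Instrument`, `mapper`, the directory layout) are illustrative assumptions, not the actual code; the real version persisted into HDF as well as the file system.

```python
import os
import numpy as np

class Instrument:
    """Hypothetical example object: a few properties plus a bulky array."""
    def __init__(self, asset_class, exchange, symbol, series):
        self.asset_class = asset_class
        self.exchange = exchange
        self.symbol = symbol
        self.series = series  # the associated large numpy array

# The "mapper" does not build paths itself; it only names which
# properties to use, and in what order, to derive the path.
def mapper(obj):
    return ("asset_class", "exchange", "symbol")

def path_for(obj, root):
    """Derive the storage path directly from the object's properties."""
    parts = [str(getattr(obj, name)) for name in mapper(obj)]
    return os.path.join(root, *parts) + ".npy"

def save(obj, root):
    """Persist the bulky array at the property-derived location."""
    path = path_for(obj, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    np.save(path, obj.series)
    return path

def load(obj, root):
    # "Querying" is just recomputing the same path from the properties;
    # no search or index lookup is involved.
    return np.load(path_for(obj, root))
```

Usage is symmetric: `save(inst, "tsdb")` writes to e.g. `tsdb/equity/NYSE/ABC.npy`, and `load` on an object with the same properties reads it straight back.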
