Thank you all for your input. I have a lot to think about. --Tyler
On Mon, Feb 9, 2009 at 4:29 PM, Glenn Tarbox, PhD <[email protected]> wrote:

> On Mon, Feb 9, 2009 at 12:16 PM, Robert Bradshaw <[email protected]> wrote:
>
>> On Feb 9, 2009, at 10:28 AM, Dag Sverre Seljebotn wrote:
>>
>> > Tyler Brough wrote:
>> >> Hello All,
>> >>
>> >> I am new to Cython, though I heard about it long ago. I have been
>> >> programming in Python using scipy and pytables a lot. I am building
>> >> an HDF5 database using pytables from extremely large files
>> >> containing tick-by-tick stock data (a single day is several GBs).
>> >> As you can imagine, this is very slow going, even when trying to
>> >> optimize the file I/O in Python. What I am wondering is: will doing
>> >> the file I/O in Cython functions speed up the process enough to
>> >> make it worthwhile? Can I simply access C FILE pointers directly in
>> >> Cython and pass them back as Python objects? Your advice is very
>> >> much appreciated!
>> >
>> > A quick look at the PyTables source reveals that they have plenty of
>> > C files there, and also some Pyrex and/or Cython files. The PyTables
>> > authors have likely thought of all the obvious ways to speed things
>> > up already...
>> >
>> > But to answer your question: from Cython you can call all the C I/O
>> > functions directly, but a C FILE pointer cannot be passed into
>> > Python. You might be able to access the underlying handle of Python
>> > file objects, if that's what you mean; you should look in the Python
>> > headers and CPython documentation for that.
>>
>> You can pass file objects around in Python, which are basically
>> wrappers around C FILE pointers (which can easily be extracted and
>> used directly from Cython code). As has already been mentioned, it is
>> unclear if this will provide speed gains (especially if your
>> application is I/O bound rather than CPU bound).
>
> The theme from Love Story keeps playing in my head as I read...
>
> First, Cython will provide substantial advantages throughout your
> project if you're dealing with tick data and Python. You're gonna have
> places where numpy / PyTables abstractions aren't sufficient, so you'll
> need your own code. If you're focusing on Python, you need a mechanism
> to prototype and migrate into native code easily... Cython is the
> answer to that.
>
> However, having worked (and continuing to work) with tick data, it's
> not clear that PyTables / HDF5 for the lowest levels of your
> application is the way to go. I also started down that path and decided
> that I needed to go native... But that really means "other stuff" that
> works with the usual Python subjects.
>
> The real issue with tick data is: "Everything is Huge". Python and the
> library abstractions provided by Numpy, HDF5, PyTables etc. are
> generally fine if you fit perfectly inside. For most, when it isn't
> perfect, a few lines of Python and all is right with the world...
>
> not so much for you...
>
> You should take a quick look at the finance package in Sage. Inside,
> you'll find a small set of timeseries classes with support for related
> operations, written by William Stein. William and I had spent some time
> discussing the issues with tick data, and he was able to produce much
> of the core capability in a few days. That alone should speak volumes
> about Cython (although William is, um, let's say "more productive" than
> most... to put it mildly).
>
> When William wrote those classes, he determined that the abstractions
> afforded by Numpy were inappropriate for time series due to Numpy's
> generality, which manifests itself at run time (in compiled C code) as
> a number of calculations / logic to make Numpy do Numpy things. When
> you're working with a huge block of tick data, this overhead is gonna
> hurt.
>
> So, his approach was to facilitate conversion between regular Python
> lists, numpy arrays, and the tick data structure, which is a 1d block
> of doubles. (malloc...)
>
> The key to Python (really C-Python) is that it was written to be glue
> over C. Cython is many things, but amongst them is making the language
> transition simpler by hiding the details of integrating with the Python
> interpreter (you certainly know this).
>
> So, I think you're gonna find that you're better off "going native" at
> the tick level because of the Hugeness issue. You'll likely save
> development time, and certainly execution time, with low-level code
> specific to your problem (maybe mmap'ed blocks of doubles to address
> the file issues discussed above).
>
> Cython makes this a tolerable experience.
>
> -glenn
>
>> - Robert
>>
>> _______________________________________________
>> Cython-dev mailing list
>> [email protected]
>> http://codespeak.net/mailman/listinfo/cython-dev
>
> --
> *** Note: New Number
> Glenn H. Tarbox, PhD || 206-274-6919
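For anyone following the thread, here is a minimal sketch of the "mmap'ed blocks of doubles" idea Glenn mentions, in plain Python rather than Cython. It is only an illustration under assumptions: the file layout (raw native-endian 8-byte doubles, no header) and the helper names `write_ticks` and `sum_ticks` are hypothetical and do not come from PyTables, Sage, or anyone's actual code.

```python
import mmap
from array import array


def write_ticks(path, prices):
    """Write prices to `path` as raw native-endian 8-byte doubles (assumed layout)."""
    with open(path, "wb") as f:
        array("d", prices).tofile(f)


def sum_ticks(path):
    """Memory-map `path` and return (count, sum) of the doubles it holds.

    The file is viewed in place as a 1d block of doubles; nothing is
    copied into a Python list first. The file must be non-empty and its
    size a multiple of 8 bytes.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Both views must be released before the mmap can be closed,
            # hence the nested context managers.
            with memoryview(mm) as raw, raw.cast("d") as view:
                return len(view), sum(view)
```

In Cython the same block could be handed to typed C loops; the pure-Python `sum` here only shows that the mapped bytes can be read as doubles without an explicit read() pass over the file.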
