Re: [Cython] Speeding up file io

Glenn Tarbox, PhD Mon, 09 Feb 2009 15:29:43 -0800

On Mon, Feb 9, 2009 at 12:16 PM, Robert Bradshaw <
[email protected]> wrote:

> On Feb 9, 2009, at 10:28 AM, Dag Sverre Seljebotn wrote:
>
> > Tyler Brough wrote:
> >> Hello All,
> >>
> >> I am new to Cython, though I heard about it long ago.  I have been
> >> programming in python using scipy and pytables a lot. I am building an
> >> HDF5 database using pytables from extremely large files containing tick
> >> by tick stock data (one single day is several GBs). As you can imagine
> >> this is very slow going, even trying to optimize the file io in python.
> >> What I am wondering is will trying to do the file io in cython functions
> >> speed up the process enough to make it worth while?  Can I simply access
> >> c FILE pointers directly in cython and pass them back as python objects?
> >> Your advice is very much appreciated!
> >
> > A quick look on the pytables source reveal that they have plenty of C
> > files there, and also some Pyrex and/or Cython files. The Pytable authors
> > have likely thought of all the obvious ways to speed things up already...
> >
> > But to answer your question; from Cython you can call all the C IO
> > functions directly, but a C FILE pointer cannot be passed into Python. You
> > might be able to access the underlying handle of Python file objects if
> > that's what you mean, you should look in the Python headers and CPython
> > documentation for that.
>
> You can pass file objects around in Python, which are basically
> wrappers around c FILE pointers (which can easily be extracted and
> used directly from Cython code). As has already been mentioned, it is
> unclear if this will provide speed gains (especially if your
> application is io bound rather than cpu bound).

The theme from Love Story keeps playing in my head as I read...

First, Cython will provide substantial advantages throughout your project if
you're dealing with tick data and python.  You're gonna have places where
numpy / PyTables abstractions aren't sufficient so you'll need your own
code.  If you're focusing on python, you need a mechanism to prototype and
migrate into native code easily... Cython is the answer to that.

However, having worked (and continuing to work) with tick data, it's not clear
that PyTables / HDF5 for the lowest levels of your application is the way
to go.  I also started down that path and decided that I needed to go
native...  But that really means "other stuff" that works with the usual
Python subjects.

The real issue with tick data is: "Everything is Huge".  Python and the
library abstractions provided by Numpy, HDF5, PyTables etc are generally
fine if you fit perfectly inside.  For most, when it isn't perfect, a few
lines of python and all is right with the world...

not so much for you...

You should take a quick look at the finance package in Sage.  Inside, you'll
find a small set of timeseries classes with support for related operations
written by William Stein.  William and I had spent some time discussing the
issues with tick-data and he was able to produce much of the core capability
in a few days.  That alone should speak volumes about Cython (although
William is, um, let's say "more productive" than most... to put it mildly).

When William wrote those classes, he determined that the abstractions
afforded by Numpy were inappropriate for time-series due to Numpy's
generality, which manifests itself at run time (in compiled C code) with a
number of calculations / logic to make Numpy do Numpy things.  When you're
working with a huge block of tick data, this overhead is gonna hurt.

So, his approach was to facilitate conversion between regular python lists,
numpy arrays, and the tick data structure which is a 1d block of doubles.
(malloc...)
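
To make the "1d block of doubles" idea concrete, here is a minimal sketch in
plain Python using only the standard library.  It is not William's Sage code;
the names ticks_from_list and simple_moving_average are made up for
illustration, and array("d") stands in for the malloc'ed buffer a Cython class
would own.

```python
from array import array


def ticks_from_list(prices):
    """Copy a Python list of numbers into one contiguous block of C doubles."""
    # Hypothetical helper: array("d") stores unboxed C doubles back to back,
    # which is the closest stdlib analogue to a malloc'ed double buffer.
    return array("d", prices)


def simple_moving_average(ticks, window):
    """Moving average over a flat block of doubles using a running sum."""
    if window <= 0 or window > len(ticks):
        raise ValueError("window must be between 1 and len(ticks)")
    out = array("d")
    running = 0.0
    for i, value in enumerate(ticks):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= ticks[i - window]
        if i >= window - 1:
            out.append(running / window)
    return out


prices = [10.0, 10.5, 11.0, 10.75]
ticks = ticks_from_list(prices)      # list -> block of doubles
round_trip = ticks.tolist()          # block of doubles -> list
view = memoryview(ticks)             # zero-copy view via the buffer protocol
sma = simple_moving_average(array("d", [1.0, 2.0, 3.0, 4.0]), 2)
```

Because the block exposes the buffer protocol, something like
numpy.frombuffer(ticks, dtype=numpy.float64) should give a NumPy view of the
same memory when NumPy is needed.  In Cython the loop body would be typed
(cdef double, cdef Py_ssize_t) so it compiles down to plain C arithmetic.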

The key to Python (really CPython) is that it was written to be glue over
C.  Cython is many things, but amongst them is making the language
transition simpler by hiding the details of integrating with the Python
interpreter (you certainly know this).

So, I think you're gonna find that you're better off "going native" at the
tick level because of the Hugeness issue.  You'll likely save development
and certainly execution time with low level code specific to your problem
(maybe mmap'ed blocks of doubles to address the file issues discussed above).
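
As a rough sketch of the mmap'ed-doubles idea, assuming a made-up file layout
of raw native-endian doubles with no header (write_ticks and sum_ticks_mmap
are hypothetical names, not from any library):

```python
import mmap
import os
import tempfile
from array import array


def write_ticks(path, values):
    """Write values to disk as raw native-endian C doubles (no header)."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def sum_ticks_mmap(path):
    """Map the file read-only and sum it as doubles, without read() copies."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as C doubles.  The views are
            # released before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("d") as doubles:
                return sum(doubles)


path = os.path.join(tempfile.mkdtemp(), "ticks.bin")
write_ticks(path, [1.0, 2.5, 4.0])
total = sum_ticks_mmap(path)
```

The file size has to be a multiple of the double size for the cast to work,
and the data is only portable between machines with the same byte order.  The
same mapping could be handed to typed Cython code as a double pointer so the
inner loops never touch Python objects.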

Cython makes this a tolerable experience.

-glenn

>
>
> - Robert
>
> _______________________________________________
> Cython-dev mailing list
> [email protected]
> http://codespeak.net/mailman/listinfo/cython-dev
>

-- 
*** Note: New Number
Glenn H. Tarbox, PhD ||  206-274-6919

