Re: [Cython] Speeding up file io

Tyler Brough Mon, 09 Feb 2009 16:06:41 -0800

Thank you all for your input.  I have a lot to think about.

--Tyler


On Mon, Feb 9, 2009 at 4:29 PM, Glenn Tarbox, PhD <[email protected]> wrote:

>
>
> On Mon, Feb 9, 2009 at 12:16 PM, Robert Bradshaw <
> [email protected]> wrote:
>
>> On Feb 9, 2009, at 10:28 AM, Dag Sverre Seljebotn wrote:
>>
>> >> Tyler Brough wrote:
>> >> Hello All,
>> >>
>> >> I am new to Cython, though I heard about it long ago.  I have been
>> >> programming in python using scipy and pytables a lot. I am building an
>> >> HDF5 database using pytables from extremely large files containing tick
>> >> by tick stock data (one single day is several GBs). As you can imagine
>> >> this is very slow going, even trying to optimize the file io in python.
>> >> What I am wondering is will trying to do the file io in cython functions
>> >> speed up the process enough to make it worth while?  Can I simply access
>> >> c FILE pointers directly in cython and pass them back as python objects?
>> >> Your advice is very much appreciated!
>> >
>> > A quick look at the PyTables source reveals that they have plenty of C
>> > files there, and also some Pyrex and/or Cython files. The PyTables
>> > authors have likely thought of all the obvious ways to speed things up
>> > already...
>> >
>> > But to answer your question: from Cython you can call all the C IO
>> > functions directly, but a C FILE pointer cannot be passed into Python.
>> > You might be able to access the underlying handle of Python file
>> > objects if that's what you mean; you should look in the Python headers
>> > and CPython documentation for that.
>>
>> You can pass file objects around in Python, which are basically
>> wrappers around c FILE pointers (which can easily be extracted and
>> used directly from Cython code). As has already been mentioned, it is
>> unclear if this will provide speed gains (especially if your
>> application is io bound rather than cpu bound).
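One rough way to check the io-bound versus cpu-bound question raised above is to time the raw read separately from the parsing. The following is only a sketch in plain Python: the "price,size" line layout, the file name, and both function names are assumptions made for illustration, not the poster's actual format. Note that the operating system's page cache can make a second read look much faster than the first, and that a multi-GB file should be sampled rather than read whole like this.

```python
import os
import tempfile
import time


def time_read_and_parse(path):
    """Return (read_seconds, parse_seconds, row_count) for a CSV-like tick file."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()  # pure file io: bytes from disk (or the page cache)
    t1 = time.perf_counter()

    rows = []
    for line in raw.decode("ascii").splitlines():
        price, size = line.split(",")  # hypothetical "price,size" layout
        rows.append((float(price), int(size)))
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, len(rows)


def write_sample(path, n):
    """Write n synthetic ticks so the sketch has something to measure."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{100.0 + i * 0.01:.2f},{(i % 9 + 1) * 100}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        sample = os.path.join(d, "ticks.csv")
        write_sample(sample, 200_000)
        read_s, parse_s, n = time_read_and_parse(sample)
        print(f"{n} rows: read {read_s:.4f}s, parse {parse_s:.4f}s")
```

If the parse time dominates, moving the parsing into compiled code is more likely to pay off; if the read time dominates, faster parsing alone will probably not help much.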
>
>
> The theme from Love Story keeps playing in my head as I read...
>
> First, Cython will provide substantial advantages throughout your project
> if you're dealing with tick data and python.  You're gonna have places where
> numpy / PyTables abstractions aren't sufficient so you'll need your own
> code.  If you're focusing on python, you need a mechanism to prototype and
> migrate into native code easily... Cython is the answer to that.
>
> However, having worked (and continuing to work) with tick data, it's not
> clear that PyTables / HDF5 for the lowest levels of your application is the
> way to go.  I also started down that path and decided that I needed to go
> native...  But that really means "other stuff" that works with the usual
> Python subjects.
>
> The real issue with tick data is: "Everything is Huge".  Python and the
> library abstractions provided by Numpy, HDF5, PyTables etc are generally
> fine if you fit perfectly inside.  For most, when it isn't perfect, a few
> lines of python and all is right with the world...
>
> not so much for you...
>
> You should take a quick look at the finance package in Sage.  Inside,
> you'll find a small set of timeseries classes with support for related
> operations written by William Stein.  William and I had spent some time
> discussing the issues with tick-data and he was able to produce much of the
> core capability in a few days.  That alone should speak volumes about Cython
> (although William is, um, let's say "more productive" than most... to put it
> mildly).
>
> When William wrote those classes, he determined that the abstractions
> afforded by Numpy were inappropriate for time-series due to Numpy's
> generality, which manifests itself at run time (in compiled C code) with a
> number of calculations / logic to make Numpy do Numpy things.  When you're
> working with a huge block of tick data, this overhead is gonna hurt.
>
> So, his approach was to facilitate conversion between regular python lists,
> numpy arrays, and the tick data structure which is a 1d block of doubles.
> (malloc...)
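The idea of a flat 1d block of doubles that converts to and from ordinary lists can be approximated in pure Python with the standard library's `array` module. This is a stand-in chosen for illustration, not the Sage timeseries class or its API; the helper names `to_block` and `simple_returns` are hypothetical.

```python
from array import array


def to_block(prices):
    """Copy a Python list of numbers into one contiguous block of C doubles."""
    return array("d", prices)


def simple_returns(block):
    """Return period-over-period returns as another block of doubles."""
    return array("d", (block[i + 1] / block[i] - 1.0
                       for i in range(len(block) - 1)))


if __name__ == "__main__":
    block = to_block([100.0, 110.0, 99.0])
    returns = simple_returns(block)
    # tolist() converts back to a regular Python list when needed.
    print(block.itemsize, returns.tolist())
```

An `array("d")` stores raw doubles with no per-element Python objects, which is the property the malloc'ed block relies on. It also exposes the buffer protocol, so code that understands buffers can use the same memory without copying.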
>
> The key to Python (really C-Python) is that it was written to be glue over
> C.  Cython is many things, but amongst them is making the language
> transition simpler by hiding the details of integrating with the Python
> interpreter (you certainly know this).
>
> So, I think you're gonna find that you're better off "going native" at the
> tick level because of the Hugeness issue.  You'll likely save development
> and certainly execution time with low level code specific to your problem
> (maybe mmap'ed blocks of doubles to address the file issues discussed above).
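A minimal sketch of the "mmap'ed blocks of doubles" idea, using only the Python standard library. It assumes the file holds nothing but native-endian 8-byte doubles (so its size is a multiple of 8 and it is not empty); the file name and the helper names `write_doubles` and `mean_of_mapped_doubles` are made up for the example.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(path, values):
    """Dump values to disk as raw native doubles, with no header."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mean_of_mapped_doubles(path):
    """Map the file read-only and average it without copying it into a list."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Both views must be released before the mmap can be closed;
            # the with statement releases them in reverse order.
            with memoryview(mm) as raw, raw.cast("d") as doubles:
                return sum(doubles) / len(doubles)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ticks.bin")
        write_doubles(path, [1.0, 2.0, 3.0, 6.0])
        print(mean_of_mapped_doubles(path))
```

The operating system pages the data in on demand, so only the parts of the file that are actually touched need to be resident in memory. The Python loop inside `sum` is still slow for huge files; the same mapped buffer could be handed to compiled code instead.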
>
> Cython makes this a tolerable experience.
>
> -glenn
>
>
>>
>>
>> - Robert
>>
>> _______________________________________________
>> Cython-dev mailing list
>> [email protected]
>> http://codespeak.net/mailman/listinfo/cython-dev
>>
>
>
>
> --
> *** Note: New Number
> Glenn H. Tarbox, PhD ||  206-274-6919
>
