On Mon, Oct 19, 2015 at 9:51 PM, <josef.p...@gmail.com> wrote:

>
>
> On Mon, Oct 19, 2015 at 9:15 PM, Nathaniel Smith <n...@pobox.com> wrote:
>
>> On Mon, Oct 19, 2015 at 5:55 AM,  <josef.p...@gmail.com> wrote:
>> >
>> >
>> > On Mon, Oct 19, 2015 at 2:14 AM, Nathaniel Smith <n...@pobox.com> wrote:
>> >>
>> >> On Sun, Oct 18, 2015 at 9:35 PM,  <josef.p...@gmail.com> wrote:
>> >> >>>> np.column_stack((np.ones(10), np.ones(10))).flags
>> >> >   C_CONTIGUOUS : True
>> >> >   F_CONTIGUOUS : False
>> >> >
>> >> >>>> np.__version__
>> >> > '1.9.2rc1'
>> >> >
>> >> >
>> >> > On my notebook, which has numpy 1.6.1, it is F-contiguous.
>> >> >
>> >> >
>> >> > I was just trying to optimize a loop over variable adjustment in
>> >> > regression, and found out that we lost Fortran contiguity.
>> >> >
>> >> > I always thought column_stack was for Fortran usage (linalg).
>> >> >
>> >> > What's the alternative?
>> >> > column_stack was one of my favorite commands, and I always assumed
>> >> > we had the right memory layout in statsmodels to call the linalg
>> >> > libraries.
>> >> >
>> >> > ("assumed" means we don't have timing or unit tests for it.)
>> >>
>> >> In general practice no numpy functions make any guarantee about memory
>> >> layout, unless that's explicitly a documented part of their contract
>> >> (e.g. 'ascontiguous', or some functions that take an order= argument
>> >> -- I say "some" b/c there are functions like 'reshape' that take an
>> >> argument called order= that doesn't actually refer to memory layout).
>> >> This isn't so much an official policy as just a fact of life -- if
>> >> no-one has any idea that someone is depending on some memory
>> >> layout detail, then there's no way to realize that we've broken
>> >> something. (But it is a good policy, IMO.)
>> >
>> >
>> > I understand that in general.
>> >
>> > However, I always thought column_stack was an array creation function
>> > which has a guaranteed memory layout. And since it stacks by columns,
>> > I thought that the order is always Fortran.
>> > And the fact that it doesn't have an order keyword yet, I thought was
>> > just a missing extension.
>>
>> I guess I don't know what to say except that I'm sorry to hear that
>> and sorry that no-one noticed until several releases later.
>>
>
>
> Were there more contiguity changes in 1.10?
> I just saw a large number of test errors and failures in statespace models
> which are heavily based on cython code where it's not just a question of
> performance.
>
> I don't know yet what's going on, but I just saw that we have some
> explicit tests for fortran contiguity which just started to fail.
>
>
>
>
>>
>> >> If this kind of problem gets caught during a pre-release cycle then we
>> >> generally do try to fix it, because we try not to break code, but if
>> >> it's been broken for 2 full releases then there's not much we can do --
>> >> we can't go back in time to fix it so it sounds like you're stuck
>> >> working around the problem no matter what (unless you want to refuse
>> >> to support 1.9.0 through 1.10.1, which I assume you don't... worst
>> >> case, you just have to do a global search replace of np.column_stack
>> >> with statsmodels.utils.column_stack_f, right?).
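
A helper of the kind suggested above could look roughly like this (a minimal sketch; `column_stack_f` is the hypothetical name from the quote, and this is one possible implementation, not statsmodels' actual code):

```python
import numpy as np

def column_stack_f(arrays):
    # Hypothetical drop-in for np.column_stack that guarantees the
    # result is Fortran-contiguous, whatever np.column_stack returns.
    return np.asfortranarray(np.column_stack(arrays))

x = column_stack_f((np.ones(10), np.arange(10.0)))
print(x.flags['F_CONTIGUOUS'])   # True
```

np.asfortranarray copies only when the input is not already F-contiguous,
so the overhead is at most one extra copy per call.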
>> >>
>> >> And the regression issue seems like the only real argument for
>> >> changing it back -- we'd never guarantee f-contiguity here if starting
>> >> from a blank slate, I think?
>> >
>> >
>> > When the cat is out of the bag, the downstream developer writes
>> > compatibility code or helper functions.
>> >
>> > I will do that for at least the parts I know are intentionally
>> > designed for F memory order.
>> >
>> > ---
>> >
>> > statsmodels doesn't really check or consistently optimize the memory
>> > order, except in some cython functions.
>> > But I thought we should be doing quite well with getting Fortran
>> > ordered arrays. I only paid attention where we have more extensive
>> > loops internally.
>> >
>> > Nathaniel, does patsy guarantee memory layout (F-contiguous) when
>> > creating design matrices?
>>
>> I never thought about it :-). So: no, it looks like right now patsy
>> usually returns C-order matrices (or really, whatever np.empty or
>> np.repeat returns), and there aren't any particular guarantees that
>> this will continue to be the case in the future.
>>
>> Is returning matrices in F-contiguous layout really important? Should
>> there be a return_type="fortran_matrix" option or something like that?
>>
>
> I don't know, yet. My intuition was that it would be better because we
> feed the arrays directly to pinv/SVD or QR, which, I think, require
> Fortran-contiguous input by default.
>
> However, my intuition might not be correct, and it might not make much
> difference in a single OLS estimation.
>

I did some quick timing checks of pinv and qr, and the Fortran-ordered
version is only about 5% to 15% faster and uses about the same amount of
memory (watching the Task Manager). So, nothing to get excited about.

I used functions like this, where xf is either C- or Fortran-contiguous:

def funcl_f():
    # pinv of a growing column slice, as in the variable-selection loop
    for k in range(3, xf.shape[1]):
        np.linalg.pinv(xf[:, :k])


with MKL, which, to my surprise, appears to be multithreaded.
(It used 50% CPU on my quad-core, while a single thread would be 13% CPU.)
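
The timing setup can be made self-contained roughly like this (a minimal
sketch; the array shape and repeat count here are illustrative, not the
ones behind the numbers above):

```python
import timeit
import numpy as np

# Illustrative design matrix; C- and Fortran-contiguous copies of it.
x = np.random.default_rng(0).standard_normal((500, 20))
xc = np.ascontiguousarray(x)
xf = np.asfortranarray(x)

def loop_pinv(a):
    # pinv of a growing column slice, as in the variable-selection loop
    for k in range(3, a.shape[1]):
        np.linalg.pinv(a[:, :k])

t_c = timeit.timeit(lambda: loop_pinv(xc), number=5)
t_f = timeit.timeit(lambda: loop_pinv(xf), number=5)
print(f"C order: {t_c:.3f}s  F order: {t_f:.3f}s")
```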

BTW: the default scipy.linalg.qr is pretty bad: 9GB peak memory instead of
200MB with mode='economic'.
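
The memory difference comes from the mode: the default QR builds a full
(m, m) Q, while the economic mode returns only the (m, n) part. A minimal
sketch using NumPy's equivalent modes ('complete' vs 'reduced';
scipy.linalg.qr's mode='economic' corresponds to NumPy's 'reduced'):

```python
import numpy as np

m, n = 2000, 50                  # tall-and-skinny, like a design matrix
x = np.random.default_rng(0).standard_normal((m, n))

# Full Q is (m, m): for m around 50000 that alone is ~20 GB of float64.
q_full, r_full = np.linalg.qr(x, mode='complete')

# Reduced/economic Q is only (m, n), which is all least squares needs.
q_red, r_red = np.linalg.qr(x, mode='reduced')

print(q_full.shape, q_red.shape)   # (2000, 2000) (2000, 50)
```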

Josef



>
> There are a few critical loops in variable selection that I'm planning
> to investigate to see how much it matters.
> Memory optimization was never a high priority for us compared to
> expanding the functionality overall, but reading the Julia mailing list
> is starting to worry me a bit. :)
>
> (I'm even starting to see the reason for multiple dispatch.)
>
> Josef
>
>
>>
>> -n
>>
>> --
>> Nathaniel J. Smith -- http://vorpus.org
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>