On Mon, Oct 19, 2015 at 9:51 PM, <josef.p...@gmail.com> wrote: > > > On Mon, Oct 19, 2015 at 9:15 PM, Nathaniel Smith <n...@pobox.com> wrote: > >> On Mon, Oct 19, 2015 at 5:55 AM, <josef.p...@gmail.com> wrote: >> > >> > >> > On Mon, Oct 19, 2015 at 2:14 AM, Nathaniel Smith <n...@pobox.com> wrote: >> >> >> >> On Sun, Oct 18, 2015 at 9:35 PM, <josef.p...@gmail.com> wrote: >> >> >>>> np.column_stack((np.ones(10), np.ones(10))).flags >> >> > C_CONTIGUOUS : True >> >> > F_CONTIGUOUS : False >> >> > >> >> >>>> np.__version__ >> >> > '1.9.2rc1' >> >> > >> >> > >> >> > on my notebook which has numpy 1.6.1 it is f_contiguous >> >> > >> >> > >> >> > I was just trying to optimize a loop over variable adjustment in >> >> > regression, >> >> > and found out that we lost fortran contiguity. >> >> > >> >> > I always thought column_stack is for fortran usage (linalg) >> >> > >> >> > What's the alternative? >> >> > column_stack was one of my favorite commands, and I always assumed we >> >> > have >> >> > in statsmodels the right memory layout to call the linalg libraries. >> >> > >> >> > ("assumed" means we don't have timing nor unit tests for it.) >> >> >> >> In general practice no numpy functions make any guarantee about memory >> >> layout, unless that's explicitly a documented part of their contract >> >> (e.g. 'ascontiguous', or some functions that take an order= argument >> >> -- I say "some" b/c there are functions like 'reshape' that take an >> >> argument called order= that doesn't actually refer to memory layout). >> >> This isn't so much an official policy as just a fact of life -- if >> >> no-one has any idea that the someone is depending on some memory >> >> layout detail then there's no way to realize that we've broken >> >> something. (But it is a good policy IMO.) >> > >> > >> > I understand that in general. >> > >> > However, I always thought column_stack is a array creation function >> which >> > have guaranteed memory layout. And since it's stacking by columns I >> thought >> > that order is always Fortran. >> > And the fact that it doesn't have an order keyword yet, I thought is >> just a >> > missing extension. >> >> I guess I don't know what to say except that I'm sorry to hear that >> and sorry that no-one noticed until several releases later. >> > > > Were there more contiguity changes in 0.10? > I just saw a large number of test errors and failures in statespace models > which are heavily based on cython code where it's not just a question of > performance. > > I don't know yet what's going on, but I just saw that we have some > explicit tests for fortran contiguity which just started to fail. > > > > >> >> >> If this kind of problem gets caught during a pre-release cycle then we >> >> generally do try to fix it, because we try not to break code, but if >> >> it's been broken for 2 full releases then there's no much we can do -- >> >> we can't go back in time to fix it so it sounds like you're stuck >> >> working around the problem no matter what (unless you want to refuse >> >> to support 1.9.0 through 1.10.1, which I assume you don't... worst >> >> case, you just have to do a global search replace of np.column_stack >> >> with statsmodels.utils.column_stack_f, right?). >> >> >> >> And the regression issue seems like the only real argument for >> >> changing it back -- we'd never guarantee f-contiguity here if starting >> >> from a blank slate, I think? >> > >> > >> > When the cat is out of the bag, the down stream developer writes >> > compatibility code or helper functions. >> > >> > I will do that at at least the parts I know are intentionally designed >> for F >> > memory order. >> > >> > --- >> > >> > statsmodels doesn't really check or consistently optimize the memory >> order, >> > except in some cython functions. >> > But, I thought we should be doing quite well with getting Fortran >> ordered >> > arrays. I only paid attention where we have more extensive loops >> internally. >> > >> > Nathniel, Does patsy guarantee memory layout (F-contiguous) when >> creating >> > design matrices? >> >> I never thought about it :-). So: no, it looks like right now patsy >> usually returns C-order matrices (or really, whatever np.empty or >> np.repeat returns), and there aren't any particular guarantees that >> this will continue to be the case in the future. >> >> Is returning matrices in F-contiguous layout really important? Should >> there be a return_type="fortran_matrix" option or something like that? >> > > I don't know, yet. My intuition was that it would be better because we > feed the arrays directly to pinv/SVD or QR which, I think, require by > default Fortran contiguous. > > However, my intuition might not be correct, and it might not make much > difference in a single OLS estimation. >
I did some quick timing checks of pinv and qr, and the Fortran ordered is only about 5% to 15% faster and uses about the same amount of memory (watching the Task manager). So, nothing to get excited about. I used functions like this where xf is either C or Fortran contiguous def funcl_f(): for k in range(3, xf.shape[1]): np.linalg.pinv(xf[:,:k]) with MKL which looks, to my surprise, multithreaded. (It used 50% CPU on my Quadcore, single processor is 13% CPU) BTW: default scipy.linalg.qr is pretty bad, 9GB peak memory instead of 200MB in mode='economic' Josef > > There are a few critical loops in variable selection that I'm planning to > investigate to see how much it matters. > Memory optimization was never high in our priority compared to expanding > the functionality overall, but reading the Julia mailing list is starting > to worry me a bit. :) > > (I'm even starting to see the reason for multiple dispatch.) > > Josef > > >> >> -n >> >> -- >> Nathaniel J. Smith -- http://vorpus.org >> _______________________________________________ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion >> > >
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion