Guido van Rossum writes:
> On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin
> <oscar.j.benja...@gmail.com> wrote:
> > On 8 September 2013 18:32, Guido van Rossum <gu...@python.org> wrote:
> >> Going over the open issues:
> >>
> >> - Parallel arrays or arrays of tuples? I think the API should require
> >> an array of tuples. It is trivial to zip up parallel arrays to the
> >> required format, while if you have an array of tuples, extracting the
> >> parallel arrays is slightly more cumbersome.
> >>
> >> Also for manipulating of the raw data, an array of tuples makes
> >> it easier to do insertions or removals without worrying about
> >> losing the correspondence between the arrays.

I don't necessarily find this persuasive. It's more common when
working with existing databases that you add variables than add
observations. This is going to require attention to the
correspondence in any case. Observations aren't added, and they're
"removed" temporarily for statistics on subsets by slicing. If you
use the same slice for all variables, you're not going to make a
mistake.

> Not really. The implementation may change, or its needs may not be
> obvious to the caller. I would say the right thing to do is request
> something easy to remember, which often means consistent. In general,
> Python APIs definitely skew towards lists of tuples rather than
> parallel arrays, and for good reasons -- that way you benefit most
> from built-in operations like slices and insert/append.

However, it's common in economic statistics to have a rectangular
array, and extract both certain rows (tuples of observations on
variables) and certain columns (variables). For example you might
have data on populations of American states from 1900 to 2012, and
extract the data on New England states from 1946 to 2012 for
analysis.

> The one argument I *haven't* heard yet which *might* sway me would be
> something along the line "every other statistics package that users
> might be familiar with does it this way" or "all the statistics
> textbooks do it this way". (Because, frankly, when it comes to
> statistics I'm a rank amateur and I really want Steven's new module to
> educate me as much as help me compute specific statistical functions.)

In economic statistics, most software traditionally inputs variables
in column-major order (i.e., parallel arrays). That said, most
software nowadays allows input as spreadsheet tables. You pays your
money and you takes your choice.

I think the example above of state population data shows that rows
and columns are pretty symmetric here.
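
To make the symmetry concrete, here is a rough sketch of taking the
same subset (some years, some states) from each layout. Everything in
it is an assumption for illustration: the figures are placeholder
numbers rather than real population data, and the helper names
subset_rows and subset_columns are hypothetical, not part of any
proposed statistics API.

```python
# Placeholder figures only: NOT real population data.
years = [2010, 2011, 2012]
states = ["ME", "VT", "TX"]

# Column-major layout ("parallel arrays"): one sequence per variable.
columns = {
    "ME": [1.0, 1.1, 1.2],
    "VT": [2.0, 2.1, 2.2],
    "TX": [3.0, 3.1, 3.2],
}

# Row-major layout ("array of tuples"): one tuple per observation (year).
# zip(*...) transposes the parallel arrays into tuples.
rows = list(zip(*(columns[s] for s in states)))


def subset_rows(rows, years, states, want_years, want_states):
    """Pick observations by year and variables by state from row tuples."""
    col_idx = [states.index(s) for s in want_states]
    return [
        tuple(row[j] for j in col_idx)
        for year, row in zip(years, rows)
        if year in want_years
    ]


def subset_columns(columns, years, want_years, want_states):
    """Take the same subset from parallel arrays, one index list for all."""
    keep = [i for i, year in enumerate(years) if year in want_years]
    return {s: [columns[s][i] for i in keep] for s in want_states}


want_years = {2011, 2012}
want_states = ["ME", "VT"]

by_rows = subset_rows(rows, years, states, want_years, want_states)
by_cols = subset_columns(columns, years, want_years, want_states)

# Transposing the column result gives exactly the row result.
assert list(zip(*(by_cols[s] for s in want_states))) == by_rows
```

Either way you end up indexing along both axes, and the same
transposition with zip converts between the two layouts in each
direction.
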
Many databases will have "too many" of both, and you'll want to
"slice" both to get the sample and variables relevant to your
analysis.

This is all just for consideration; I am quite familiar with economic
statistics and software, but not so much for that used in sociology,
psychology, and medical applications. In the end, I think it's best
to leave it up to Steven's judgment as to what is convenient for him
to maintain.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev