Hi Eike,

Eike Rathke wrote:
Please note that before we can integrate code or data contributed we need a signed Joint Copyright Assignment form (JCA) filled-out, see http://contributing.openoffice.org/programming.html#jca

Today I sent the completed JCA to [EMAIL PROTECTED] (I can also attach a copy of the JCA to this e-mail if needed).

Furthermore, this isn't Fortran ;-) and it
would be much more eye-friendly if you fixed your CapsLock key and
refrained from using only capitalized letters in comments. Please use
normal capitalization instead. Thanks.
Btw, code is much more readable if you align the trailing comments (and
use proper capitalization, of course ;-)

Well, I do not know what the development environment of professional programmers looks like. I currently use (and wrote this code in) the free jEdit (http://sourceforge.net/projects/jedit/).

What I noted, however, is that the Calc code lacks comments almost completely. ;-) It is quite difficult to work out what all the code does without guiding comments.

My comments actually look decent in jEdit. I like them best as they are. I definitely do *NOT* like lower-case comments, because they confuse me: is it code, or is it still a comment? Obviously, you cannot write all the code in upper case, so to distinguish comments from code I write the comments in upper case. I feel the code is substantially more readable this way. (You only need to read the comments once to understand what is going on, but you need a full view of the code almost continuously.)

// WE GET EITHER A SINGLE MATRIX WHERE EVERY COLUMN IS A SEPARATE VARIABLE
//    DISADVANTAGE: ONLY ONE COLUMN PER VARIABLE
// OR MULTIPLE MATRICES, EACH MATRIX IS ONE VARIABLE
//    DISADVANTAGE:
//      CALC FUNCTIONS ACCEPT ONLY 30 PARAMS
//      SO THERE ARE AT MOST 30 VARIABLES

Not quite true. The UI in the Formula AutoPilot knows only 30 parameters
at maximum, the compiler and interpreter actually can handle more. ... But 
designing the parameters is more a question of how
other spreadsheet applications do it. We should follow that.

*gnumeric*: has 2 modes (actually 3: columns swapped for rows, too)
 1.) every column is one variable (my case 2: the single-range scenario)
 2.) every area is one variable (my case 1: multiple selection ranges)

So gnumeric has it both ways.


*R*
R is complex. It is NOT graphically oriented. Basically, you have only *ONE data vector* for such a simple ANOVA, BUT it contains *ALL* the values for *ALL* variables (which is counterintuitive for a novice).

A 2nd vector of the same length as the first matches those values to the variable they belong to, e.g.
vector1 = (val1, val2, val3, val4, val5, ..., val100)
vector2 = (1, 1, 1, 1, 2, 2, ..., 10)
where val1-4 are data points for the first variable, val5-... are data points for variable 2, and val...-100 are data points for variable 10.

Vector2 MUST also be a factor (and NOT a numeric vector). When you perform the ANOVA, you do a linear fit of vector1 on vector2: "aov(vector1 ~ vector2)". Quite complex for beginners.
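For comparison, R's layout can be mirrored in C++ as a minimal sketch (the function and vector names below are my own illustration, neither R nor Calc API): one flat vector of all values plus a parallel group vector playing the role of R's factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal sketch of R's aov() data layout: one flat vector of all values
// plus a parallel group vector (R's "factor") telling which variable each
// value belongs to. Groups are assumed to be numbered 0..nGroups-1.
std::vector<double> groupMeans(const std::vector<double>& values,
                               const std::vector<std::size_t>& group,
                               std::size_t nGroups)
{
    assert(values.size() == group.size());
    std::vector<double> sum(nGroups, 0.0);
    std::vector<std::size_t> count(nGroups, 0);
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        sum[group[i]] += values[i]; // accumulate per-group totals
        ++count[group[i]];
    }
    for (std::size_t g = 0; g < nGroups; ++g)
        sum[g] /= static_cast<double>(count[g]); // turn totals into means
    return sum;
}
```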

Conclusion
==========
So, I feel that both methods MUST be present:
1. I usually find it simpler to have one range, so this option should definitely exist.
2. However, the variables are often scattered across the sheet, so you do NOT have a single selection but multiple areas (gnumeric solved this nicely), AND therefore we need something that accepts and interprets discontinuous selections.

For passing area references or arrays/matrices you may also want to take
a look ...

I actually looked at code for other functions like 'ScInterpreter::ScPearson()', worked out what those functions do, and implemented a way to get the data based on that template. Unfortunately, I have NOT found a clear description of what every function does. Maybe there were better ways to implement it.

What I really need is either *one range* (for 2nd case) or *an array of ranges* (for 1st case).
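To illustrate how the two shapes could share one code path, here is a sketch (names are my own, not Calc API) that normalizes the single-range case into one value vector per variable; the array-of-ranges case already arrives in that form.

```cpp
#include <cstddef>
#include <vector>

// Sketch: split a single row-major matrix (nRows x nCols) into one value
// vector per column, i.e. per variable. After this step, the one-range case
// and the array-of-ranges case can feed the same ANOVA code.
std::vector<std::vector<double>> splitColumns(const std::vector<double>& mat,
                                              std::size_t nRows,
                                              std::size_t nCols)
{
    std::vector<std::vector<double>> vars(nCols);
    for (std::size_t r = 0; r < nRows; ++r)
        for (std::size_t c = 0; c < nCols; ++c)
            vars[c].push_back(mat[r * nCols + c]); // column c = variable c
    return vars;
}
```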

        SCSIZE nR[iVarNr], nC[iVarNr];

This actually doesn't work. Automatically allocating variable-length
arrays on the stack is a GCC extension and doesn't work with all
compilers. Instead it should be ... = new xxx[];

Whenever I wrote programs, I preferred to use vectors (or list objects): NO need to manually destroy objects, NO ugly pointer arithmetic, NO memory allocation issues. All handled cleanly. And you can grow or shrink them dynamically and do NOT have to keep track of the changes.

So I have virtually never written code using the 'new' operator. I hope that somebody experienced adjusts the code accordingly.
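As a sketch of that preference (assuming SCSIZE is some unsigned integer type; the function name is my own illustration), the variable-length arrays could become:

```cpp
#include <cstddef>
#include <vector>

typedef std::size_t SCSIZE; // assumption: SCSIZE is an unsigned integer type

// Instead of the non-portable VLA "SCSIZE nR[iVarNr], nC[iVarNr];" or the
// manual "new SCSIZE[iVarNr]", a std::vector is sized at run time, lives on
// the heap, and releases its memory automatically when it goes out of scope.
std::vector<SCSIZE> makeCounts(std::size_t iVarNr)
{
    return std::vector<SCSIZE>(iVarNr, 0); // one zeroed entry per variable
}
```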

        double fValX[iVarNr] [nMAX]; // THE VALUES

I also don't see why we need  fValX[iVarNr][nMAX]  elements, where maybe
most of it will be unused if only one matrix has nMAX elements. I think
this can be much improved by *storing just the needed elements*.

Theoretically, YES. With vectors this is no problem: just grow them dynamically. With plain arrays I have NO idea how to do that, because we only know how many values have to be stored after iterating through the matrix elements. I do have an alternative solution, BUT: *how great is the cost of iterating twice through the range* versus storing the data values during the first iteration? I have NO idea; I initially presumed it would be very costly, but I might have been wrong.
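A sketch of the vector-based variant (names are my own): each variable keeps its own inner vector grown via push_back, so the total storage equals the number of values actually present rather than iVarNr * nMAX.

```cpp
#include <cstddef>
#include <vector>

// Jagged storage for "just the needed elements": one inner vector per
// variable. totalStored() shows that the memory used is the sum of the
// per-variable counts, not the worst case iVarNr * nMAX.
std::size_t totalStored(const std::vector<std::vector<double>>& fValX)
{
    std::size_t n = 0;
    for (const std::vector<double>& var : fValX)
        n += var.size();
    return n;
}
```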

This alternative would be: (in pseudocode)
- iterate through all ranges (these represent different variables)
-- iterate through the elements of one range (these all belong to one variable)
      // NEEDED to detect IF it is TRUE element
      // ALSO permits calculating
      --- calculate sum of elements;
      --- determine number of data values;
 -- END INNER LOOP
 -- mean[ith variable] = SUM / No of elements;
 -- No elements[ith variable] = No of elements;
 -- GrandMEAN += SUM;
 -- TotalNoOfElements += No of elements;
- END OUTER LOOP
- GrandMEAN = GrandMEAN / TotalNoOfElements; // UP TO HERE GrandMEAN held only the accumulated sum

// THE 2nd ITERATION FOLLOWS
- iterate through all ranges (these represent different variables)
-- iterate through the elements of one range (these all belong to one variable)
      // NEEDED AGAIN to detect IF it is TRUE element
      --- calculate sum of residuals[i] += (Xi - mean[i]) * (Xi - mean[i]);
 -- END INNER LOOP
 -- fMSB += No elements[i] * (mean[i] - GrandMEAN) * (mean[i] - GrandMEAN)
- END OUTER LOOP

This code is even simpler and consumes far less memory, BUT as I said, I have NO idea how much slower it would be. [I believe it is slower.]

If you have an idea of how this code would fare, please tell me; I am interested, too.

I'll reply to Niklas tomorrow, it is quite late now.

Sincerely,

Leonard Mada

