Welcome to OpenOffice.

It is really helpful if you have already worked with R. If you also have some experience in building a connection to R, things are even better.

Unfortunately, I cannot offer much coding advice, but if you have any questions concerning statistics, feel free to ask.

In the following paragraphs, I will try to address some issues that I consider important when building a connection to R.

1. Calc should pass some data towards R
 1.1 one group of data vs multiple data matrices (vectors)
 1.2 Non-Available (NA) values
 1.3 Column Labels, Data Frames, Factors
 1.4 Other Parameters
2. Importing the results back into Calc
 2.1 simple result (e.g. a p-value) vs R-object (e.g. a regression object, bootstrap object)
 2.2 Graphics
3. Which R-features should be implemented first
 3.1 Non-parametric tests
 3.2 ANOVA (?)
 3.3 Regression: multivariate (lm), glm (especially logistic regression), non-linear
 3.4 bootstrap and other resampling procedures

1. Passing Data from Calc to R
=====================

1.1 The first thing to do is to devise a method to read the data inside Calc and pass it to R. This could be accomplished using a dialog box that is accessed through a menu item.

- sometimes the data consist of a single group (a single matrix, or a vector in R terminology)
- more often we have multiple data-groups (2 or more vectors)

IF we enter multiple groups of data, there are 2 distinct ways to access them:
 a.) fast and easy: IF every vector is limited to a single spreadsheet column AND the columns are contiguous, we can enter the merged data matrix (e.g. A1:D100 for the 4 distinct groups A1:A100, B1:B100, C1:C100 and D1:D100);
 b.) more cumbersome: IF the previous method is not applicable (non-contiguous columns, or more than one column per group), we need to enter every data group (vector) separately.
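To make method a.) a bit more concrete, here is a minimal sketch in Python (one of the scripting languages available in OpenOffice) of how a dialog could split a merged range into one range per column. The function names are my own invention, not an existing Calc API, and the sketch only handles simple ranges such as A1:D100 (no sheet names, no absolute "$" references).

```python
import re
from typing import List

# Matches simple ranges like "A1:D100" (hypothetical helper, not a Calc API).
RANGE_RE = re.compile(r"^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$")


def column_to_index(col: str) -> int:
    """Convert a column label ('A', 'Z', 'AA') to a 1-based index."""
    index = 0
    for ch in col.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def index_to_column(index: int) -> str:
    """Convert a 1-based column index back to its label."""
    label = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def split_range_by_column(cell_range: str) -> List[str]:
    """Split a contiguous range into one single-column range per group."""
    match = RANGE_RE.match(cell_range.strip())
    if match is None:
        raise ValueError(f"not a simple cell range: {cell_range!r}")
    col1, row1, col2, row2 = match.groups()
    first, last = sorted((column_to_index(col1), column_to_index(col2)))
    top, bottom = sorted((int(row1), int(row2)))
    return [
        f"{index_to_column(i)}{top}:{index_to_column(i)}{bottom}"
        for i in range(first, last + 1)
    ]


print(split_range_by_column("A1:D100"))
```

With method b.) the dialog would simply collect such single-group ranges one at a time instead of deriving them from a merged range.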

COMPARATIVE ANALYSIS: Gnumeric allows both of the previous methods (e.g. for the ANOVA test)

1.2 - 1.3 SPECIAL SITUATIONS
Unfortunately, spreadsheet applications have some serious limitations when it comes to statistical analysis.

1.2 MISSING VALUES
Calc does NOT offer a method to specify missing values. Therefore a researcher may enter a "0" for a missing value, may leave the cell empty or may enter a text string ("NA", or "-", or something else).

The parsing routine should therefore allow such entries to be converted into R's native "NA". Various statistics may be invalidated IF missing values are passed on as ordinary data (e.g. as values of 0).
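As a rough illustration of such a parsing routine, the following Python sketch maps empty cells and user-chosen text markers to a missing value and writes the result as an R vector expression. The function names, the default marker set and the zero_is_missing option are assumptions of mine; a real dialog would let the researcher choose them.

```python
# Markers treated as missing by default (an assumption; should be user-configurable).
DEFAULT_MISSING_TOKENS = frozenset({"", "NA", "-"})


def normalize_missing(values, missing_tokens=DEFAULT_MISSING_TOKENS,
                      zero_is_missing=False):
    """Return the cell values as floats, with missing entries as None."""
    cleaned = []
    for value in values:
        if value is None:
            cleaned.append(None)          # empty cell
        elif isinstance(value, str):
            text = value.strip()
            if text in missing_tokens:
                cleaned.append(None)      # text marker such as "NA" or "-"
            else:
                cleaned.append(float(text))  # raises ValueError on other text
        elif zero_is_missing and value == 0:
            cleaned.append(None)          # only if the user coded missing as 0
        else:
            cleaned.append(float(value))
    return cleaned


def to_r_vector(values):
    """Write the cleaned values as an R vector expression, using NA for None."""
    return "c(" + ", ".join("NA" if v is None else repr(v) for v in values) + ")"


cells = [4.1, "", "NA", "-", 0, " 2.5 "]
print(to_r_vector(normalize_missing(cells)))
print(to_r_vector(normalize_missing(cells, zero_is_missing=True)))
```

Treating 0 as missing is kept optional on purpose, because 0 is often a perfectly valid measurement.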

1.3. DATA FRAMES, FACTORS
Various statistical techniques need the data to be defined as a FACTOR (e.g. when performing an ANOVA analysis in R). Therefore, the input dialog should allow defining some of the data vectors as factors.

The data should probably be passed to R as a data FRAME for multiple-group data. For tests that work on contingency tables (e.g. Fisher's exact test), the input routine needs to pass the data as a MATRIX.
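One possible way to hand a data frame with factors to R is to generate the R expression as text. The Python sketch below does this under the assumption that the column names are already valid R names; the helper names are hypothetical, while data.frame(), factor() and c() are the standard R functions. The matrix case for contingency tables is not covered here.

```python
def r_literal(value):
    """Write one cell value as an R literal (None becomes NA)."""
    if value is None:
        return "NA"
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return repr(float(value))


def r_vector(values):
    """Write a list of cell values as an R c(...) expression."""
    return "c(" + ", ".join(r_literal(v) for v in values) + ")"


def r_data_frame(columns, factors=()):
    """Build a data.frame(...) expression; columns named in `factors`
    are wrapped in factor(). Column names are assumed to be valid in R."""
    parts = []
    for name, values in columns.items():
        vector = r_vector(values)
        if name in factors:
            vector = f"factor({vector})"
        parts.append(f"{name} = {vector}")
    return "data.frame(" + ", ".join(parts) + ")"


columns = {"yield": [4.2, 5.1, None], "group": ["a", "b", "a"]}
print(r_data_frame(columns, factors={"group"}))
```

In the dialog, the set passed as factors would come from the columns the user has marked as factors.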

IF passing the data as a frame, the input dialog should probably use the 'Column Labels' (IF defined) as the variable names (IF these are valid names in R).
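Column labels typed into a spreadsheet are often not valid R names (spaces, leading digits, reserved words). The Python sketch below shows one way the dialog could check and repair them. It is loosely modelled on what R's make.names() does, but it is not an exact reimplementation: it only accepts ASCII letters, and the fallback names V1, V2, ... for empty labels are my own choice.

```python
import re

# Reserved words of the R language, which cannot be used as variable names.
R_RESERVED = {
    "if", "else", "repeat", "while", "function", "for", "next", "break", "in",
    "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA",
    "NA_integer_", "NA_real_", "NA_complex_", "NA_character_",
}

# A name starts with a letter, or with a dot that is not followed by a digit.
VALID_START = re.compile(r"[A-Za-z]|\.(?![0-9])")
VALID_R_NAME = re.compile(r"^(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*$")


def is_valid_r_name(name):
    """Conservative (ASCII-only) check for a syntactically valid R name."""
    return bool(VALID_R_NAME.match(name)) and name not in R_RESERVED


def sanitize_label(label, fallback):
    """Turn a column label into a valid R name, or use the fallback if empty."""
    name = re.sub(r"[^A-Za-z0-9._]", ".", label.strip())
    if not name:
        return fallback
    if not VALID_START.match(name):
        name = "X" + name
    if name in R_RESERVED:
        name += "."
    return name


def variable_names(labels):
    """Sanitize all labels; repeated names get a numeric suffix (.1, .2, ...).
    This sketch does not guard against a label that already carries
    such a suffix."""
    names, seen = [], {}
    for position, label in enumerate(labels, start=1):
        name = sanitize_label(label or "", fallback=f"V{position}")
        count = seen.get(name, 0)
        seen[name] = count + 1
        names.append(name if count == 0 else f"{name}.{count}")
    return names


print(variable_names(["Age", "Blood Pressure", "2nd dose", "", "if", "Age"]))
```

The dialog could show the repaired names next to the original labels, so that the researcher sees which variable names will appear in the R output.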

1.4. OTHER PARAMETERS
Depending on the test/ R-method desired, various other parameters may be needed, e.g. for a regression one needs to specify the regression formula [e.g. lm(y ~ x1 + x2 + x3)].
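For the regression example, the dialog could assemble the formula from the variables the user selects. A small Python sketch follows; the function names and the data frame name calc_data are placeholders of mine, while lm() with its formula and data arguments is the standard R function.

```python
def build_formula(response, predictors):
    """Build an R model formula such as 'y ~ x1 + x2 + x3'.
    With no predictors, the intercept-only formula 'y ~ 1' is returned."""
    rhs = " + ".join(predictors) if predictors else "1"
    return f"{response} ~ {rhs}"


def build_lm_call(response, predictors, data_name="calc_data"):
    """Build the text of an lm() call on a data frame named `data_name`
    (a placeholder name for the data passed over from Calc)."""
    return f"lm({build_formula(response, predictors)}, data = {data_name})"


print(build_formula("y", ["x1", "x2", "x3"]))
print(build_lm_call("y", ["x1", "x2", "x3"]))
```

More complex formulas (interactions, transformations) would probably be easier to support through a free-text field in which the user types the formula directly.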

These were some thoughts on the input dialog. IF there are further issues, feel free to ask.

I will address the remaining points in a later e-mail. Unfortunately, I am always short of time, but I will try to help as much as I can. I have discussed some of the statistical issues on the wiki-page (http://wiki.services.openoffice.org/wiki/Statistical_Data_Analysis_Tool). Feel free to read that information, too. (Further information on graphical methods can be found at http://wiki.services.openoffice.org/wiki/Chart2#Chart_Types).

Kind regards,

Leonard Mada


Wojciech Gryc wrote:
To Whom It May Concern:

Hi, I'm a student studying at the University of Toronto, in Canada, and am
writing to follow up with the Google Summer of Code (SoC) ideas posted on
the OpenOffice.org wiki. I recently heard about the program for 2007, and
was extremely excited to see the wiki feature a discussion on building a
connection between Calc and R.

To provide a brief background, I am studying Applied Mathematics and
International Development Studies, and am interested in applying math,
statistics, and computational tools to the social sciences. As such, I've
been doing a great deal of work with statistics and am very interested in
trying to implement this connection between R and Calc. I've been using R
for about two years now, and am very familiar with OpenOffice. I'm involved
in other statistical programming projects (lead coder for
www.egotistics.net, a social network analysis tool, though please note this
is in beta). I also have experience with SAS and SPSS, and have taken a
number of statistics courses (probability, intro to stats, linear
regressions, quantitative research courses, etc.). As such, I would be happy
to tackle the challenge of integrating Calc and R during this upcoming
summer.

With this in mind, I had a few questions about the potential project:

  1. Is there anything that a potential applicant needs to discuss with
     you (or others at OpenOffice.org) about such a project?
  2. Is the R / Calc connection a priority for OpenOffice? (Personally,
     I'd love to see such a connection, and I think many other people
     would find it useful as well.)
  3. The wiki had very few details about what you seek in an applicant
     and actual application, so I was wondering if you could provide more
     information about the statistical or computer-based skills one is
     expected to have.

If you have any questions as well, please do let me know. I look forward to
discussing the potential project, and am very excited. Feel free to e-mail
me here or call me at +1-416-897-1344.

Thank you,
Wojciech Gryc

