Welcome to OpenOffice.
It really helps if you have already worked with R. If you also have some
experience in building a connection to R, so much the better.
Unfortunately, I cannot offer much coding advice, but if you have any
questions concerning statistics, feel free to ask.
In the following paragraphs, I will try to address some issues I consider
important when building a connection to R.
1. Calc should pass some data towards R
1.1 one group of data vs multiple data matrices (vectors)
1.2 Non-Available (NA) values
1.3 Column Labels, Data Frames, Factors
1.4 Other Parameters
2. Importing the results back into Calc
2.1 simple result (e.g. a p-value) vs R-object (e.g. a regression
object, bootstrap object)
2.2 Graphics
3. Which R-features should be implemented first
3.1 Non-parametric tests
3.2 ANOVA (?)
3.3 Regression: multivariate (lm), glm (especially logistic
regression), non-linear
3.4 bootstrap and other resampling procedures
1. Passing Data from Calc to R
=====================
1.1 The first thing to do is to devise a method to read the data inside
Calc and pass it to R. This could be accomplished using a dialog box
that is accessed through a menu item.
- sometimes the data consists of a single group (a single matrix, or
vector in R terminology)
- more often we have multiple data groups (2 or more vectors)
IF we enter multiple groups of data, there are 2 distinct ways to access
them:
a.) fast and easy to access: IF every vector is limited to a spreadsheet
column AND the columns are contiguous, we can enter the merged data
matrix (e.g. A1:D100 for the 4 distinct groups A1:A100, B1:B100, C1:C100
and D1:D100);
b.) IF the previous method is not applicable (non-contiguous columns, or
more than one column per group), we need to enter every data group
(vector) separately, which is more cumbersome
COMPARATIVE ANALYSIS: gnumeric allows both of the previous methods (e.g.
for the ANOVA test)
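
As a rough illustration of method (a), here is a minimal sketch of how a
merged range such as A1:D100 could be split into one range per column
before each column is read as a separate vector. It is written in Python
purely for illustration. The function names are my own assumptions and
not part of any Calc or R API, and only simple ranges without sheet names
or "$" markers are handled.

```python
from __future__ import annotations

import re

# Matches simple ranges such as "A1:D100" (no sheet names, no "$" markers).
_RANGE = re.compile(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)")


def column_index(letters: str) -> int:
    """Convert column letters to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def column_letters(index: int) -> str:
    """Convert a 1-based column index back to letters (27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def split_into_columns(cell_range: str) -> list[str]:
    """Split a contiguous range into one single-column range per group."""
    match = _RANGE.fullmatch(cell_range.strip())
    if match is None:
        raise ValueError(f"not a simple cell range: {cell_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    start, end = sorted((column_index(first_col), column_index(last_col)))
    top, bottom = sorted((int(first_row), int(last_row)))
    return [
        f"{column_letters(col)}{top}:{column_letters(col)}{bottom}"
        for col in range(start, end + 1)
    ]


print(split_into_columns("A1:D100"))
```

For method (b), the dialog would simply collect a list of such ranges
from the user instead of deriving them from one merged range.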
1.2 - 1.3 SPECIAL SITUATIONS
Unfortunately, spreadsheet applications have some serious limitations
when it comes to statistical analysis.
1.2 MISSING VALUES
Calc does NOT offer a method to specify missing values. Therefore a
researcher may enter a "0" for a missing value, may leave the cell empty
or may enter a text string ("NA", or "-", or something else).
The parsing routine should therefore allow such entries to be converted
into R's native "NA". Various statistics may be invalidated IF missing
values are entered as a different data type (e.g. a 0 entered for a
missing value is counted as a real observation and shifts the mean).
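
To make the idea more concrete, here is a small sketch of such a
conversion step, again in Python for illustration only. The default set
of markers ("", "NA", "-") and the zero_is_missing switch are assumptions
of mine; in practice the user would choose them in the input dialog.
Missing entries become None, which is then written out as NA when the
vector is rendered as R code.

```python
# Text entries treated as missing by default (an assumption for this sketch).
DEFAULT_MARKERS = frozenset({"", "NA", "-"})


def normalize_missing(cells, markers=DEFAULT_MARKERS, zero_is_missing=False):
    """Return a copy of `cells` with missing-value markers replaced by None."""
    result = []
    for cell in cells:
        if cell is None:
            result.append(None)
        elif isinstance(cell, str):
            text = cell.strip()
            # Empty cells and user-defined marker strings count as missing.
            result.append(None if text in markers else text)
        elif zero_is_missing and cell == 0:
            # Only applied when the user states that 0 encodes "missing".
            result.append(None)
        else:
            result.append(cell)
    return result


def r_numeric_vector(values):
    """Render numeric values as R code, writing None as NA.

    Raises ValueError if a non-numeric text entry is still present.
    """
    return "c(" + ", ".join(
        "NA" if value is None else repr(float(value)) for value in values
    ) + ")"


cells = [1.5, "", "NA", " - ", 0, 2]
print(r_numeric_vector(normalize_missing(cells)))
```

Keeping zero_is_missing off by default is deliberate: a genuine
measurement of 0 should never be turned into NA without the user asking
for it.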
1.3. DATA FRAMES, FACTORS
Various statistical techniques need the data to be defined as a FACTOR
(e.g. when performing an ANOVA analysis in R). Therefore, the input
dialog should allow defining some of the data vectors as factors.
The data should probably be passed to R as a data FRAME for
multiple-group data. For tests that work on contingency tables (e.g.
Fisher's exact test), the input routine needs to pass the data as a MATRIX.
IF passing the data as a frame, the input dialog should probably use the
'Column Labels' (IF defined) as the variable names (IF these are valid
names in R).
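
A possible check for the column labels is sketched below, in Python for
illustration only. It approximates R's rule for syntactic names (letters,
digits, "." and "_", starting with a letter or with a "." that is not
followed by a digit, and not a reserved word) using ASCII letters only,
whereas R itself also accepts letters from the current locale. The
to_r_name helper is loosely modelled on what R's make.names does, but it
is my own assumption and not a faithful reimplementation.

```python
import re

# Reserved words that cannot be used as variable names in R.
R_RESERVED = frozenset({
    "if", "else", "repeat", "while", "function", "for", "next", "break",
    "in", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA",
    "NA_integer_", "NA_real_", "NA_character_", "NA_complex_", "...",
})

# ASCII-only approximation of a syntactic R name.
_VALID = re.compile(r"(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*")


def is_valid_r_name(label: str) -> bool:
    """True if the column label can be used unchanged as an R variable name."""
    return _VALID.fullmatch(label) is not None and label not in R_RESERVED


def to_r_name(label: str) -> str:
    """Turn an arbitrary column label into a usable R name (hypothetical helper)."""
    name = re.sub(r"[^A-Za-z0-9._]", ".", label.strip())
    if _VALID.fullmatch(name) is None:
        name = "X" + name  # e.g. labels that are empty or start with a digit
    if name in R_RESERVED:
        name += "."
    return name


for label in ["weight", "2nd group", "if"]:
    print(label, is_valid_r_name(label), to_r_name(label))
```

The dialog could show the converted name next to the original label, so
the user sees which variable name will appear in the R output.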
1.4. OTHER PARAMETERS
Depending on the test/ R-method desired, various other parameters may be
needed, e.g. for a regression one needs to specify the regression
formula [e.g. lm(y ~ x1 + x2 + x3)].
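
For the regression case, the dialog could assemble the formula from the
variables the user selects. The sketch below is Python for illustration
only. The function names and the data frame name "calc_data" are
hypothetical, while lm() with a formula and a data argument is the usual
R call.

```python
def build_formula(response, predictors):
    """Build an R model formula such as 'y ~ x1 + x2 + x3'."""
    if not predictors:
        # In R, 'y ~ 1' denotes an intercept-only model.
        return f"{response} ~ 1"
    return f"{response} ~ " + " + ".join(predictors)


def build_lm_call(response, predictors, data_name="calc_data"):
    """Build the text of an lm() call; 'calc_data' is a placeholder name."""
    return f"lm({build_formula(response, predictors)}, data = {data_name})"


print(build_lm_call("y", ["x1", "x2", "x3"]))
```

The variable names passed in here are assumed to be valid R names
already.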
These were some thoughts on the input dialog. IF there are further
issues, feel free to ask.
I will address the remaining points in a later e-mail. Unfortunately, I
am always short of time, but I will try to help as much as I can. I have
discussed some of the statistical issues on the wiki-page
(http://wiki.services.openoffice.org/wiki/Statistical_Data_Analysis_Tool).
Feel free to read that information, too. (Further information on
graphical methods can be found at
http://wiki.services.openoffice.org/wiki/Chart2#Chart_Types).
Kind regards,
Leonard Mada
Wojciech Gryc wrote:
To Whom It May Concern:
Hi, I'm a student studying at the University of Toronto, in Canada, and am
writing to follow up with the Google Summer of Code (SoC) ideas posted on
the OpenOffice.org wiki. I recently heard about the program for 2007, and
was extremely excited to see the wiki feature a discussion on building a
connection between Calc and R.
To provide a brief background, I am studying Applied Mathematics and
International Development Studies, and am interested in applying math,
statistics, and computational tools to the social sciences. As such, I've
been doing a great deal of work with statistics and am very interested in
trying to implement this connection between R and Calc. I've been using R
for about two years now, and am very familiar with OpenOffice. I'm involved
in other statistical programming projects (lead coder for
www.egotistics.net, a social network analysis tool, though please note
this is in beta). I also have experience with SAS and SPSS, and have taken
a number of statistics courses (probability, intro to stats, linear
regressions, quantitative research courses, etc.). As such, I would be
happy to tackle the challenge of integrating Calc and R during this
upcoming summer.
With this in mind, I had a few questions about the potential project:
1. Is there anything that a potential applicant needs to discuss with
you (or others at OpenOffice.org) about such a project?
2. Is the R / Calc connection a priority for OpenOffice? (Personally,
I'd love to see such a connection, and I think many other people would
find it useful as well.)
3. The wiki had very few details about what you seek in an applicant
and actual application, so I was wondering if you could provide more
information about the statistical or computer-based skills one is
expected to have.
If you have any questions as well, please do let me know. I look forward
to discussing the potential project, and am very excited. Feel free to
e-mail me here or call me at +1-416-897-1344.
Thank you,
Wojciech Gryc