Hi Leonard,
Thanks very much for your thoughts. First, if you want to try the GUI
package, it's all up now on the wiki
(http://wiki.services.openoffice.org/wiki/R_and_Calc).
You raise a lot of good points for missing values. Right now, I only deal
with empty cells -- before a value is retrieved from a Calc cell, the
software checks to see if there is anything written in the cell. If not,
"NA" is inserted, so we avoid the problem of zeros.
That being said, I didn't account for other symbols. I'm looking into
whether these settings could be stored in an external file, so that a user
can choose which values are treated as missing. Then again, would it be
safe to say that any cell with a non-numeric value is a missing value?
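To make the idea concrete, here is a rough base-R sketch of a configurable
conversion. It is only an illustration, not what the package currently does:
the function name cells_to_numeric and the default marker list are
assumptions, and the markers could just as well be read from a settings file.
Non-numeric text that is not a listed marker still becomes NA, but with a
warning, so the user can decide whether that was intended.

```r
# Hypothetical helper (not part of the package): convert raw cell text to
# numbers, treating a user-configurable set of markers as missing values.
cells_to_numeric <- function(cells, missing_markers = c("", "NA", "-", "--")) {
  cells <- trimws(as.character(cells))   # a cell holding only spaces counts as empty
  is_missing <- is.na(cells) | cells %in% missing_markers
  values <- rep(NA_real_, length(cells))
  values[!is_missing] <- suppressWarnings(as.numeric(cells[!is_missing]))
  # Text that is neither numeric nor a listed marker also ends up as NA;
  # report it instead of dropping it silently.
  unexpected <- !is_missing & is.na(values)
  if (any(unexpected)) {
    warning("non-numeric cells treated as NA: ",
            paste(unique(cells[unexpected]), collapse = ", "))
  }
  values
}

raw <- c("32", "", "24", "-", "0", " ", "n/a")
result <- suppressWarnings(cells_to_numeric(raw))
```

Note that "0" is deliberately left out of the default markers, so a genuine
zero survives the conversion.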
In terms of column lengths, the package (for now) requires a length to be
set explicitly -- you would pass A1:A20 rather than the entire column in A.
I'll keep you updated on where this goes.
Thanks,
Wojciech
On 5/22/07, Leonard Mada <[EMAIL PROTECTED]> wrote:
Hi,
I will look into the GUI-scripting more extensively tomorrow. Until
then, I have some comments on 'missing values'.
Missing values are a major problem in statistical analysis. Because this
topic has such great importance, I will discuss it in greater detail.
Researchers represent missing values in many different ways.
Unfortunately, spreadsheet applications do NOT have a dedicated data type
for missing values, which hampers any standardised representation.
The most commonly used representations for missing values are: 'an empty
cell' (or sometimes, erroneously, a space), 'NA', '-', '--', '0'. This
list is far from exhaustive.
Therefore, any serious statistical analysis starts with marking such
values as "missing values". Fortunately, R comes with some handy
functions which allow automatic conversion of these values.
  1. R-packages
  2. Calc issues
1. R-PACKAGES
=============
Especially useful is package 'gdata'. The latest R-newsletter (Volume
7/1, April 2007, see
http://cran.r-project.org/doc/Rnews/Rnews_2007-1.pdf) describes this
package (Working with Unknown Values, p 24).
Two of its functions are particularly useful; I describe them briefly
below:
  * isUnknown( data_vector, missing_values_vector )
  * unknownToNA( data_vector, missing_values_vector )
Suppose we have the following vector:
  data <- c(0,32,24,35,36,42,37,45,55,39,49,NA,"-")
Here the values "0", "NA" and "-" are probably missing values. To
get rid of them, we may use the functions 'isUnknown' and 'unknownToNA'
from package 'gdata':
  isUnknown(x = data, unknown = c(0, NA, "-"))
returns TRUE for the 3 missing values.
  data.corrected <- unknownToNA(x = data, unknown = c(0, NA, "-"))
sets the missing values to NA (NOTE: it is not necessary to explicitly
specify 'NA').
[Because the 'data'-vector contained strings, we may wish to convert
data.corrected to a numeric vector using 'data.numeric <-
as.numeric(data.corrected)'.]
> data.numeric
[1] NA 32 24 35 36 42 37 45 55 39 49 NA NA
These functions are most useful with lists and data frames.
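For readers without 'gdata' installed, the same idea can be sketched for a
data frame in base R. This is not gdata's implementation, only an
approximation of the effect; the data frame and the marker list below are
made up for illustration.

```r
# Made-up data frame in which every column arrived as text.
df <- data.frame(
  weight = c("0", "32", "24", "-"),
  height = c("170", "NA", "--", "165"),
  stringsAsFactors = FALSE
)

# Assumed markers; include "0" only when a zero cannot be a real measurement.
markers <- c("0", "-", "--", "NA")

# Replace the markers column by column, then convert each column to numeric.
clean <- as.data.frame(lapply(df, function(col) {
  col[col %in% markers] <- NA
  as.numeric(col)
}))
```

Each column of 'clean' is numeric, with NA wherever a marker was found.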
2. CALC ISSUES
==============
Empty Calc cells are a possible problem. The Calc-R-parsing routine may
interpret them as zeroes. However, '0' may also be a valid value, so we
cannot simply replace every '0' with 'NA' afterwards using the function
unknownToNA().
Therefore, empty cells should be handled by the parsing routine and
replaced by 'NA' when pasting the data into R.
There are 2 possible problems with this automatic parsing (not directly
related to the parsing itself):
  1. empty cells at the end of a column (trailing empty cells): these
     might be just empty cells AND NOT values at all (not even missing
     values)
  2. some statistical methods need equal length for the data groups
1. trailing empty cells
   -- we may wish to import a range spanning more than one column
   into R (e.g. a data frame); yet the columns may differ in length,
   leaving a different number of empty cells at the end of each
   column. Such empty cells might be best dropped from the data set
   and NOT considered 'NA'-values. The data-importing mechanism
   should therefore have an option: 'ignore trailing empty cells'.
2. equal data-group length
   -- sometimes we need equal length for two (or more) data vectors.
   Therefore, we may want an option to append missing values to the
   end of the various vectors up to the length of the longest vector.
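Both options can be sketched in a few lines of base R. The helper names
drop_trailing_na and pad_to_equal_length are hypothetical, and the sketch
assumes empty cells have already been imported as NA.

```r
# Drop only the trailing NAs of a vector; NAs in the middle are kept.
drop_trailing_na <- function(x) {
  keep <- which(!is.na(x))
  if (length(keep) == 0) return(x[0])   # nothing but NAs: return an empty vector
  x[seq_len(max(keep))]
}

# Pad every vector with NA up to the length of the longest one.
pad_to_equal_length <- function(vectors) {
  n <- max(lengths(vectors))
  lapply(vectors, function(v) {
    length(v) <- n                      # extending a vector fills it with NA
    v
  })
}

a <- drop_trailing_na(c(5, NA, 7, NA, NA))   # keeps 5, NA, 7
b <- drop_trailing_na(c(1, 2, 3, 4, NA))     # keeps 1, 2, 3, 4
padded <- pad_to_equal_length(list(a = a, b = b))
```

After padding, both vectors have length 4, and 'a' ends with a single NA.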
Sincerely,
Leonard
Wojciech Gryc wrote:
> Hi everyone,
>
> This is just a general update about the Calc/R integration project.
> Over the weekend I implemented a pretty neat feature (though I'm
> biased) which allows people to create dialog boxes in Calc through an
> external file rather than having it coded. The reason I did this is
> because I created a rudimentary system where a person can actually
> script their own user interface that will ask for specific inputs and
> provide custom outputs from R within a spreadsheet.
>
> I won't go into details here, but check this post for more information:
> http://www.11-55.org/ooblog/?p=21
>
> You can also see a sample script file here:
> http://www.11-55.org/ooblog/wp-content/uploads/2007/05/correl-may-21.txt
>
> Any comments would be appreciated.
>
> Thanks,
> Wojciech
>
--
Five Minutes to Midnight:
Youth on human rights and current affairs
http://www.fiveminutestomidnight.org/