Re: [R] Problem with parsing a dataset - help earnestly sought

2010-04-23 Thread Jan van der Laan
Something like this?

# Remove everything after ; to give the status
status- sub(';.*$', '', data$cancer.problems)

# Remove everything before the last ; to give tissue
# In case a no ; in the string this goes wrong; correct
tissue- sub('^.*;[ \n]*', '', data$cancer.problems)
tissue[! grepl(';', data$cancer.problems)] - ''

# Select the part between ;'s to give age
indices   - regexpr(';.*;', data$cancer.problems)
lengths   - attr(indices, match.length)
age   - rep(NA, length(data$cancer.problems))
age[indices0] - substring(data$cancer.problems[indices0],
indices[indices0]+1, indices[indices0]+lengths[indices0]-2)


Hope it helps.
Regards,
Jan




On Fri, Apr 23, 2010 at 6:24 AM, Min-Han Tan minhan.scie...@gmail.com wrote:
 Dear fellow R-help members,

 I hope to seek your advice on how to parse/manage a dataset with hundreds of
 columns. Two examples of these columns, 'cancer.problems', and
 'neuro.problems' are depicted below. Essentially, I need to parse this into
 a useful dataset, and unfortunately, I am not familiar with perl or any such
 language.

 data - data.frame(id=c(1:10))
 data$cancer.problems - c(Y; DX AGE: 28; COLON, , Y; DX AGE: 27;, Y;
 LIVER,,Y,Y; DX AGE: 24;,Y,Y;DX AGE: 44;,Y;DX AGE: 39; TESTIS)
 data$neuro.problems - c(Y: DX AGE: 80-89;,Y,,Y; DX AGE: 74;
 STROKE,Y; DEMENTIA,Y,,Y; DX AGE: 33; CHOREA, Y, Y; WEAKNESS)

 As can be seen, the semi-colon delimiter follows its own set of rules, which
 are internally consistent - with all 3 elements of data, it should be
 Status; Age; Tissue Type. However, if there is only tissue type, it is
 Status; Tissue Type, without the trailing semi-colon. However, if there is
 Age available, it is Status; Age;.

 The main challenge for me is how to parse/convert this dataset into a useful
 and consistent data.frame, or list, where I can capture Status, Age and
 Tissue Type as separate fields. Due to the varying application of the
 delimiter, I cannot use strsplit consistently. I have tried a convoluted
 method by identifying AGE as the character string identifying 3 element
 fields per below, but faced problems with unlist, given the empty fields.

 age.present - grepl(AGE,data[,2])
 data.3column - strsplit(data[age.present,2],;)
 data.2column - strsplit(data[!age.present,2],;)
 data$cancer.status[age.present] - unlist(data.3column)
 [(1:sum(age.present)*3)-2]
 ...

 Your advice is earnestly sought.

 Thanks.

 Min-Han

        [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with parsing a dataset - help earnestly sought

2010-04-23 Thread Gabor Grothendieck
We can use strapply in the gsubfn package. It extracts
fields matching regular expressions.

strapply extracts the parenthesized part of the regular
expression (or the entire regular expression if nothing
parenthesized), applies the function to it and returns
the result.  See http://gsubfn.googlecode.com

This works with the rules stated below and works on your
example but the general rules may only be apparent with
more data in which case you may need to make appropriate
adjustments.

Note that the regular expressions:

\w refers to a word character and must be written \\w when within in quotes.
+ means one or more occurrences in a row
$ means end of string


library(gsubfn)

NULL2NA - function(x) if (is.null(x)) NA else x

extract - function(x) {

# age is word that comes after DX AGE:
age - strapply(x, DX AGE: (\\w+), c)
age - sapply(age, null2NA)

# tissue is 2 or more word characters at end
tissue - strapply(x, \\w\\w+$, c)
tissue - sapply(tissue, null2NA)

data.frame(age, tissue)
}

extract(data[,2])
extract(data[,3])


On Fri, Apr 23, 2010 at 12:24 AM, Min-Han Tan minhan.scie...@gmail.com wrote:
 Dear fellow R-help members,

 I hope to seek your advice on how to parse/manage a dataset with hundreds of
 columns. Two examples of these columns, 'cancer.problems', and
 'neuro.problems' are depicted below. Essentially, I need to parse this into
 a useful dataset, and unfortunately, I am not familiar with perl or any such
 language.

 data - data.frame(id=c(1:10))
 data$cancer.problems - c(Y; DX AGE: 28; COLON, , Y; DX AGE: 27;, Y;
 LIVER,,Y,Y; DX AGE: 24;,Y,Y;DX AGE: 44;,Y;DX AGE: 39; TESTIS)
 data$neuro.problems - c(Y: DX AGE: 80-89;,Y,,Y; DX AGE: 74;
 STROKE,Y; DEMENTIA,Y,,Y; DX AGE: 33; CHOREA, Y, Y; WEAKNESS)

 As can be seen, the semi-colon delimiter follows its own set of rules, which
 are internally consistent - with all 3 elements of data, it should be
 Status; Age; Tissue Type. However, if there is only tissue type, it is
 Status; Tissue Type, without the trailing semi-colon. However, if there is
 Age available, it is Status; Age;.

 The main challenge for me is how to parse/convert this dataset into a useful
 and consistent data.frame, or list, where I can capture Status, Age and
 Tissue Type as separate fields. Due to the varying application of the
 delimiter, I cannot use strsplit consistently. I have tried a convoluted
 method by identifying AGE as the character string identifying 3 element
 fields per below, but faced problems with unlist, given the empty fields.

 age.present - grepl(AGE,data[,2])
 data.3column - strsplit(data[age.present,2],;)
 data.2column - strsplit(data[!age.present,2],;)
 data$cancer.status[age.present] - unlist(data.3column)
 [(1:sum(age.present)*3)-2]
 ...

 Your advice is earnestly sought.

 Thanks.

 Min-Han

        [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Problem with parsing a dataset - help earnestly sought

2010-04-22 Thread Min-Han Tan
Dear fellow R-help members,

I hope to seek your advice on how to parse/manage a dataset with hundreds of
columns. Two examples of these columns, 'cancer.problems', and
'neuro.problems' are depicted below. Essentially, I need to parse this into
a useful dataset, and unfortunately, I am not familiar with perl or any such
language.

data - data.frame(id=c(1:10))
data$cancer.problems - c(Y; DX AGE: 28; COLON, , Y; DX AGE: 27;, Y;
LIVER,,Y,Y; DX AGE: 24;,Y,Y;DX AGE: 44;,Y;DX AGE: 39; TESTIS)
data$neuro.problems - c(Y: DX AGE: 80-89;,Y,,Y; DX AGE: 74;
STROKE,Y; DEMENTIA,Y,,Y; DX AGE: 33; CHOREA, Y, Y; WEAKNESS)

As can be seen, the semi-colon delimiter follows its own set of rules, which
are internally consistent - with all 3 elements of data, it should be
Status; Age; Tissue Type. However, if there is only tissue type, it is
Status; Tissue Type, without the trailing semi-colon. However, if there is
Age available, it is Status; Age;.

The main challenge for me is how to parse/convert this dataset into a useful
and consistent data.frame, or list, where I can capture Status, Age and
Tissue Type as separate fields. Due to the varying application of the
delimiter, I cannot use strsplit consistently. I have tried a convoluted
method by identifying AGE as the character string identifying 3 element
fields per below, but faced problems with unlist, given the empty fields.

age.present - grepl(AGE,data[,2])
data.3column - strsplit(data[age.present,2],;)
data.2column - strsplit(data[!age.present,2],;)
data$cancer.status[age.present] - unlist(data.3column)
[(1:sum(age.present)*3)-2]
...

Your advice is earnestly sought.

Thanks.

Min-Han

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.