"count.fields" is a very nice hint for a clean solution - thank you!
Joh

On Sunday 06 March 2011 21:48:32 David Winsemius wrote:
> On Mar 6, 2011, at 12:47 PM, Johannes Graumann wrote:
> > Thank you for pointing this out. This is really inconvenient as I do not
> > know a priori how many and where those darn cases containing an additional
> > (or more) ":" might be ...
>
> There is a count.fields function that might assist with this task.
>
> You seem to have a multiline (variable number of lines) format of:
>
> NNNN:>sp|header with "|" AND white space separators
> NNNN:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEEEEE
> NNNN+60:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDEE
> NNNN+120:VARIABLE_NUMBER_OF_CAP_LETTERS_60_CHAR_WIDE
> NNNN+180:EXCEPT_LAST
>
> No way that read.table can work. You might create an index with the
> location of the high-count headers and then reprocess.
>
> log.idx <- count.fields("/tmp/testfile.txt") > 1
> corpus <- readLines("/tmp/testfile.txt")
>
> Then parse the headers and rejoin the broken multi-line content. There
> may be worked examples in the archive for variable number multi-line
> file formats.
>
> > This seems to work, but will fail if there's a "1:sdfjhlfkh:2:adlkjf"
> > somewhere (1 & 2 both integerable).
> >
> > na.exclude(as.integer(scan("/tmp/testfile.txt",sep=":",what="integer")))
> >
> > More robust pointers anyone?
> >
> > Joh
> >
> > Sarah Goslee wrote:
> >> Not so much a mystery. read.table() only looks at the first 5 lines when
> >> deciding how many columns your file has (as described in the Details
> >> section of the help).
> >>
> >> The easiest solution is to add a col.names argument to read.table() with
> >> the correct number of names.
> >>
> >> You may want to also include as.is=TRUE if you don't want your data to
> >> be imported as factors. If you expect character but have factor you may
> >> get unexpected results later.
> >>
> >> Sarah
> >>
> >> On Sun, Mar 6, 2011 at 5:04 AM, Johannes Graumann
> >> <johannes_graum...@web.de> wrote:
> >>> Hello,
> >>>
> >>> Please have a look at the code below, which I use to read in the attached
> >>> file. As line 18 of the file reads "1065:>sp|Q9V3T9|ADRO_DROME
> >>> NADPH:adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
> >>> melanogaster GN=dare PE=2 SV=1", I expect the code below to produce a 3
> >>> column data frame with most of the last column empty and line 18 to
> >>> produce a data.frame row like so:
> >>>
> >>> V1
> >>> 1065
> >>> V2
> >>> >sp|Q9V3T9|ADRO_DROME NADPH
> >>> V3
> >>> adrenodoxin oxidoreductase, mitochondrial OS=Drosophila
> >>> melanogaster GN=dare PE=2 SV=1
> >>>
> >>> Why is that not so?
> >>>
> >>> Thanks for any hint.
> >>>
> >>> Sincerely, Joh
> >>>
> >>> read.table(
> >>>   "/tmp/testfile.txt",
> >>>   sep=":",
> >>>   header=FALSE,
> >>>   quote="",
> >>>   fill=TRUE
> >>> )[19,]
> >>
> >> ---
> >> Sarah Goslee
> >> http://www.functionaldiversity.org
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html and provide commented,
> > minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
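The count.fields suggestion in the thread (index the header lines, then rejoin the multi-line content) might look roughly like the sketch below. It is only an illustration under stated assumptions: the sample lines, the temporary file, and the names is_header, record_id, offset and records are all made up here and are not from the original posts; the real /tmp/testfile.txt was not available. Splitting each line at its first ":" only is one way to sidestep the extra ":" in headers such as "NADPH:adrenodoxin".

```r
# Hypothetical sample in the format described in the thread
# (short sequence lines instead of 60-character ones).
sample_lines <- c(
  "1:>sp|P00001|DEMO_ONE First demo protein OS=Example",
  "1:ACDEFGHIKL",
  "11:MNPQRSTVWY",
  "21:>sp|P00002|DEMO_TWO NADPH:demo oxidoreductase, mitochondrial OS=Example",
  "21:ACDEFG"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Header lines contain white space, so they split into more than one
# field; sequence lines are a single field. Quotes and "#" are not
# treated specially, and blank lines are counted so that the index
# stays aligned with readLines().
is_header <- count.fields(path, quote = "", comment.char = "",
                          blank.lines.skip = FALSE) > 1
corpus <- readLines(path)
stopifnot(length(is_header) == length(corpus))

# Split every line at the FIRST ":" only, so later colons inside a
# header stay part of the header text.
prefix <- sub(":.*$", "", corpus)     # text before the first ":"
body   <- sub("^[^:]*:", "", corpus)  # text after the first ":"

# Each header starts a new record; sequence lines inherit its id.
record_id <- cumsum(is_header)

# Rejoin the sequence chunks per record. Using factor levels keeps
# records that have a header but no sequence lines (empty string).
chunks <- split(
  body[!is_header],
  factor(record_id[!is_header], levels = seq_len(sum(is_header)))
)
seq_by_record <- vapply(chunks, paste, character(1), collapse = "")

records <- data.frame(
  offset   = as.integer(prefix[is_header]),  # leading number of the header line
  header   = sub("^>", "", body[is_header]),
  sequence = unname(seq_by_record),
  stringsAsFactors = FALSE
)
print(records)
```

This relies on the assumption that only header lines contain white space. If sequence lines in the real file can contain spaces, a pattern test such as grepl("^[0-9]+:>", corpus) may be a safer way to build the index.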