Re: [R] Why do I have a column called row.names?

Brian Diggs Mon, 04 Jun 2012 13:14:14 -0700

On 6/4/2012 12:12 PM, Marc Schwartz wrote:

To jump into the fray, he really needs to read the Details section of
?read.table and arguably, the source code for read.table().


It is not that the resultant data frame has row names, but that an
additional first *column name* called 'row.names' is created, which
does not exist in the source data.

The Details section has:

If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names. This allows data frames to be read in from the format in which
they are printed. If row.names is specified and does not refer to the
first column, that column is discarded from such files.

The number of data columns is determined by looking at the first five
lines of input (or the whole file if it has less than five lines), or
from the length of col.names if it is specified and is longer. This
could conceivably be wrong if fill or blank.lines.skip are true, so
specify col.names if necessary (as in the ‘Examples’).
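
A minimal sketch of the first paragraph quoted above, using made-up inline data (the names a, b, r1 and r2 are illustrative and not from Ed's file). The header has two entries and each data line has three fields, so read.table() should treat the first field as the row names rather than as a data column:

```r
# Header line: 2 entries. Data lines: 3 fields, one more than the header.
txt <- "a b
r1 1 2
r2 3 4"

df <- read.table(text = txt, header = TRUE)

rownames(df)  # taken from the first field of each data line
names(df)     # only the two names found in the header
```

No row.names argument is given here, which is the case the Details section describes.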


In the source code for read.table(), which is called by read.delim()
with differing defaults, there is:

rlabp <- (cols - col1) == 1L

and a few lines further down:

if (rlabp) col.names <- c("row.names", col.names)

So the last code snippet is where a new first column name called
'row.names' is pre-pended to the column names found from reading the
header row. 'cols' and 'col1' are defined in prior code based upon
various conditions.
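
One rough way to look for the same mismatch without reading the read.table() source is count.fields(). This is only a sketch: the three-column input is hypothetical, and the comparison is an approximation of the cols/col1 test, not the code read.table() actually runs.

```r
# Hypothetical three-column input whose data lines end in a stray tab.
lines <- c("start\tstop\tSymbol",
           "1\t2\tA\t",
           "3\t4\tB\t")

con <- textConnection(lines)
n <- count.fields(con, sep = "\t")  # number of fields on each line, header first
close(con)

header_fields <- n[1]
data_fields   <- max(n[-1])

# Rough analogue of rlabp: data lines carry exactly one field more than the header.
one_extra <- (data_fields - header_fields) == 1L
one_extra
```

For a real file, count.fields("testdata.txt", sep = "\t") would be the equivalent call. If the data contain apostrophes or quotes, passing quote = "" may be needed to get sensible counts.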

Not having the full data set and possibly having line wrap and TAB
problems with the text that Ed pasted into his original post, I
cannot properly replicate the conditions that cause the above code to
be triggered.

If Ed can put the entire file someplace and provide a URL for
download, perhaps we can better trace the source of the problem, or
Ed might use ?debug to follow the code execution in read.table() and
see where the relevant flags get triggered. The latter option would
help Ed learn how to use the debugging tools that R provides to dig
more deeply into such issues.

I agree that the actual file would be helpful. But I can get it to
happen if there are extra delimiters at the end of the data lines (which
there can be with a separator of tab, which is not obviously visible). I
can get it with:


BACS<-read.delim(textConnection(
"start\tstop\tSymbol\tInsert sequence\tClone End Pair\tFISH
203048\t67173930\t\tABC8-43024000D23\tTI:993812543\tTI:993834585\t
255176\t87869359\t\tABC8-43034700N15\tTI:995224581\tTI:995237913\t
1022033\t1060472\t\tABC27-1253C21\tTI:2094436044\tTI:2094696079\t
1022033\t1061172\t\tABC23-1388A1\tTI:2120730727\tTI:2121592459\t"),
                 row.names=NULL, fill=TRUE)

which gives

> BACS
  row.names    start stop           Symbol Insert.sequence
1    203048 67173930   NA ABC8-43024000D23    TI:993812543
2    255176 87869359   NA ABC8-43034700N15    TI:995224581
3   1022033  1060472   NA    ABC27-1253C21   TI:2094436044
4   1022033  1061172   NA     ABC23-1388A1   TI:2120730727
  Clone.End.Pair FISH
1   TI:993834585   NA
2   TI:995237913   NA
3  TI:2094696079   NA
4  TI:2121592459   NA

or

> str(BACS)
'data.frame':   4 obs. of  7 variables:
 $ row.names      : chr  "203048" "255176" "1022033" "1022033"
 $ start          : int  67173930 87869359 1060472 1061172
 $ stop           : logi  NA NA NA NA
 $ Symbol         : Factor w/ 4 levels "ABC23-1388A1",..: 3 4 2 1
 $ Insert.sequence: Factor w/ 4 levels "TI:2094436044",..: 3 4 1 2
 $ Clone.End.Pair : Factor w/ 4 levels "TI:2094696079",..: 3 4 1 2
 $ FISH           : logi  NA NA NA NA

The extra delimiter at the end of the line triggers the
one-more-data-than-column-name condition, which then gives the row.names
column.
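
If that is indeed the cause, one possible workaround is to strip the trailing tab before parsing. This is a sketch on hypothetical three-column data, and it assumes every data line carries exactly one surplus trailing tab. If the last real column can legitimately be empty, this would remove a genuine field instead.

```r
# Hypothetical input: header has 3 names, data lines end in an extra tab.
raw <- c("start\tstop\tSymbol",
         "1\t2\tA\t",
         "3\t4\tB\t")

# Drop one trailing tab per line where present; the header line is unaffected.
clean <- sub("\t$", "", raw)

fixed <- read.delim(text = clean)
names(fixed)  # header names only, no "row.names" column
```

For a file on disk, raw <- readLines("testdata.txt") would take the place of the inline vector.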

Regards,

Marc Schwartz


On Jun 4, 2012, at 1:30 PM, Bert Gunter wrote:

Actually, I think it's ?data.frame that he should read.

The salient points are that:
1. All data frames must have unique row names. If not provided, they
are produced. Row numbers **are** row names.

2. The return values of the read methods are data frames.
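
A small illustration of point 1, on a made-up data frame. The use of .row_names_info() to tell automatic row names from supplied ones is my addition, not something from the thread:

```r
# No row names supplied: R generates them, and they are the row numbers.
df <- data.frame(x = c(10, 20, 30))
rownames(df)
.row_names_info(df)  # negative when the row names are "automatic"

# Duplicate row names are rejected; capture the error message instead of stopping.
dup <- tryCatch(
  data.frame(x = 1:2, row.names = c("a", "a")),
  error = function(e) conditionMessage(e)
)
dup
```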

-- Bert

On Mon, Jun 4, 2012 at 11:05 AM, David L Carlson <dcarl...@tamu.edu> wrote:

Try help("read.delim") - always a good strategy before using a function for
the first time:

In it, you will find: "Using row.names = NULL forces row numbering. Missing
or NULL row.names generate row names that are considered to be 'automatic'
(and not preserved by as.matrix)."
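
A short sketch of what "not preserved by as.matrix" means in practice, with made-up data frames (the names auto and named are illustrative only):

```r
# Automatic row names: as.matrix() does not carry them over.
auto <- data.frame(x = 1:2, y = 3:4)
rownames(as.matrix(auto))

# Explicit row names: as.matrix() keeps them.
named <- data.frame(x = 1:2, y = 3:4, row.names = c("a", "b"))
rownames(as.matrix(named))
```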

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org]
On Behalf Of Ed Siefker
Sent: Monday, June 04, 2012 12:47 PM
To: r-help@r-project.org
Subject: [R] Why do I have a column called row.names?

I'm trying to read in a tab separated table with read.delim().
I don't particularly care what the row names are.
My data file looks like this:


start   stop    Symbol  Insert sequence Clone End Pair  FISH
203048  67173930        ABC8-43024000D23                TI:993812543
  TI:993834585
255176  87869359        ABC8-43034700N15                TI:995224581
  TI:995237913
1022033 1060472 ABC27-1253C21           TI:2094436044   TI:2094696079
1022033 1061172 ABC23-1388A1            TI:2120730727   TI:2121592459



I have to do something with row.names because my first column has
duplicate entries.  So I read in the file like this:

BACS<-read.delim("testdata.txt", row.names=NULL, fill=TRUE)
head(BACS)

   row.names    start             stop Symbol Insert.sequence
Clone.End.Pair
1    203048 67173930 ABC8-43024000D23     NA    TI:993812543
TI:993834585
2    255176 87869359 ABC8-43034700N15     NA    TI:995224581
TI:995237913
3   1022033  1060472    ABC27-1253C21     NA   TI:2094436044
TI:2094696079
4   1022033  1061172     ABC23-1388A1     NA   TI:2120730727
TI:2121592459
   FISH
1   NA
2   NA
3   NA
4   NA


Why is there a column named "row.names"?  I've tried a few different
ways of invoking this, but I always get the first column named
row.names, and the rest of the columns shifted by one.

Obviously I could fix this by using row.names<-, but I'd like to
understand why this happens.  Any insight?



--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
