I tried reading it with read.table, as below, and didn't see any obvious problems.
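Before the full session, a minimal sketch of why the quote setting matters for a file shaped like this one. The three-line sample is made up for illustration (the addresses and subjects are not from 20_newsgroups.csv); it only assumes the same layout: an unnamed index column, a target column, and a double-quoted text column containing commas. With quote="\"" the embedded commas stay inside one field; with quote="" and fill=TRUE every comma starts a new column, which is consistent with the many extra columns described further down the thread.

```r
# Hypothetical sample with the same layout as the real file
# (invented content, used only to show the effect of 'quote').
sample_txt <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: Cubs, Marlins, and ERA\"",
  "1,4,\"From: b@example.org Subject: Thanks, Apple\""
)

# Double quote is the quote character: commas inside the text are kept.
ok <- read.table(text = sample_txt, sep = ",", quote = "\"",
                 header = TRUE, stringsAsFactors = FALSE)
str(ok)

# Quoting disabled: each comma is a separator, so the text column is split
# and fill = TRUE pads the shorter rows.
bad <- read.table(text = sample_txt, sep = ",", quote = "",
                  header = FALSE, fill = TRUE, stringsAsFactors = FALSE)
dim(bad)
```

Comparing ncol(ok) with ncol(bad) on the real data is a quick way to check which of the two situations a given call is in.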
> txt <- readLines("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv")
> str(txt)
 chr [1:11315] ",target,text" ...
> writeLines(substring(txt[1:5],1,40))
,target,text
0,9,"From: cub...@garnet.berkeley.edu (
1,4,"From: gnel...@pion.rutgers.edu (Gre
2,11,"From: crypt-comme...@math.ncsu.edu
3,4,"From: () Subject: Re: Quadra SCSI
> z <- read.table(text=txt, sep=",", quote="\"", header=TRUE, stringsAsFactors=FALSE)
> str(z)
'data.frame':   11314 obs. of  3 variables:
 $ X     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ target: int  9 4 11 4 0 4 5 5 13 12 ...
 $ text  : chr  "From: cub...@garnet.berkeley.edu ( ) Subject: Re: Cubs behind Marlins? How? Artic"| __truncated__ "From: gnel...@pion.rutgers.edu (Gregory Nelson) Subject: Thanks Apple: Free Ethernet on my C610! Article-I.D.: "| __truncated__ "From: crypt-comme...@math.ncsu.edu Subject: Cryptography FAQ 10/10 - References Organization: The Crypt Cabal L"| __truncated__ "From: () Subject: Re: Quadra SCSI Problems??? Organization: Apple Computer Inc. Lines: 28 > ATTENTION: Mac Qu"| __truncated__ ...
> summary(z)
       X             target           text
 Min.   :    0   Min.   : 0.000   Length:11314
 1st Qu.: 2828   1st Qu.: 5.000   Class :character
 Median : 5656   Median : 9.000   Mode  :character
 Mean   : 5656   Mean   : 9.293
 3rd Qu.: 8485   3rd Qu.:14.000
 Max.   :11313   Max.   :19.000
> sum(is.na(z$text))
[1] 0
> table(substring(z$text,1,5))

 cs.u  egsn  howl  sgib  uune  wisc  wupo  zaph Distr From: Nntp- Organ Orgin
    4     1     2     1     1     1     2     4    14 10795    13   125     1
Reply Subje To: g X-Mai
    3   343     1     3

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Aug 11, 2017 at 1:58 AM, <mohan.radhakrish...@cognizant.com> wrote:

> Yes. I tried that already. Not straightforward.
>
> data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F,
> quote="", sep=",", encoding="UTF-8")
>
> This line does read it haphazardly. The emails in the column are split
> into multiple columns and there are several columns with just ‘NA’. Totally
> 202 columns.
>
> And then I removed columns with NA’s and concatenated all the text and
> finally got it.
>
> munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
> munged <- munged[-1,]
> munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 ,
> collapse = " ")
>
> munged <- munged[,c("V1","V2","text")]
>
> print(head(munged$text))
>
> Mohan
>
> From: Adams, Jean [mailto:jvad...@usgs.gov]
> Sent: Thursday, August 10, 2017 8:03 PM
> To: Radhakrishnan, Mohan (Cognizant) <mohan.radhakrish...@cognizant.com>
> Cc: R help <r-help@r-project.org>
> Subject: Re: [R] EOF within quoted string
>
> You might want to try some of the suggestions mentioned in this post:
> https://stackoverflow.com/q/17414776/2140956
>
> Jean
>
> On Thu, Aug 10, 2017 at 7:59 AM, <mohan.radhakrish...@cognizant.com> wrote:
> Hi,
>
> Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading
> it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> EOF within quoted string
>
> So, for example, the first line in the file is this. This column contains
> only such text. Is there a way to read it?
>
> From: cub...@garnet.berkeley.edu ()
> Subject: Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a
> Organization: University of California, Berkeley Lines: 12
> NNTP-Posting-Host: garnet.berkeley.edu
> gajar...@pilot.njin.net writes: morgan
> and guzman will have era's 1 run higher than last year, and the cubs will
> be idiots and not pitch harkey as much as hibbard. castillo won't be good
> (i think he's a stud pitcher) This season so far, Morgan and Guzman
> helped to lead the Cubs at top in ERA, even better than THE rotation
> at Atlanta. Cubs ERA at 0.056 while Braves at 0.059.
> We know it is early in the season, we Cubs fans have learned how to enjoy the
> short triumph while it is still there.
>
> Thanks,
> Mohan
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient(s), please reply to the
> sender and destroy all copies of the original message. Any unauthorized
> review, use, disclosure, dissemination, forwarding, printing or copying of
> this email, and/or any action taken in reliance on the contents of this
> e-mail is strictly prohibited and may be unlawful. Where permitted by
> applicable law, this e-mail and other e-mail communications sent to and
> from Cognizant e-mail addresses may be monitored.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.