I tried reading it with read.table, as below, and didn't see any obvious
problems.

> txt <- readLines("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv")
> str(txt)
 chr [1:11315] ",target,text" ...
> writeLines(substring(txt[1:5],1,40))
0,9,"From: cub...@garnet.berkeley.edu (
1,4,"From: gnel...@pion.rutgers.edu (Gre
2,11,"From: crypt-comme...@math.ncsu.edu
3,4,"From:  () Subject: Re: Quadra SCSI
> z <- read.table(text=txt, sep=",", quote="\"", header=TRUE,
+     stringsAsFactors=FALSE)  # final argument assumed; the archived line was cut off
> str(z)
'data.frame':   11314 obs. of  3 variables:
 $ X     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ target: int  9 4 11 4 0 4 5 5 13 12 ...
 $ text  : chr  "From: cub...@garnet.berkeley.edu (
      ) Subject: Re: Cubs behind Marlins? How? Artic"| __truncated__ "From:
gnel...@pion.rutgers.edu (Gregory Nelson) Subject: Thanks Apple: Free
Ethernet on my C610! Article-I.D.: "| __truncated__ "From:
crypt-comme...@math.ncsu.edu Subject: Cryptography FAQ 10/10 - References
Organization: The Crypt Cabal L"| __truncated__ "From:  () Subject: Re:
Quadra SCSI Problems??? Organization: Apple Computer Inc. Lines: 28  >
ATTENTION: Mac Qu"| __truncated__ ...
> summary(z)
       X             target           text
 Min.   :    0   Min.   : 0.000   Length:11314
 1st Qu.: 2828   1st Qu.: 5.000   Class :character
 Median : 5656   Median : 9.000   Mode  :character
 Mean   : 5656   Mean   : 9.293
 3rd Qu.: 8485   3rd Qu.:14.000
 Max.   :11313   Max.   :19.000
> sum(is.na(z$text))
[1] 0
> table(substring(z$text,1,5))

 cs.u  egsn  howl  sgib  uune  wisc  wupo  zaph Distr From: Nntp- Organ
    4     1     2     1     1     1     2     4    14 10795    13   125
Reply Subje To: g X-Mai
    3   343     1     3
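
To see why the quote argument matters for this file, here is a minimal sketch
on made-up rows. The sample lines and the helper name fields_per_line are
assumptions for illustration only; they are not taken from the real data. It
compares how many fields each line splits into when double quotes are
honoured (quote="\"") and when quoting is switched off (quote="").

```r
# Hypothetical three-line sample in the same layout as the real file:
# an unnamed index column, a target column, and a double-quoted text column.
txt <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: one, two\"",
  "1,4,\"From: b@example.org Subject: red, green, blue\""
)

# Helper (hypothetical name): number of fields per line for a given quote set.
fields_per_line <- function(lines, quote) {
  con <- textConnection(lines)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = quote)
}

# Commas inside "..." are protected when the double quote is a quote character.
n_quoted <- fields_per_line(txt, quote = "\"")
# With quoting disabled, every comma starts a new field.
n_unquoted <- fields_per_line(txt, quote = "")

print(n_quoted)
print(n_unquoted)

# Same read.table call shape as above, applied to the sample lines.
z <- read.table(text = txt, sep = ",", quote = "\"", header = TRUE,
                stringsAsFactors = FALSE)
str(z)
```

If the counts are constant with quote="\"" but vary with quote="", the extra
columns most likely come from commas inside the quoted text rather than from a
damaged file.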

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Aug 11, 2017 at 1:58 AM, <mohan.radhakrish...@cognizant.com> wrote:

> Yes, I tried that already. It is not straightforward.
>
> data <- read.csv("20_newsgroups.csv", fill=TRUE, as.is=TRUE, header=FALSE,
>                  quote="", sep=",", encoding="UTF-8")
>
> This call does read the file, but haphazardly. The emails in the text column
> are split across multiple columns, and several columns contain only NA.
> There are 202 columns in total.
>
> I then removed the all-NA columns, concatenated the remaining text columns,
> and finally got it.
>
> # keep only the columns that are not entirely NA
> munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
> # drop the first row, which holds the header line because header=FALSE
> munged <- munged[-1,]
> # paste columns 3 onward back into a single text field
> munged$text <- apply(munged[, 3:ncol(munged)], 1, paste0, collapse = " ")
> munged <- munged[,c("V1","V2","text")]
> print(head(munged$text))
>
> Mohan
>
> From: Adams, Jean [mailto:jvad...@usgs.gov]
> Sent: Thursday, August 10, 2017 8:03 PM
> To: Radhakrishnan, Mohan (Cognizant) <mohan.radhakrish...@cognizant.com>
> Cc: R help <r-help@r-project.org>
> Subject: Re: [R] EOF within quoted string
>
> You might want to try some of the suggestions mentioned in this post:
> https://stackoverflow.com/q/17414776/2140956
>
> Jean
>
> On Thu, Aug 10, 2017 at 7:59 AM, <mohan.radhakrish...@cognizant.com> wrote:
>
> Hi,
>
> I downloaded http://ssc.wisc.edu/~ahanna/20_newsgroups.csv and tried to
> read it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> which throws this:
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   EOF within quoted string
>
> For example, the first line in the file is shown below. This column contains
> only text like this. Is there a way to read it?
>
> From: cub...@garnet.berkeley.edu ()
> Subject: Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a
> Organization: University of California, Berkeley Lines: 12
> NNTP-Posting-Host: garnet.berkeley.edu
> gajar...@pilot.njin.net writes:  morgan
> and guzman will have era's 1 run higher than last year, and  the cubs will
> be idiots and not pitch harkey as much as hibbard.  castillo won't be good
> (i think he's a stud pitcher)         This season so far, Morgan and Guzman
> helped to lead the Cubs        at top in ERA, even better than THE rotation
> at Atlanta.        Cubs ERA at 0.056 while Braves at 0.059. We know it is
> early        in the season, we Cubs fans have learned how to enjoy the
>   short triumph while it is still there.
>
> Thanks,
> Mohan
>
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient(s), please reply to the
> sender and destroy all copies of the original message. Any unauthorized
> review, use, disclosure, dissemination, forwarding, printing or copying of
> this email, and/or any action taken in reliance on the contents of this
> e-mail is strictly prohibited and may be unlawful. Where permitted by
> applicable law, this e-mail and other e-mail communications sent to and
> from Cognizant e-mail addresses may be monitored.
>         [[alternative HTML version deleted]]
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

