Re: [R] how to separate string from numbers in a large txt file

David Winsemius Thu, 16 May 2019 13:08:22 -0700


On 5/16/19 12:30 PM, Michael Boulineau wrote:

Thanks for this tip on etiquette, David. I will be sure and not do that again.

I tried read.fwf from the utils package, with code like this:

  d <- read.fwf("hangouts-conversation.txt",
                 widths= c(10,10,20,40),
                 col.names=c("date","time","person","comment"),
                 strip.white=TRUE)

But it threw this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
   line 6347 did not have 4 elements



So what does line 6347 look like? (Use `readLines` and print it out.)
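A minimal sketch of that check, assuming the file is small enough to hold in memory. The temporary file and its three lines are made-up stand-ins for the real hangouts-conversation.txt, and `bad` would be 6347 for the real data:

```r
# Hypothetical stand-in for the real file: three lines, the middle one has no comment.
path <- tempfile(fileext = ".txt")
writeLines(c("2016-07-01 02:50:35 <john> hey",
             "2016-07-01 02:58:56 <jone>",
             "2016-07-01 03:02:48 <john> British security"), path)

chrvec <- readLines(path)            # one element per line of the file
bad <- 2                             # assumption: use 6347 for the real file

print(chrvec[bad])                   # the suspect line itself
print(nchar(chrvec[bad]))            # its length, to compare against sum(widths)
print(chrvec[(bad - 1):(bad + 1)])   # the neighbouring lines, for context
```

Comparing `nchar()` of the line with the sum of the `widths` argument is one quick way to see whether the line is shorter than the layout expects.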


Interestingly, though, the error only happened when I increased the
widths. But I had to increase them, or else I couldn't "see"
anything. The comment column was so narrow that nothing was being
captured by it, so to speak.

It seems like what's throwing me is that there's no comma that
demarcates the end of the text proper. For example:

Not sure why you thought there should be a comma. Lines usually end with <cr> and/or a <lf>.

Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)


sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)

This will not do any replacements when the pattern is not matched. See this test:



> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> newvec
 [1] "2016-07-01,02:50:35,<john>,hey"
 [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
 [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
 [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
 [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
 [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
 [7] "2016-07-01,02:54:17,<john>,just know it's london"
 [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
 [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."



You should probably remove the "empty comment" lines.
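One way this could look, as a sketch rather than the only approach: `grepl` with the same pattern flags the lines that have a comment, and `strcapture` (in utils since R 3.4.0) splits the matching lines straight into columns. The three-line `chrvec` and the column names are illustrative assumptions. Going directly to a data frame also sidesteps the commas that already occur inside some comments, which would otherwise collide with the comma separators.

```r
# Small made-up sample; the second line has no comment.
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>",
            "2016-07-01 02:52:07 <jane> nothing crappy has happened, not really")

pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# TRUE only where the full date/time/person/comment pattern matches.
has_comment <- grepl(pat, chrvec)
kept <- chrvec[has_comment]          # drops the "empty comment" lines

# One column per capture group; the column names are assumptions.
df <- strcapture(pat, kept,
                 proto = data.frame(date = character(),
                                    time = character(),
                                    person = character(),
                                    comment = character(),
                                    stringsAsFactors = FALSE))
print(df)
```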


--

David.


2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
lots of Starbucks in my day2016-07-01 15:35:47

It was interesting, too, when I pasted the text into the email, it
self-formatted into the way I wanted it to look. I had to manually
make it look like it does above, since that's the way that it looks in
the txt file. I wonder if it's being organized by XML or something.

Anyway, there's always a space between the two sideways carets, just
like there is right now: <John Doe> See. Space. And there's always a
space between the date and time. Like this. 2016-07-01 15:34:30 See.
Space. But there's never a space between the end of the comment and
the next date. Like this: We were in a starbucks2016-07-01 15:35:02
See. starbucks and 2016 are smooshed together.
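If the file really has no line break before each date, a rough sketch of one way to restore them is to insert a newline in front of every timestamp and then split on it. The sample string `raw` is made up from the lines above, and the timestamp pattern assumes every message starts with "YYYY-MM-DD HH:MM:SS <":

```r
# Made-up sample: three messages run together with no separators.
raw <- paste0("2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
              "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
              "2016-07-01 15:35:09 <Jane Doe> You must want coffees")

# A date, a space, a time, a space and the opening bracket of the name.
stamp <- "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} <)"

# Put a newline in front of every timestamp, then split on the newlines.
marked <- gsub(stamp, "\n\\1", raw, perl = TRUE)
msgs <- strsplit(marked, "\n", fixed = TRUE)[[1]]
msgs <- msgs[nzchar(msgs)]           # drop the empty piece before the first stamp

print(msgs)
```

Requiring the " <" after the time makes it less likely that a date typed inside a comment gets treated as the start of a new message.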

This code is also on the table right now too.

a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
                quote="\"", comment.char="", fill=TRUE)

h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])

aa<-gsub("[^[:digit:]]","",h)
my.data.num <- as.numeric(str_extract(h, "[0-9]+"))

Those last lines are a work in progress. I wish I could import a
picture of what it looks like when it's translated into a data frame.
The fill=TRUE helped to get the data into a table that kind of sort of
works, but the comments keep bleeding into the date and time columns.
It's like

2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
over               there
2016-07-01 15:59:27 <Jane Doe> It confuses me :(

And then, maybe, the "Seriously" will be in a column all to itself, as
will the "I've" and the "never" etc.

I will use a regular expression if I have to, but it would be nice to
keep the dates and times on there. Originally, I thought they were
meaningless, but I've since changed my mind on that count. The time of
day isn't so important. But, especially since, say, Gmail itself knows
how to quickly recognize what it is, I know it can be done. I know
this data has structure to it.
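As an illustration of that structure, once the date and time have been pulled out as text they can be turned into a proper date-time with `as.POSIXct`. This is only a sketch: the two sample values are taken from the lines above, and the UTC time zone is an assumption, since the file does not say which zone the stamps are in.

```r
# Sample date and time strings as they appear in the file.
date <- c("2016-07-01", "2016-07-01")
time <- c("02:50:35", "15:34:30")

# Assumption: UTC; swap in the real time zone if it is known.
when <- as.POSIXct(paste(date, time),
                   format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(when)
print(format(when, "%H"))                          # hour of day as text
print(difftime(when[2], when[1], units = "secs"))  # gap between the two messages
```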

Michael



On Wed, May 15, 2019 at 8:47 PM David Winsemius <[email protected]> wrote:


On 5/15/19 4:07 PM, Michael Boulineau wrote:

I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <john> hey
2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
2016-07-01 02:51:45 <john> thinking about my boo
2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
2016-07-01 02:54:17 <john> just know it's london
2016-07-01 02:56:44 <jane> you are probably asleep
2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <jone>
2016-07-01 02:59:34 <jane>
2016-07-01 03:02:48 <john> British security is a little more rigorous...

Looks entirely not-"crazy". Typical log file format.

Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
(i.e. the `sub` function) to strip everything up to the "<". Read
`?regex`. Since "<" is not a metacharacter, you could use the pattern
".+<" and replace with "".
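A short sketch of option 2 on made-up lines. Note that ".+<" is greedy and also consumes the "<" itself, so if the bracket should stay (as in the desired output) a pattern such as "^[^<]*" may be closer to what is wanted. The third sample line, with a second "<" inside the comment, is an invented edge case:

```r
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>",
            "2016-07-01 02:51:26 <jane> I <3 Edinburgh")

# ".+<" is greedy: it removes everything through the LAST "<" on the line,
# including the bracket itself.
greedy <- sub(".+<", "", chrvec)

# "^[^<]*" removes only the text before the FIRST "<" and keeps the bracket.
stripped <- sub("^[^<]*", "", chrvec)

print(greedy)
print(stripped)
```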

And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
at least within hours of each, is considered poor manners.


--

David.

It goes on for a while. It's a big file. But I feel like it's going to
be difficult to annotate with the coreNLP library or package. I'm
doing natural language processing. In other words, I'm curious as to
how I would shave off the dates, that is, to make it look like:

<john> hey
<jane> waiting for plane to Edinburgh
<john> thinking about my boo
<jane> nothing crappy has happened, not really
<john> plane went by pretty fast, didn't sleep
<jane> no idea what time it is or where I am really
<john> just know it's london
<jane> you are probably asleep
<jane> I hope fish was fishy in a good eay
<jone>
<jane>
<john> British security is a little more rigorous...

To be clear, then, I'm trying to clean a large text file by writing a
regular expression, such that I create a new object with no numbers or
dates.

Michael

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
