[R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Adrian Dragulescu


Hello,

Below is some output that shows my issue.

I have a variable x that I read from a file (more on this below)


x

[1] "NEW YORK NEW ENGLAND"

gsub(" -", "-", x)  # this does not work!

[1] "NEW YORK NEW ENGLAND"

Encoding(x)   # is x in a special encoding? no

[1] "unknown"

y = "NEW YORK -NEW ENGLAND"   # I type in variable y
gsub(" -", "-", y)  # and gsub works as expected

[1] "NEW YORK-NEW ENGLAND"




I'm sure the problem has to do with the way I read the variable x.  But even if 
I change the encoding for x to ASCII, I still cannot do the sub.
I get x by reading a pdf file with pdftotext so you will not be able to 
replicate my issue.


Thanks for any suggestions,
Adrian

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Duncan Murdoch

On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:

x

[1] "NEW YORK NEW ENGLAND"

gsub(" -", "-", x)  # this does not work!

[1] "NEW YORK NEW ENGLAND"


It looks as though it worked, presumably because something got lost in 
your email.


Could you post charToRaw(x) so we can see what's in x?

Duncan Murdoch




Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Adrian Dragulescu




charToRaw(x)

 [1] 4e 45 57 20 59 4f 52 4b 20 ad 4e 45 57 20 45 4e 47 4c 41 4e 44

charToRaw(y)

 [1] 4e 45 57 20 59 4f 52 4b 20 2d 4e 45 57 20 45 4e 47 4c 41 4e 44




So they are different.

Adrian

I use R 2.8.1 on WinXP
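
For readers following along, the comparison of the two byte dumps can be automated. The sketch below is an illustration and not part of the original exchange: it rebuilds x from the bytes posted above (the names x_raw, y_raw, diff_pos and non_ascii_pos are made up for this example) and reports where the two strings disagree.

```r
# Bytes of x exactly as posted in the charToRaw(x) output above.
x_raw <- as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b, 0x20, 0xad,
                  0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c, 0x41, 0x4e,
                  0x44))

# y is the string typed at the keyboard, which is plain ASCII.
y_raw <- charToRaw("NEW YORK -NEW ENGLAND")

# Positions where the two byte sequences disagree.
diff_pos <- which(as.integer(x_raw) != as.integer(y_raw))

# Positions in x that hold bytes outside the 7-bit ASCII range.
non_ascii_pos <- which(as.integer(x_raw) > 127L)

print(diff_pos)         # 10
print(x_raw[diff_pos])  # ad
print(y_raw[diff_pos])  # 2d
```

Only byte 10 differs: x holds 0xad where the typed string holds 0x2d, so a pattern containing an ASCII hyphen has nothing to match in x.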




Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Duncan Murdoch

On 10/14/2009 1:41 PM, Adrian Dragulescu wrote:




I use R 2.8.1 on WinXP


But that's ancient.  Please try again with the beta of 2.10.0, and let 
us know if you still see a problem.


Duncan Murdoch






Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Prof Brian Ripley

On Wed, 14 Oct 2009, Adrian Dragulescu wrote:



So they are different.


We really do need the 'at a minimum' information we asked you for in 
the posting guide.  But in cp1252 (a guess as to what you might be 
using) \xad is a 'soft hyphen', and that is not the same thing as a 
hyphen -- you will get the same issues with 'non-breaking space'.


BDR
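
One way to act on this, sketched here as an illustration and not taken from the thread, is to declare the bytes as cp1252, convert them to UTF-8, and then replace the soft hyphen (U+00AD) and non-breaking space (U+00A0) by name. The sketch assumes the input really is cp1252, as guessed above, and that the platform's iconv knows the name "CP1252"; the sample string with an added non-breaking space and the variable names are hypothetical.

```r
# Hypothetical input: the thread's string, plus a non-breaking space (0xa0)
# between "NEW" and "ENGLAND" to show the second case mentioned above.
x <- rawToChar(c(charToRaw("NEW YORK "), as.raw(0xad),
                 charToRaw("NEW"), as.raw(0xa0), charToRaw("ENGLAND")))

# Assumption: the bytes are cp1252. Convert so the characters can be
# addressed by their Unicode code points.
x_utf8 <- iconv(x, from = "CP1252", to = "UTF-8")

# Soft hyphen -> ordinary hyphen, non-breaking space -> ordinary space.
clean <- gsub("\u00ad", "-", x_utf8, fixed = TRUE)
clean <- gsub("\u00a0", " ", clean, fixed = TRUE)

# Finally the substitution that was attempted in the original post.
clean <- gsub(" -", "-", clean, fixed = TRUE)

print(clean)
```

After the conversion the result is plain ASCII, so later pattern matching no longer depends on the locale.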




On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:


x

[1] "NEW YORK NEW ENGLAND"

gsub(" -", "-", x)  # this does not work!

[1] "NEW YORK NEW ENGLAND"


Well, I see no hyphen at all here, but then I am not on Windows.



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Adrian Dragulescu


I get the same results (not working) using R 2.9.2 and R 2.10.0 beta.

Thank you for looking at this.



Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Duncan Murdoch

On 10/14/2009 2:16 PM, Adrian Dragulescu wrote:

I get the same results (not working) using R 2.9.2 and R 2.10.0 beta.


But it is working:  the dash is an "ad" in x, not a "2d".  You need to 
ask to substitute for the "ad" character, e.g. by


spacelongdash <- rawToChar(as.raw(c(0x20, 0xad)))
gsub(spacelongdash, "-", x)

Duncan Murdoch
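
Since x came from a PDF and cannot be shared, here is a self-contained sketch of the suggestion above, with x rebuilt from the bytes posted earlier in the thread. The arguments fixed = TRUE and useBytes = TRUE are additions for this illustration and were not part of the suggestion: they make the match purely byte-wise, which should let the example run in a UTF-8 locale, where the lone 0xad byte is not a valid character.

```r
# Rebuild x from the charToRaw(x) output posted earlier in the thread.
x <- rawToChar(as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b, 0x20,
                        0xad, 0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c,
                        0x41, 0x4e, 0x44)))

# The two-byte pattern: a space followed by the 0xad byte.
spacelongdash <- rawToChar(as.raw(c(0x20, 0xad)))

# Byte-wise literal replacement (assumption: added here for portability).
result <- gsub(spacelongdash, "-", x, fixed = TRUE, useBytes = TRUE)

print(result)  # [1] "NEW YORK-NEW ENGLAND"
```

In the cp1252 locale used in the thread the plain gsub(spacelongdash, "-", x) call is enough, because 0xad is a valid single-byte character there.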





Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Adrian Dragulescu


Thank you.

If I use

gsub(" \xad", "-", x)

[1] "NEW YORK-NEW ENGLAND"

I get what I want.

Adrian


sessionInfo()

R version 2.9.2 (2009-08-24)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;
LC_CTYPE=English_United States.1252;
LC_MONETARY=English_United States.1252;
LC_NUMERIC=C;
LC_TIME=English_United States.1252


attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base




Re: [R] puzzle using gsub (and encodings maybe)

2009-10-14 Thread Duncan Murdoch

On 10/14/2009 2:29 PM, Adrian Dragulescu wrote:

Thank you.

If I use

gsub(" \xad", "-", x)

[1] "NEW YORK-NEW ENGLAND"

I get what I want.


Right, that's simpler than what I suggested.

Duncan Murdoch


