Hi Sun,

No, I was thinking of something like hunspell, which seems to fit into the sort of work that you are doing.
Jim

On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine <phaedr...@gmail.com> wrote:

> Thanks Jeff.
>
> I'll add that to the ever-growing list my current studies are generating
> daily. :-)
>
> Cheers
> S
>
> On 10/04/15 14:32, Jeff Newmiller wrote:
>
>> "I suspect that it might have something to do with regular expressions,
>> but to be honest, I'm (currently) pretty crap with those."
>>
>> I cannot think of a better incentive to take action on this hole in your
>> education and buckle down to learn regular expressions. There are many
>> books and tutorials available.
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedr...@gmail.com> wrote:
>>
>>> Hi list
>>>
>>> Using the tm package, part of the pre-processing work is to remove
>>> words, etc. from the corpus.
>>>
>>> I wish to remove people's names and also their initials which are
>>> peppered throughout the corpus. But, because some people's initials are
>>> the same as parts of common words - e.g. 'am' = 'became' => 'bec e' or
>>> 'ec' = 'because' => 'b ause' or 'ar' = 'arrival' => 'rival' (which has a
>>> completely different meaning).
>>>
>>> Is there any way of doing this without leaving a trail of nonsense
>>> half-terms behind? I suspect that it might have something to do with
>>> regular expressions, but to be honest, I'm (currently) pretty crap with
>>> those.
>>>
>>> Would it make a difference if I removed initials and names *prior* to
>>> converting all text to lower case, so I remove 'AM' and because 'became'
>>> is lower case, it should remain unaffected?
>>>
>>> Any recommendations on how best to proceed with this?
>>>
>>> Thanks as always.
>>> Sun
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
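
A note on the half-word problem in the original question: the usual regular-expression tool for it is the word-boundary anchor `\b`, which lets a pattern match 'am' only when it stands alone as a word. Below is a minimal sketch of that idea, written in Python rather than R purely for illustration. The helper name `remove_terms`, the initials, and the sample sentences are all made up here and do not come from the thread. R's gsub() understands the same `\b` anchor.

```python
import re

def remove_terms(text, terms, ignore_case=False):
    """Remove whole-word occurrences of terms, leaving longer words intact."""
    # \b on both sides means a term only matches when it is a word by itself,
    # so 'am' does not match inside 'became'.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    flags = re.IGNORECASE if ignore_case else 0
    stripped = re.sub(pattern, "", text, flags=flags)
    # Collapse the runs of whitespace left behind where a term was removed.
    return re.sub(r"\s+", " ", stripped).strip()

# Hypothetical initials and sample text, for illustration only.
initials = ["am", "ec", "ar"]
sample = "am said the arrival became late because ec and ar agreed"
print(remove_terms(sample, initials))

# Case-sensitive removal before lower-casing: 'AM' goes, the word 'am' stays.
print(remove_terms("AM thinks I am late", ["AM"]))
```

On the question of removing initials before converting to lower case: matching is case-sensitive by default in this sketch, so upper-case initials can be removed while a lower-case word with the same letters stays. That only helps if the initials are reliably upper case in the corpus, which is an assumption about the data.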