Rich, I believe the method I referred you to already has some of those rules built in. You should be able to find some pre-written conditionals for this kind of thing. Some of the data-mining contractors we work with do a pretty good job of getting us clean data. Of course, most of them use Perl, but we have a few who use Python :)
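For what it's worth, here is a rough sketch of the approach Rich describes in the quoted message below: put every string in title case, then correct the meaningful tokens afterwards. The function name `clean_field` and the `CORRECTIONS` table are made up for illustration; only the "Or" to "OR" and "Po" to "PO" cases come from the thread, and a real table would need many more entries.

```python
import re

# Hypothetical correction table (an assumption, not from any library):
# maps a title-cased token to the form it should have.
CORRECTIONS = {
    "Or": "OR",  # state abbreviation
    "Po": "PO",  # as in "PO Box"
}


def clean_field(text, corrections=CORRECTIONS):
    """Title-case a string, then restore known abbreviations."""
    # Capitalize each whitespace-separated word. str.capitalize() also
    # lowercases the rest of the word, so "PORTLAND," becomes "Portland,".
    titled = " ".join(word.capitalize() for word in text.split())

    def fix(match):
        word = match.group(0)
        return corrections.get(word, word)

    # Look up each run of letters in the correction table, so trailing
    # punctuation such as the comma in "Portland," does not block a match.
    return re.sub(r"[A-Za-z]+", fix, titled)


print(clean_field("po box 1234"))
print(clean_field("PORTLAND, OR 97204"))
```

One caveat: a blanket "Or" to "OR" rule will also upper-case the conjunction in something like an entity name, so it is probably safest to apply each correction only to the field where it makes sense (the state field for "OR", the address field for "PO").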
--
*Rami Kassab* - Chief Executive Officer
M 503.888.8605
[email protected]
LinkedIn Profile <http://www.linkedin.com/in/RamiKassab>

*Typethink* - Creative Web Firm
P 503.626.6231
F 503.626.6233
111 SW 5th Ave., Suite 1000
Portland, OR 97204
www.typethink.com

On Tue, Sep 22, 2009 at 8:45 AM, Rich Shepard <[email protected]> wrote:

> On Tue, 22 Sep 2009, Dylan Reinhardt wrote:
>
>> Consider the following lines that you might find in a "street address"
>> line:
>>
>> - Dept. of Motor Vehicles
>> - Attn: Guido van Rossum
>> - Attn: Henry Higgins III
>> - San Francisco Chapter
>> - PO Box 1234
>> - Mail Stop C5A
>> - Attn: A/R
>>
>
> Dylan,
>
> The above are true for generic situations. The data I want to clean have
> the equivalent of all of the above in separate strings, and the name of the
> entity, too. So, if all strings are placed in title case (i.e., the first
> letter of each word capitalized), it will be relatively trivial to correct
> the meaningful strings (e.g., "Or" to "OR" and "Po" to "PO") afterwards.
>
> Rich
>
> --
> Richard B. Shepard, Ph.D. | Integrity Credibility
> Applied Ecosystem Services, Inc. | Innovation
> <http://www.appl-ecosys.com> Voice: 503-667-4517 Fax: 503-667-8863
>
> _______________________________________________
> Portland mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/portland
