Remind me to read my posts before pressing the send button just one more time, would you?

On Thu, 2003-05-29 at 09:05, Dirk Koopman wrote:
> On Tue, 2003-05-27 at 19:44, Nik Butler wrote:
> > Here's a problem for the perl ancients among you.....
> >
> > One of our customers (I say our since, like the Borg, I've joined a
> > collective) requires a regular deduplication of list information
> > (mostly CSV) against an existing database (SQL Server 2k).
> >
> > Now I'm fairly sure that this is exactly what Perl was designed for ...
> > however when searching for tools and advice on utilising those tools I
> > do tend to come up a little nonplussed.
> >
>
> The trouble is that people are not very consistent at writing their
> addresses, neither do they spell terribly exactly. You can use one or
> more of the fuzzy match algorithms, some clever sorting, together with
> agrep and friends, but it will only go so far. At the end of the day
> there is no substitute for human intervention and eyeball pattern
> matching...
>
> Unfortunately, to do this properly requires fuzzy logic and some
> intelligent human interaction. Basically, perl is your friend for doing
> the obvious, simple stuff - i.e. the addresses that are identical. Also
> for generating the 'possibles' you will need to scan.
>
> The snail mailing list specialists keep this sort of software close to
> their chests because it is that which gives them the edge, viz: "clean"
> (deduped) lists, that pays top dollar.
>
> Best of luck...
>
> Dirk
-- 
Please Note: Some Quantum Physics Theories Suggest That When the
Consumer Is Not Directly Observing This Product, It May Cease to Exist
or Will Exist Only in a Vague and Undetermined State.
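
For what it's worth, the split described in the quoted message (exact matches handled automatically, 'possibles' queued for a human to eyeball) might look roughly like the sketch below. It is written in Python rather than Perl purely for illustration, and everything named in it is an assumption: the helpers normalise() and dedupe(), the 0.85 threshold, and the sample addresses are all made up, and difflib's SequenceMatcher merely stands in for whichever fuzzy match algorithm you actually settle on.

```python
import re
from difflib import SequenceMatcher


def normalise(address):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    return " ".join(cleaned.split())


def dedupe(existing, incoming, threshold=0.85):
    """Sort incoming addresses into exact duplicates, 'possibles' and new.

    threshold is a hypothetical similarity cut-off; tune it on real data.
    """
    # Normalised form -> original text, for everything already known.
    known = {normalise(a): a for a in existing}
    duplicates, possibles, fresh = [], [], []

    for addr in incoming:
        key = normalise(addr)
        if key in known:
            # The obvious, simple case: identical once tidied up.
            duplicates.append(addr)
            continue

        # Find the closest known address by similarity ratio.
        best, score = None, 0.0
        for known_key, original in known.items():
            ratio = SequenceMatcher(None, key, known_key).ratio()
            if ratio > score:
                best, score = original, ratio

        if score >= threshold:
            # Close but not identical: leave it for a human to decide.
            possibles.append((addr, best, round(score, 2)))
        else:
            fresh.append(addr)
            known[key] = addr  # later rows are checked against this one too

    return duplicates, possibles, fresh


if __name__ == "__main__":
    existing = ["12 High Street, Norwich", "4 Mill Lane, Dereham"]
    incoming = [
        "12 High Street,  NORWICH",
        "12 High Stret, Norwich",
        "99 Station Road, Ely",
    ]
    dups, maybe, new = dedupe(existing, incoming)
    print("duplicates:", dups)
    print("possibles: ", maybe)
    print("new:       ", new)
```

The inner loop compares every incoming row against every known row, so it will crawl on a big list; that is where the clever sorting comes in, to cut down the number of candidates before any fuzzy comparison happens. The 'possibles' list is deliberately not resolved by the code, since that is the part that still needs a person.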
