On 05/25/2011 03:13 PM, Tarlika Elisabeth Schmitz wrote:
On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent <robjsarg...@gmail.com> wrote:



On 05/24/2011 10:57 AM, Lew wrote:
Tarlika Elisabeth Schmitz wrote:

CREATE TABLE person
(
   id integer NOT NULL,
   "name" character varying(256) NOT NULL,
   "location" character varying(256),
   CONSTRAINT person_pkey PRIMARY KEY (id)
);

This was just a TEMPORARY table I created for quick analysis
of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
   id character varying(3) NOT NULL, -- alpha-3 code
   "name" character varying(50) NOT NULL,
   CONSTRAINT country_pkey PRIMARY KEY (id)
);


To minimize the ultimately quite necessary human adjudication, one
might make good use of what is often termed "crowd-sourcing": keep
all the distinct hand-entered values and a map to the final human
assessment.

I was wondering how to do just that. I don't think it would be a good
idea to hard-code this into the clean-up script. Take, for instance,
variations of COUNTRY.NAME spelling. Where would I store these?

I could do with a concept for this problem, which applies to a lot of
string-type info.

I think you keep your current structures used for deducing the canonical forms, but with each unique input you encounter, you add it to your seen-thus-far list, which then becomes just one more check (possibly the first such check).

create table address_input
(
   id serial primary key,                        -- sequence-generated unique id
   human_input character varying(256) not null,  -- value exactly as entered
   resolution character varying(256)             -- canonical form, once decided
);
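
During clean-up, the lookup might then go something like this (a sketch only; the sample values are made up):

-- If this raw value has been seen before, reuse its stored resolution.
SELECT resolution FROM address_input WHERE human_input = 'Untied Kingdom';

-- First encounter: record it so a human can adjudicate it later.
INSERT INTO address_input (human_input) VALUES ('Untied Kingdom');

-- Once adjudicated, the canonical form is reused on every future run.
UPDATE address_input
   SET resolution = 'United Kingdom'
 WHERE human_input = 'Untied Kingdom';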

You may have to add a column for the type of input (if you know, for instance, that the input is a street address vs. a country), or you may want the resolution split into county, city, and so on.
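
For instance (illustrative only; input_type is a name I'm making up here):

-- Record what kind of field the raw value came from, so the same
-- raw string can resolve differently for countries vs. street addresses.
alter table address_input add column input_type character varying(20);  -- e.g. 'country', 'street'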

