All:

This is written as a data-conversion tutorial, over-long relative to my
normal postings. In the end, success; but if you've got something better
to do - like painting the garage - that's probably the better use of
your day.

This may be relevant to any rocky data-conversion effort, where
lossless preservation and extraction of non-normalized flat-file data
offer traps and challenges we'd never build into our own applications.

INTRODUCTION ****************

I concluded on waking that I was firing blindly. The binary turmoil of
my Flat File Database from Heck (F2DFH) imported table demanded a
cleaner starting point. The suggestion to extract the relevant columns
to a table where duplicates might be deleted was the first 'aha' step
towards a solution.

The ultimate goal is to create a fully normalized table of uniquely
serialized rows composed of the relevant column data extracted from
F2DFH. We start with tens of thousands of duplicate rows - which present
no problem - and tens of thousands of rows sharing serno values whose
column data is variably strewn across the rows. Some of these rows are
fully formed, some are not; and some cannot be judged fully formed until
all their related fellows have been reviewed for pieces missing from the
others.

The initial limited goal is to create a list of serno values whose rows
are non-duplicates. With R:Base, there's always a way.

STARTING POINT **************

Here's a stylized view of F2DFH contents:

   serno|col1|col2|...|coln
   ------------------------
   12345|100 |ABC |...|200
   12345|100 |ABC |...|200
   12345|100 |ABC |...|200
   12345|100 |ABC |...|200
   67890|200 |-0- |...|300
   67890|200 |XYZ |...|300
   67890|-0- |XYZ |...|300
   67890|-0- |XYZ |...|300
   67890|-0- |XYZ |...|300
   67890|-0- |XYZ |...|300
   67890|-0- |XYZ |...|400*
   67890|-0- |-0- |...|300
   67890|-0- |-0- |...|500*

   * We know from other available information that '400' and '500' are
data-entry errors to be cleaned up later.

Here's what we want to end up with in the final table:

   serno|col1|col2|...|coln
   ------------------------
   12345|100 |ABC |...|200
   67890|200 |XYZ |...|300

The serno list is a step along the way.


STEP 1: CREATE A TEMPORARY VIEW TO ELIMINATE DUPLICATION:

   CREATE TEMP VIEW tvw_humpty AS +
   SELECT DISTINCT serno,col1,col2,...,coln +
   FROM F2DFH +
   ORDER BY serno,col1,col2

...producing a view of 14,340 unique rows which, in aggregate, capture
all field values, including nulls:

   serno|col1|col2|...|coln
   ------------------------
   12345|100 |ABC |...|200
   67890|200 |-0- |...|300
   67890|200 |XYZ |...|300
   67890|-0- |XYZ |...|300
   67890|-0- |XYZ |...|400
   67890|-0- |-0- |...|300
   67890|-0- |-0- |...|500
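For readers outside R:Base, the duplicate-collapsing step can be
sketched in standard SQL. Here is a small sqlite3 illustration built
from the stylized sample data above; the table and column names follow
the example, '-0-' is rendered as NULL, and R:Base's TEMP keyword and
'+' continuations are dropped - all assumptions of the sketch, not
R:Base syntax:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE F2DFH (serno INTEGER, col1 INTEGER, "
            "col2 TEXT, coln INTEGER)")
# The thirteen stylized F2DFH rows, with '-0-' stored as NULL.
rows = [
    (12345, 100, "ABC", 200), (12345, 100, "ABC", 200),
    (12345, 100, "ABC", 200), (12345, 100, "ABC", 200),
    (67890, 200, None, 300), (67890, 200, "XYZ", 300),
    (67890, None, "XYZ", 300), (67890, None, "XYZ", 300),
    (67890, None, "XYZ", 300), (67890, None, "XYZ", 300),
    (67890, None, "XYZ", 400), (67890, None, None, 300),
    (67890, None, None, 500),
]
con.executemany("INSERT INTO F2DFH VALUES (?,?,?,?)", rows)

# SELECT DISTINCT collapses exact duplicates while keeping every
# distinct combination of column values, nulls included.
unique_rows = con.execute(
    "SELECT DISTINCT serno, col1, col2, coln FROM F2DFH "
    "ORDER BY serno, col1, col2").fetchall()
print(len(unique_rows))  # 7 of the 13 sample rows survive
```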


STEP 2: PROJECT THE VIEW TO A WORKING TABLE:

   PROJECT TEMPORARY tbl_humpty +
   FROM tvw_humpty +
   USING * +
   ORDER BY serno,col1,col2

...which simply copies the tvw_humpty data, providing a clean working
table for the GROUP BY ... HAVING queries that follow.
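PROJECT is R:Base-specific; in ANSI-style SQL the closest equivalent is
CREATE TABLE ... AS SELECT. A minimal sqlite3 sketch of the same
materialization step - the names follow the example above, and the
single sample row is illustrative only:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE F2DFH (serno INTEGER, col1 INTEGER, "
            "col2 TEXT, coln INTEGER)")
con.execute("INSERT INTO F2DFH VALUES (12345, 100, 'ABC', 200)")
con.execute("CREATE VIEW tvw_humpty AS "
            "SELECT DISTINCT serno, col1, col2, coln FROM F2DFH")

# Standard-SQL stand-in for R:Base's PROJECT: materialize the view into
# a working table that later GROUP BY queries can run against.
con.execute("CREATE TABLE tbl_humpty AS "
            "SELECT * FROM tvw_humpty ORDER BY serno, col1, col2")
n = con.execute("SELECT COUNT(*) FROM tbl_humpty").fetchone()[0]
print(n)  # 1
```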


STEP 3: GENERATE THE LIST OF serno VALUES APPEARING MORE THAN ONCE

   SELECT serno +
   FROM tbl_humpty t1 +
   WHERE serno IN +
    (SELECT serno +
     FROM tbl_humpty t2 +
     GROUP BY serno +
     HAVING COUNT(*) > 1)

(building on a construct given to me by Bill Downall and Karen Teller
two years ago - thanks again)

...produces an intermediate list of serno values, one for each row in a
duplicated group:

   serno
   -----
   67890
   67890
   67890
   67890
   67890
   67890

...which is not enough. Adding an outer GROUP BY ... HAVING clause:

   SELECT serno +
   FROM tbl_humpty t1 +
   WHERE serno IN +
    (SELECT serno +
     FROM tbl_humpty t2 +
     GROUP BY serno +
     HAVING COUNT(*) > 1) +
   GROUP BY serno +
   HAVING COUNT(*) > 1

...we get what we want:

   serno
   -----
   67890

...in the real world this exercise addresses, one of 2,531 duplicated
serial numbers. Ugh.
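Outside R:Base the same two-stage construct behaves identically. A
sqlite3 sketch of the before-and-after, using the seven unique sample
rows (standard SQL, not R:Base syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_humpty (serno INTEGER, col1 INTEGER, "
            "col2 TEXT, coln INTEGER)")
con.executemany("INSERT INTO tbl_humpty VALUES (?,?,?,?)", [
    (12345, 100, "ABC", 200),
    (67890, 200, None, 300), (67890, 200, "XYZ", 300),
    (67890, None, "XYZ", 300), (67890, None, "XYZ", 400),
    (67890, None, None, 300), (67890, None, None, 500),
])

sub = "SELECT serno FROM tbl_humpty GROUP BY serno HAVING COUNT(*) > 1"

# Without the outer GROUP BY: one row per member of a duplicated group.
raw = con.execute(
    f"SELECT serno FROM tbl_humpty WHERE serno IN ({sub})").fetchall()
print(raw)  # six (67890,) rows

# With the outer GROUP BY ... HAVING: one row per duplicated serno.
collapsed = con.execute(
    f"SELECT serno FROM tbl_humpty WHERE serno IN ({sub}) "
    "GROUP BY serno HAVING COUNT(*) > 1").fetchall()
print(collapsed)  # [(67890,)]
```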


As an aside, tweaking the select statement to read 'SELECT COUNT(*)
AS dupegroup, serno' gives a clear picture of the degree of
duplication:

   dupegroup  serno
   ---------- -----
            6 67890
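That counting variant translates directly to standard SQL; a sqlite3
sketch with the same distribution (one 12345 row, six 67890 rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_humpty (serno INTEGER)")
con.executemany("INSERT INTO tbl_humpty VALUES (?)",
                [(12345,)] + [(67890,)] * 6)

# COUNT(*) per serno group exposes the degree of duplication.
dupes = con.execute(
    "SELECT COUNT(*) AS dupegroup, serno FROM tbl_humpty "
    "GROUP BY serno HAVING COUNT(*) > 1").fetchall()
print(dupes)  # [(6, 67890)]
```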


STEP 4: CREATE A COMMAND FILE

--File:HUMPTY_2_DUMPTY.RMD
--Called From: <none>
--Auth: BAChitiea
--Date: 2012-07-07

--Set Environment
  DEBUG SET TRACE ON
  SET MESSAGES ON
  SET ERROR MESSAGES ON

--PreCleanup
  SET ERROR MESSAGE 677 OFF
  SET ERROR MESSAGE 2038 OFF
  DROP VIEW tvw_dumpty
  DROP TABLE tbl_humpty
  DROP VIEW tvw_humpty
  SET ERROR MESSAGE 2038 ON
  SET ERROR MESSAGE 677 ON

--CREATE TEMPORARY VIEW tvw_humpty
   CREATE TEMP VIEW tvw_humpty AS +
   SELECT DISTINCT serno,col1,col2,...,coln +
   FROM F2DFH +
   ORDER BY serno,col1,col2

--PROJECT TEMPORARY TABLE tbl_humpty
   PROJECT TEMPORARY tbl_humpty +
   FROM tvw_humpty +
   USING * +
   ORDER BY serno,col1,col2

--CREATE TEMPORARY VIEW tvw_dumpty
   CREATE TEMPORARY VIEW tvw_dumpty AS +
   SELECT serno +
   FROM tbl_humpty t1 +
   WHERE serno IN +
    (SELECT serno +
     FROM tbl_humpty t2 +
     GROUP BY serno +
     HAVING COUNT(*) > 1) +
   GROUP BY serno +
   HAVING COUNT(*) > 1
   CLS
   RETURN

***************

Temporary View tvw_dumpty becomes the input for the cursor that will
drive the one-time walk through F2DFH to clean up the mess, much of
which is eyeballs-on review. If this were to become a regular affair,
I'd build a column-by-column parsing engine so that everything runs over
lunchtime.
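The cleanup pass itself is left to that cursor-driven walk, so the
following is only a guess at its mechanics: a sketch that coalesces one
duplicated group by keeping the lone non-null value per column and
flagging conflicting values - such as the '400'/'500' data-entry errors
- for eyeballs-on review. The function name and the conflict rule are
my assumptions, not the original procedure.

```python
def coalesce_group(rows):
    """Merge rows sharing a serno: keep the lone non-null value per
    column; collect the indexes of columns holding conflicting
    non-null values so they can be reviewed by hand."""
    merged, conflicts = list(rows[0]), set()
    for row in rows[1:]:
        for i, val in enumerate(row):
            if val is None:
                continue            # nothing new in this cell
            if merged[i] is None:
                merged[i] = val     # fill a gap from a sibling row
            elif merged[i] != val:
                conflicts.add(i)    # two non-null values disagree

    return merged, conflicts

group = [
    (67890, 200, None, 300),
    (67890, None, "XYZ", 300),
    (67890, None, None, 500),   # the '500' data-entry error
]
merged, conflicts = coalesce_group(group)
print(merged)     # [67890, 200, 'XYZ', 300]
print(conflicts)  # {3} -> last column needs eyeballs-on review
```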

Thank you very much if you made it this far. Any and all suggestions
welcome.

Yours,

Bruce Chitiea
SafeSectors, Inc.
eCondoMetrics
eWaterMetrics
