Re: [firebird-support] Find duplicate multi-row entries

setysvar [email protected] [firebird-support] Thu, 28 Sep 2017 13:53:04 -0700

Sorry again, Daniel, I had to get to a place where I could test thingsbefore replying. My problem was that what I'd done before was to try tofind identical sets, not trying to find unique sets, and that made memess up the logic. I hope this query will get you what you want (andthis time I've tested before replying):


with tmp(NODE, ROUTE_UPDATED) as
(select distinct r.NODE, rh.ROUTE_UPDATED
 from routes r
 left join route_history rh on r.NODE = rh.NODE),
new_routes(NODE) as
(select distinct t.NODE
 from tmp t
 where exists(select *
              from routes r
              left join route_history rh on r.NODE = rh.NODE
                                        and r.ROUTE_INDEX = rh.ROUTE_INDEX
                                        and r.LINK_NODE = rh.LINK_NODE

and r.QUALITY is not distinctfrom rh.QUALITYand t.ROUTE_UPDATED =rh.ROUTE_UPDATED

              where t.NODE = r.NODE
                and rh.NODE is null)
    or exists(select *
              from route_history rh
              left join routes r on r.NODE = rh.NODE
                                and r.ROUTE_INDEX = rh.ROUTE_INDEX
                                and r.LINK_NODE = rh.LINK_NODE

and r.QUALITY is not distinct fromrh.QUALITY

              where t.NODE = rh.NODE
                and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE


HTH,
Set

Den 28.09.2017 11:00, skrev 'Daniel Miller' [email protected][firebird-support]:



This appears no different than a simple:

    select * from routes

So...not there yet. I think what I'm fighting is the lack of pivot orcrosstab functionality - and I may have no choice (though it offendsme deeply) but to setup my archive table for this comparison. Eithervia the hated structure of column names ROUTE1, ROUTE2, ROUTE8...orthe only slightly less offensive VARCHAR concatenation.

A concatenated LIST seems like it would be a elegant solution - andwhile operating over the whole table is time-consuming a single nodeisn't too bad. And if I can get the history filtered and oldduplicate records deleted it would be fine. But the inconsistency iskilling me.

There MUST be a Firebird-friendly, normalized, compact storagestructure that will work. I'm just not seeing it.

--
Daniel

On 9/28/2017 1:19:18 AM, "Svein Erling Tysvær [email protected][firebird-support]" <[email protected]<mailto:[email protected]>> wrote:

Sorry again, Daniel, fourth query required (I forgot that things mayvary depending on time, this is something I've never done before):


with tmp(NODE, ROUTE_UPDATED) as
(select distinct rh.NODE, rh.ROUTE_UPDATED
 from route_history rh

join routes r on rh.NODE = r.NODE), //*This join is only forperformance reasons in case ROUTE_HISTORY contains a lot more recordsthan ROUTES and may be removed*/

new_routes(NODE) as
(select distinct t.NODE
 from tmp t
 where not exists(select *
                  from routes r
                  left join route_history rh on r.NODE = rh.NODE
              and r.ROUTE_INDEX = rh.ROUTE_INDEX
              and r.LINK_NODE = rh.LINK_NODE
              and r.QUALITY is not distinct from rh.QUALITY
              and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                  where t.NODE = r.NODE
                    and rh.NODE is null)
  and not exists(select *
                  from route_history rh
                  left join routes r on r.NODE = rh.NODE
              and r.ROUTE_INDEX = rh.ROUTE_INDEX
              and r.LINK_NODE = rh.LINK_NODE
              and r.QUALITY is not distinct from rh.QUALITY
                  where t.NODE = rh.NODE
                    and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                    and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE

Hope I finally got it right,
Set

2017-09-28 10:01 GMT+02:00 Svein Erling Tysvær <[email protected]<mailto:[email protected]>>:


    Sorry, I forgot that should probably use 'IS NOT DISTINCT' and
    not '=' for QUALITY.

    Note that this checks only checks if there has ever been anything
    identical in ROUTE_HISTORY, it doesn't restrain itself to the
    latest entry only (i.e. the query I've written is slightly
    similar to the second query I wrote, it has to be extended if you
    want something slightly similar to the first query). Also, my
    third query doesn't consider duplicate rows (since the fields
    seem to be the primary key in ROUTE, it isn't theoretically
    possible in your example).

    Set

    2017-09-28 9:52 GMT+02:00 Svein Erling Tysvær <[email protected]
    <mailto:[email protected]>>:

        OK, that's completely different, but I've done something
        similar once before. It is not possible to directly compare
        sets for equality, but it is possible to check that set A
        doesn't contain anything that isn't in set B and that set B
        doesn't contain anything that isn't in set A. I assume NODE
        is the one common denominator:

        with tmp (NODE) as
        (select distinct NODE from routes),
        new_routes(NODE) as
        (select t.NODE
         from tmp t
         where not exists(select *
                          from routes r
                          left join route_history rh on r.NODE = rh.NODE
                                            and r.ROUTE_INDEX =
        rh.ROUTE_INDEX
                                            and r.LINK_NODE =
        rh.LINK_NODE
                                            and r.QUALITY = rh.QUALITY
                  where t.NODE = r.NODE
                    and rh.NODE is null)
          and not exists(select *
                          from route_history rh
                          left join routes r on r.NODE = rh.NODE
                                            and r.ROUTE_INDEX =
        rh.ROUTE_INDEX
                                            and r.LINK_NODE =
        rh.LINK_NODE
                                            and r.QUALITY = rh.QUALITY
                  where t.NODE = rh.NODE
                    and r.NODE is null))
        select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
        from new_routes nr
        join routes r on nr.NODE = r.NODE

        Does this get you the result you want?
        Set

        2017-09-28 9:19 GMT+02:00 'Daniel Miller' [email protected]
        <mailto:[email protected]> [firebird-support]
        <[email protected]
        <mailto:[email protected]>>:



            Thanks - but that still doesn't work.  I get a huge
            return set with a ton of duplicates.  I think something's
            being lost in translation.  A simple match of one row
            isn't sufficient - I need to match the "set".

            With the following six records:
            5557    1    1116    0
            5557    2    1115    0
            5557    3    5104    2
            5557    4    5073    2
            5557    5    5222    2
            5557    6    5110    2

            That is the list of routes a given node has at a specific
            point in time.  So it's perfectly possible at another
            time to have fewer, or more, or different routes.  And I
            need to consider ALL the routes in play for a given
            timestamp to be a single "set".  So if today I have the
            above list of routes, and yesterday route index 4 was
            looking at a different node - that would constitute a
            different and unique set even though the other 5 records
            match.

            5557    1   1116    0
            5557    2 1115    0
            5557    3 5104    2
            5557 4    9999    2
            5557    5 5222    2
            5557    6 5110    2

            So it's quite possible I will have lots of duplicates for
            an individual row in ROUTES when compared in
            ROUTE_HISTORY - what I'm trying to filter is the pattern
            of rows.  Only if ALL the rows of a current entries in
            ROUTES exist in ROUTE_HISTORY should it be considered a
            duplicate condition.

            --
            Daniel

            On 9/27/2017 11:53:48 PM, "Svein Erling Tysvær
            [email protected] <mailto:[email protected]>
            [firebird-support]" <[email protected]
            <mailto:[email protected]>> wrote:

            Sure it is possible to write such a query:

            select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
            from routes r
            left join route_history rh1 on r.node = rh1.node
                and r.route_index = rh1.route_index
                and r.link_node = rh1.link_node
                and r.quality is not distinct from rh1.quality
            left join route_history rh2 on rh1.node = rh2.node
                and rh1.route_index = rh2.route_index
                and rh1.route_updated < rh2.route_updated
            where rh1.node is null
            and rh2.node is null

            I use two LEFT JOINS because I assume you want things to
            be returned if things are changed back, e.g. if you have

            5557   1    1116    0
            5557   1    1116    0

            then you only want one row in route_history, but if you have

            5557   1    1116    0
            5557   1    1116    1
            5557   1    1116    0

            then you don't want the last record to be considered a
            duplicate, but know when it changed back. If ROUTE_INDEX
            is just a running number, you consider

            5557   1    1116    0
            5557   2    1116    0

            to be duplicates, and there are no gaps between the
            ROUTE_INDEX for each NODE, then the query can be simplified:

            select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
            from routes r
            left join route_history rh1 on r.node = rh1.node
                  and r.route_index+1 = rh1.route_index
                  and r.link_node = rh1.link_node
                  and r.quality is not distinct from rh1.quality
            where rh1.node is null

            Note that I use IS NOT DISTINCT so that NULLs are
            considered equal. If you prefer, the left joins can be
            replaced by (nested) NOT EXISTS.

            HTH,
            Set

            2017-09-28 7:22 GMT+02:00 'Daniel Miller'
            [email protected] <mailto:[email protected]>
            [firebird-support] <[email protected]
            <mailto:[email protected]>>:



                On 9/27/2017 9:20:54 PM, "liviuslivius
                [email protected]
                <mailto:[email protected]>
                [firebird-support]"
                <[email protected]
                <mailto:[email protected]>> wrote:


                group by + having
                is your friend.
                E.g.
                Select field1, field2, count(*)
                From tablex
                Group by field1, field2
                Having count(*)>1

                But in your scenario i do not know if this is the
                solution. Question is if record was changed and in
                next update "restored" and once again changed is
                this duplicate or no?
                If yes above query is ok if no then procedure is
                your friend.

                And for the future change your audit trigger and
                check before if there was any change

                Thank you, but I this isn't quite what I need.  I
                need to match against all the records of a set (up
                to 8 rows per set) - not just individual rows.

                --
                Daniel

Re: [firebird-support] Find duplicate multi-row entries

Reply via email to