Re: Re[4]: [firebird-support] Find duplicate multi-row entries

Svein Erling Tysvær [email protected] [firebird-support] Thu, 28 Sep 2017 01:49:46 -0700

Sorry again, Daniel, fourth query required (I forgot that things may vary
depending on time, this is something I've never done before):


with tmp(NODE, ROUTE_UPDATED) as
(select distinct rh.NODE, rh.ROUTE_UPDATED
 from route_history rh
 join routes r on rh.NODE = r.NODE), //*This join is only for performance
reasons in case ROUTE_HISTORY contains a lot more records than ROUTES and
may be removed*/
new_routes(NODE) as
(select distinct t.NODE
 from tmp t
 where not exists(select *
                  from routes r
                  left join route_history rh on r.NODE = rh.NODE
                                            and r.ROUTE_INDEX =
rh.ROUTE_INDEX
                                            and r.LINK_NODE = rh.LINK_NODE
                                            and r.QUALITY is not distinct
from rh.QUALITY
                                            and t.ROUTE_UPDATED =
rh.ROUTE_UPDATED
                  where t.NODE = r.NODE
                    and rh.NODE is null)
  and not exists(select *
                  from route_history rh
                  left join routes r on r.NODE = rh.NODE
                                            and r.ROUTE_INDEX =
rh.ROUTE_INDEX
                                            and r.LINK_NODE = rh.LINK_NODE
                                            and r.QUALITY is not distinct
from rh.QUALITY
                  where t.NODE = rh.NODE
                    and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                    and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE

Hope I finally got it right,
Set

2017-09-28 10:01 GMT+02:00 Svein Erling Tysvær <[email protected]>:

> Sorry, I forgot that should probably use 'IS NOT DISTINCT' and not '=' for
> QUALITY.
>
> Note that this checks only checks if there has ever been anything
> identical in ROUTE_HISTORY, it doesn't restrain itself to the latest entry
> only (i.e. the query I've written is slightly similar to the second query I
> wrote, it has to be extended if you want something slightly similar to the
> first query). Also, my third query doesn't consider duplicate rows (since
> the fields seem to be the primary key in ROUTE, it isn't theoretically
> possible in your example).
>
> Set
>
> 2017-09-28 9:52 GMT+02:00 Svein Erling Tysvær <[email protected]>:
>
>> OK, that's completely different, but I've done something similar once
>> before. It is not possible to directly compare sets for equality, but it is
>> possible to check that set A doesn't contain anything that isn't in set B
>> and that set B doesn't contain anything that isn't in set A. I assume NODE
>> is the one common denominator:
>>
>> with tmp (NODE) as
>> (select distinct NODE from routes),
>> new_routes(NODE) as
>> (select t.NODE
>>  from tmp t
>>  where not exists(select *
>>                   from routes r
>>                   left join route_history rh on r.NODE = rh.NODE
>>                                             and r.ROUTE_INDEX =
>> rh.ROUTE_INDEX
>>                                             and r.LINK_NODE =
>> rh.LINK_NODE
>>                                             and r.QUALITY = rh.QUALITY
>>                   where t.NODE = r.NODE
>>                     and rh.NODE is null)
>>   and not exists(select *
>>                   from route_history rh
>>                   left join routes r on r.NODE = rh.NODE
>>                                             and r.ROUTE_INDEX =
>> rh.ROUTE_INDEX
>>                                             and r.LINK_NODE =
>> rh.LINK_NODE
>>                                             and r.QUALITY = rh.QUALITY
>>                   where t.NODE = rh.NODE
>>                     and r.NODE is null))
>> select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
>> from new_routes nr
>> join routes r on nr.NODE = r.NODE
>>
>> Does this get you the result you want?
>> Set
>>
>> 2017-09-28 9:19 GMT+02:00 'Daniel Miller' [email protected]
>> [firebird-support] <[email protected]>:
>>
>>>
>>>
>>> Thanks - but that still doesn't work.  I get a huge return set with a
>>> ton of duplicates.  I think something's being lost in translation.  A
>>> simple match of one row isn't sufficient - I need to match the "set".
>>>
>>> With the following six records:
>>> 5557    1    1116    0
>>> 5557    2    1115    0
>>> 5557    3    5104    2
>>> 5557    4    5073    2
>>> 5557    5    5222    2
>>> 5557    6    5110    2
>>>
>>> That is the list of routes a given node has at a specific point in
>>> time.  So it's perfectly possible at another time to have fewer, or more,
>>> or different routes.  And I need to consider ALL the routes in play for a
>>> given timestamp to be a single "set".  So if today I have the above list of
>>> routes, and yesterday route index 4 was looking at a different node - that
>>> would constitute a different and unique set even though the other 5 records
>>> match.
>>>
>>> 5557    1    1116    0
>>> 5557    2    1115    0
>>> 5557    3    5104    2
>>> 5557    4    9999    2
>>> 5557    5    5222    2
>>> 5557    6    5110    2
>>>
>>> So it's quite possible I will have lots of duplicates for an individual
>>> row in ROUTES when compared in ROUTE_HISTORY - what I'm trying to filter is
>>> the pattern of rows.  Only if ALL the rows of a current entries in ROUTES
>>> exist in ROUTE_HISTORY should it be considered a duplicate condition.
>>>
>>> --
>>> Daniel
>>>
>>> On 9/27/2017 11:53:48 PM, "Svein Erling Tysvær [email protected]
>>> [firebird-support]" <[email protected]> wrote:
>>>
>>> Sure it is possible to write such a query:
>>>
>>> select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
>>> from routes r
>>> left join route_history rh1 on r.node = rh1.node
>>>                           and r.route_index = rh1.route_index
>>>                           and r.link_node = rh1.link_node
>>>                           and r.quality is not distinct from rh1.quality
>>> left join route_history rh2 on rh1.node = rh2.node
>>>                           and rh1.route_index = rh2.route_index
>>>                           and rh1.route_updated < rh2.route_updated
>>> where rh1.node is null
>>>   and rh2.node is null
>>>
>>> I use two LEFT JOINS because I assume you want things to be returned if
>>> things are changed back, e.g. if you have
>>>
>>> 5557    1    1116    0
>>> 5557    1    1116    0
>>>
>>> then you only want one row in route_history, but if you have
>>>
>>> 5557    1    1116    0
>>> 5557    1    1116    1
>>> 5557    1    1116    0
>>>
>>> then you don't want the last record to be considered a duplicate, but
>>> know when it changed back. If ROUTE_INDEX is just a running number, you
>>> consider
>>>
>>> 5557    1    1116    0
>>> 5557    2    1116    0
>>>
>>> to be duplicates, and there are no gaps between the ROUTE_INDEX for each
>>> NODE, then the query can be simplified:
>>>
>>> select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
>>> from routes r
>>> left join route_history rh1 on r.node = rh1.node
>>>                           and r.route_index+1 = rh1.route_index
>>>                           and r.link_node = rh1.link_node
>>>                           and r.quality is not distinct from rh1.quality
>>> where rh1.node is null
>>>
>>> Note that I use IS NOT DISTINCT so that NULLs are considered equal. If
>>> you prefer, the left joins can be replaced by (nested) NOT EXISTS.
>>>
>>> HTH,
>>> Set
>>>
>>> 2017-09-28 7:22 GMT+02:00 'Daniel Miller' [email protected]
>>> [firebird-support] <[email protected]>:
>>>
>>>>
>>>>
>>>> On 9/27/2017 9:20:54 PM, "liviuslivius [email protected]
>>>> [firebird-support]" <[email protected]> wrote:
>>>>
>>>>
>>>> group by + having
>>>> is your friend.
>>>> E.g.
>>>> Select field1, field2, count(*)
>>>> From tablex
>>>> Group by field1, field2
>>>> Having count(*)>1
>>>>
>>>> But in your scenario i do not know if this is the solution. Question is
>>>> if record was changed and in next update "restored" and once again changed
>>>> is this duplicate or no?
>>>> If yes above query is ok if no then procedure is your friend.
>>>>
>>>> And for the future change your audit trigger and check before if there
>>>> was any change
>>>>
>>>> Thank you, but I this isn't quite what I need.  I need to match against
>>>> all the records of a set (up to 8 rows per set) - not just individual rows.
>>>>
>>>> --
>>>> Daniel
>>>>
>>>>
>>>>
>>>
>>>
>>> 
>>>
>>
>>
>

Re: Re[4]: [firebird-support] Find duplicate multi-row entries

Reply via email to