Re[4]: [firebird-support] Find duplicate multi-row entries

'Daniel Miller' dmil...@amfes.com [firebird-support] Sun, 01 Oct 2017 15:51:49 -0700

After further review...I found I goofed. This one actually gives theresults I was expecting - but I'm still asking about how to join withit.


CREATE PROCEDURE CHECK_ROUTE_EXISTS (
    CHECK_NODE smallint )
RETURNS (
    NODE smallint,
    ROUTE_UPDATED timestamp )
AS
declare variable RU1 timestamp;
declare variable RU2 timestamp;
declare variable RU3 timestamp;
begin
  for select RH1.ROUTE_UPDATED
    from ROUTE_HISTORY RH1
    join ROUTES R1 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)
    where RH1.NODE = :CHECK_NODE and RH1.ROUTE_INDEX = 1
    order by NODE, ROUTE_UPDATED, ROUTE_INDEX
    into RU1 do
      for select RH2.ROUTE_UPDATED
      from ROUTE_HISTORY RH2
      join ROUTES R2 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)

where RH2.NODE = :CHECK_NODE and RH2.ROUTE_UPDATED=:RU1 andRH2.ROUTE_INDEX = 2

      order by NODE, ROUTE_UPDATED, ROUTE_INDEX
      into RU2 do
        for select RH3.ROUTE_UPDATED
        from ROUTE_HISTORY RH3
        join ROUTES R3 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)

where RH3.NODE = :CHECK_NODE and RH3.ROUTE_UPDATED=:RU2 andRH3.ROUTE_INDEX = 3

        order by NODE, ROUTE_UPDATED, ROUTE_INDEX
        into RU3 do
          for select RH4.ROUTE_UPDATED
          from ROUTE_HISTORY RH4
          join ROUTES R4 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)

where RH4.NODE = :CHECK_NODE and RH4.ROUTE_UPDATED=:RU3 andRH4.ROUTE_INDEX = 4

          order by NODE, ROUTE_UPDATED, ROUTE_INDEX
          into :ROUTE_UPDATED do begin
            NODE = CHECK_NODE;
            suspend;
          end
END^


--
Daniel

On 10/1/2017 3:31:57 PM, "'Daniel Miller' dmil...@amfes.com[firebird-support]" <firebird-support@yahoogroups.com> wrote:

Maybe a little progress.

This procedure does exactly what I want...but not exactly HOW I want:

CREATE PROCEDURE CHECK_ROUTE_EXISTS (
    CHECK_NODE smallint )
RETURNS (
    NODE smallint,
    ROUTE_UPDATED timestamp )
AS
declare variable PLACEHOLDER timestamp;
begin
  /* Assignment for valid return value */
  NODE = CHECK_NODE;
  for select RH1.ROUTE_UPDATED
    from ROUTE_HISTORY RH1
    join ROUTES R1 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)
    where RH1.NODE = :CHECK_NODE and RH1.ROUTE_INDEX = 1
    order by NODE, ROUTE_UPDATED, ROUTE_INDEX
    into PLACEHOLDER do
      for select RH2.ROUTE_UPDATED
      from ROUTE_HISTORY RH2
      join ROUTES R2 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)
      where RH2.NODE = :CHECK_NODE and RH2.ROUTE_INDEX = 2
      order by NODE, ROUTE_UPDATED, ROUTE_INDEX
      into PLACEHOLDER do
        for select RH3.ROUTE_UPDATED
        from ROUTE_HISTORY RH3
        join ROUTES R3 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)
        where RH3.NODE = :CHECK_NODE and RH3.ROUTE_INDEX = 3
        order by NODE, ROUTE_UPDATED, ROUTE_INDEX
        into PLACEHOLDER do
          for select RH4.ROUTE_UPDATED
          from ROUTE_HISTORY RH4
          join ROUTES R4 using (NODE, ROUTE_INDEX, LINK_NODE, QUALITY)
          where RH4.NODE = :CHECK_NODE and RH4.ROUTE_INDEX = 4
          order by NODE, ROUTE_UPDATED, ROUTE_INDEX
          into :ROUTE_UPDATED do suspend;
  /* If matched across 4 levels this is a duplicate.
      otherwise ROUTE_UPDATED will be NULL */
END
So...this gives me a list of all duplicates for a given NODE. Since Inow have something somewhat functional - some optimization questions:
1. I don't think I would gain anything by tweaking the above intousing recursion - but if someone thinks so...please share.
2. I would like to be able to run this against/with other queries.Among other reasons - I want to look at doing a bulk delete in myhistory table. At the moment, I can "manually" run something like:SELECT p.NODE, p.ROUTE_UPDATED FROM CHECK_ROUTE_EXISTS('CHECK_NODE') p;
but I'm not sure how to apply this against my entire NODES table, norhow to run a delete with it. I tried:delete from ROUTE_HISTORY H where H.NODE=5003 and H.ROUTE_UPDATEDin (select ROUTE_UPDATED from CHECK_ROUTE_EXISTS(5003))
but that broke my poor server's brain - I had to stop the statement.

--
Daniel

On 9/30/2017 9:39:31 PM, "Daniel Miller" <dmil...@amfes.com> wrote:
Set,
Thank you. This definitely does something...to be honest I'm notcertain exactly what... I'm definitely going to need to study this fora while. I sincerely appreciate the effort - though I think I am goingto reconsider my table structure.
--
Daniel
On 9/28/2017 1:42:18 PM, "setysvar setys...@gmail.com[firebird-support]" <firebird-support@yahoogroups.com> wrote:
Sorry again, Daniel, I had to get to a place where I could testthings before replying. My problem was that what I'd done before wasto try to find identical sets, not trying to find unique sets, andthat made me mess up the logic. I hope this query will get you whatyou want (and this time I've tested before replying):
with tmp(NODE, ROUTE_UPDATED) as
(select distinct r.NODE, rh.ROUTE_UPDATED
 from routes r
 left join route_history rh on r.NODE = rh.NODE),
new_routes(NODE) as
(select distinct t.NODE
 from tmp t
 where exists(select *
              from routes r
              left join route_history rh on r.NODE = rh.NODE
and r.ROUTE_INDEX =rh.ROUTE_INDEXand r.LINK_NODE =rh.LINK_NODEand r.QUALITY is not distinctfrom rh.QUALITYand t.ROUTE_UPDATED =rh.ROUTE_UPDATED
              where t.NODE = r.NODE
                and rh.NODE is null)
    or exists(select *
              from route_history rh
              left join routes r on r.NODE = rh.NODE
                                and r.ROUTE_INDEX = rh.ROUTE_INDEX
                                and r.LINK_NODE = rh.LINK_NODE
and r.QUALITY is not distinct fromrh.QUALITY
              where t.NODE = rh.NODE
                and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE

HTH,
Set
Den 28.09.2017 11:00, skrev 'Daniel Miller' dmil...@amfes.com[firebird-support]:
This appears no different than a simple:

    select * from routes
So...not there yet. I think what I'm fighting is the lack of pivotor crosstab functionality - and I may have no choice (though itoffends me deeply) but to setup my archive table for thiscomparison. Either via the hated structure of column names ROUTE1,ROUTE2, ROUTE8...or the only slightly less offensive VARCHARconcatenation.
A concatenated LIST seems like it would be a elegant solution - andwhile operating over the whole table is time-consuming a single nodeisn't too bad. And if I can get the history filtered and oldduplicate records deleted it would be fine. But the inconsistencyis killing me.
There MUST be a Firebird-friendly, normalized, compact storagestructure that will work. I'm just not seeing it.
--
Daniel
On 9/28/2017 1:19:18 AM, "Svein Erling Tysvær setys...@gmail.com[firebird-support]" <firebird-support@yahoogroups.com> wrote:
Sorry again, Daniel, fourth query required (I forgot that thingsmay vary depending on time, this is something I've never donebefore):
with tmp(NODE, ROUTE_UPDATED) as
(select distinct rh.NODE, rh.ROUTE_UPDATED
 from route_history rh
join routes r on rh.NODE = r.NODE), //*This join is only forperformance reasons in case ROUTE_HISTORY contains a lot morerecords than ROUTES and may be removed*/
new_routes(NODE) as
(select distinct t.NODE
 from tmp t
 where not exists(select *
                  from routes r
                  left join route_history rh on r.NODE = rh.NODE
and r.ROUTE_INDEX =rh.ROUTE_INDEXand r.LINK_NODE =rh.LINK_NODEand r.QUALITY is notdistinct from rh.QUALITYand t.ROUTE_UPDATED =rh.ROUTE_UPDATED
                  where t.NODE = r.NODE
                    and rh.NODE is null)
  and not exists(select *
                  from route_history rh
                  left join routes r on r.NODE = rh.NODE
and r.ROUTE_INDEX =rh.ROUTE_INDEXand r.LINK_NODE =rh.LINK_NODEand r.QUALITY is notdistinct from rh.QUALITY
                  where t.NODE = rh.NODE
                    and t.ROUTE_UPDATED = rh.ROUTE_UPDATED
                    and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE

Hope I finally got it right,
Set
2017-09-28 10:01 GMT+02:00 Svein Erling Tysvær<setys...@gmail.com>:
Sorry, I forgot that should probably use 'IS NOT DISTINCT' and not'=' for QUALITY.
Note that this checks only checks if there has ever been anythingidentical in ROUTE_HISTORY, it doesn't restrain itself to thelatest entry only (i.e. the query I've written is slightly similarto the second query I wrote, it has to be extended if you wantsomething slightly similar to the first query). Also, my thirdquery doesn't consider duplicate rows (since the fields seem to bethe primary key in ROUTE, it isn't theoretically possible in yourexample).
Set
2017-09-28 9:52 GMT+02:00 Svein Erling Tysvær<setys...@gmail.com>:
OK, that's completely different, but I've done something similaronce before. It is not possible to directly compare sets forequality, but it is possible to check that set A doesn't containanything that isn't in set B and that set B doesn't containanything that isn't in set A. I assume NODE is the one commondenominator:
with tmp (NODE) as
(select distinct NODE from routes),
new_routes(NODE) as
(select t.NODE
 from tmp t
 where not exists(select *
                  from routes r
                  left join route_history rh on r.NODE = rh.NODE
and r.ROUTE_INDEX =rh.ROUTE_INDEXand r.LINK_NODE =rh.LINK_NODEand r.QUALITY =rh.QUALITY
                  where t.NODE = r.NODE
                    and rh.NODE is null)
  and not exists(select *
                  from route_history rh
                  left join routes r on r.NODE = rh.NODE
and r.ROUTE_INDEX =rh.ROUTE_INDEXand r.LINK_NODE =rh.LINK_NODEand r.QUALITY =rh.QUALITY
                  where t.NODE = rh.NODE
                    and r.NODE is null))
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from new_routes nr
join routes r on nr.NODE = r.NODE

Does this get you the result you want?
Set
2017-09-28 9:19 GMT+02:00 'Daniel Miller' dmil...@amfes.com[firebird-support] <firebird-support@yahoogroups.com<mailto:firebird-support@yahoogroups.com>>:
Thanks - but that still doesn't work. I get a huge return setwith a ton of duplicates. I think something's being lost intranslation. A simple match of one row isn't sufficient - Ineed to match the "set".
With the following six records:
5557    1    1116    0
5557    2    1115    0
5557    3    5104    2
5557    4    5073    2
5557    5    5222    2
5557    6    5110    2
That is the list of routes a given node has at a specific pointin time. So it's perfectly possible at another time to havefewer, or more, or different routes. And I need to consider ALLthe routes in play for a given timestamp to be a single "set".So if today I have the above list of routes, and yesterday routeindex 4 was looking at a different node - that would constitutea different and unique set even though the other 5 recordsmatch.
5557    1    1116    0
5557    2    1115    0
5557    3    5104    2
5557    4    9999    2
5557    5    5222    2
5557    6    5110    2
So it's quite possible I will have lots of duplicates for anindividual row in ROUTES when compared in ROUTE_HISTORY - whatI'm trying to filter is the pattern of rows. Only if ALL therows of a current entries in ROUTES exist in ROUTE_HISTORYshould it be considered a duplicate condition.
--
Daniel
On 9/27/2017 11:53:48 PM, "Svein Erling Tysværsetys...@gmail.com [firebird-support]"<firebird-support@yahoogroups.com<mailto:firebird-support@yahoogroups.com>> wrote:
Sure it is possible to write such a query:

select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from routes r
left join route_history rh1 on r.node = rh1.node
                          and r.route_index = rh1.route_index
                          and r.link_node = rh1.link_node
and r.quality is not distinct fromrh1.quality
left join route_history rh2 on rh1.node = rh2.node
                          and rh1.route_index = rh2.route_index
and rh1.route_updated <rh2.route_updated
where rh1.node is null
  and rh2.node is null
I use two LEFT JOINS because I assume you want things to bereturned if things are changed back, e.g. if you have
5557    1    1116    0
5557    1    1116    0

then you only want one row in route_history, but if you have

5557    1    1116    0
5557    1    1116    1
5557    1    1116    0
then you don't want the last record to be considered aduplicate, but know when it changed back. If ROUTE_INDEX isjust a running number, you consider
5557    1    1116    0
5557    2    1116    0
to be duplicates, and there are no gaps between the ROUTE_INDEXfor each NODE, then the query can be simplified:
select r.NODE, r.ROUTE_INDEX, r.LINK_NODE, r.QUALITY
from routes r
left join route_history rh1 on r.node = rh1.node
                          and r.route_index+1 = rh1.route_index
                          and r.link_node = rh1.link_node
and r.quality is not distinct fromrh1.quality
where rh1.node is null
Note that I use IS NOT DISTINCT so that NULLs are consideredequal. If you prefer, the left joins can be replaced by(nested) NOT EXISTS.
HTH,
Set
2017-09-28 7:22 GMT+02:00 'Daniel Miller' dmil...@amfes.com[firebird-support] <firebird-support@yahoogroups.com<mailto:firebird-support@yahoogroups.com>>:
On 9/27/2017 9:20:54 PM, "liviusliviusliviusliv...@poczta.onet.pl [firebird-support]"<firebird-support@yahoogroups.com<mailto:firebird-support@yahoogroups.com>> wrote:
group by + having
is your friend.
E.g.
Select field1, field2, count(*)
From tablex
Group by field1, field2
Having count(*)>1
But in your scenario i do not know if this is the solution.Question is if record was changed and in next update"restored" and once again changed is this duplicate or no?
If yes above query is ok if no then procedure is your friend.
And for the future change your audit trigger and check beforeif there was any change
Thank you, but I this isn't quite what I need. I need tomatch against all the records of a set (up to 8 rows per set)- not just individual rows.
--
Daniel

Re[4]: [firebird-support] Find duplicate multi-row entries

Reply via email to