A first-time contributor named Adam Fisher and I submitted PRs for a "deduplicate record" processor at roughly the same time. His focused mainly on removing duplicates from within a record set, using the record set itself as the source of truth, whereas mine relied on a DistributedMapCache and record path operations to target data lake-wide deduplication.
Here's his PR for reference: https://github.com/apache/nifi/pull/3317

The Git history is fairly broken at this point (I tried a rebase and found some really bad merge commits), but I was able to squash it and cherry-pick it onto main. I think these are two separate use cases and should probably be two separate processors to keep things simple. Before I put much effort into pushing both PRs along, I'd like to know whether anyone else has preferences or ideas on this.

Thanks,
Mike
