dmainou opened a new issue, #6369:
URL: https://github.com/apache/hop/issues/6369

   ### What would you like to happen?
   
   **Summary**
   
   Add the ability to pass through selected fields from either the Reference or 
Comparison dataset into the Merge-Diff output, even when those fields are not 
part of the key or comparison set.
   
   This enables downstream systems (e.g. APIs) to receive required identifiers 
(like CRM IDs) without requiring slow, error-prone lookups after the diff.
   
   **Problem**
   
   In a typical ERP → CRM sync flow:
   
   - ERP is the Comparison stream
   - CRM is the Reference stream
   - Merge-Diff correctly identifies identical, changed, deleted and new rows. It also outputs a JSON list of the changes and passes through columns that are neither a key nor a comparison field.
   
   However:
   When a row is deleted or changed, the data from the Reference side is partially lost because the Comparison data is kept. Values of compared fields can easily be retrieved from the changes JSON, but the original values of fields that are just along for the ride cannot be determined. These are fields such as IDs (e.g. CRM IDs) that, for merging purposes, have been created as nulls on the Comparison side (e.g. ERP).
   
   
   Source | product_name (key) | price (value) | crm_id
   -- | -- | -- | --
   ERP (Comparison) | sample | 10.00 | Null
   CRM (Reference) | sample | 10.00 | 123
   
   crm_id is required to form the CRM API update call, but:
   
   - It is not a key
   - It is not a compare field
   - Therefore its Reference-side value is dropped from the Merge-Diff output
   
   This forces users to perform a lookup back into the CRM data, which is slow and has at times caused issues (duplicate names, bad data, etc.).
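
To illustrate why the post-diff lookup is fragile, here is a minimal Python sketch using made-up data. It is not Hop code: the function `lookup_crm_ids` and the sample rows are hypothetical. It shows that when two CRM records share a name, a lookup by `product_name` cannot tell which `crm_id` belongs to the diffed row.

```python
from collections import defaultdict


def lookup_crm_ids(crm_rows, name):
    """Return every crm_id whose product_name matches (hypothetical helper).

    More than one result means the lookup is ambiguous.
    """
    index = defaultdict(list)
    for row in crm_rows:
        index[row["product_name"]].append(row["crm_id"])
    return index.get(name, [])


# Made-up CRM data containing a duplicate name.
crm_rows = [
    {"product_name": "sample", "crm_id": 123},
    {"product_name": "sample", "crm_id": 456},
]

matches = lookup_crm_ids(crm_rows, "sample")
if len(matches) != 1:
    print(f"ambiguous lookup for 'sample': {matches}")
```

Carrying `crm_id` through the diff itself would avoid this ambiguity, because the value would come from the exact Reference row that was matched.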
   
   **Proposed Solution**
   
   Add a new optional tab to the Merge-Diff transform: “Pass-Through Fields”
   
   This tab allows users to select fields that should always be included in the 
Merge-Diff output, regardless of diff outcome.
   
   UI Design
   
   A grid with 3 columns:
   
   Column | Description
   -- | --
   Target dataset | Reference or Comparison
   Target field | Field name from that stream
   New field name | Name to appear in the output
   
   Target dataset | Target field | New field name
   -- | -- | --
   Reference | crm_id | their_crm_id
   
   
   **Behaviour**
   
   Given the example immediately above, the output matches the current output plus the configured pass-through fields. In this case, a new field called their_crm_id contains the value of the Reference-side crm_id at all times (i.e. regardless of whether the row is identical, deleted or changed). If the row is new (i.e. it only exists on the ERP side), the value is null.
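
The requested behaviour can be sketched in Python as follows. This is only an illustration of the semantics described above, not Hop's implementation: `PassThrough`, `merge_diff_with_passthrough` and the `flag` field name are hypothetical, rows are plain dicts matched on a single key, and the choice of which side supplies the base row for each flag is a simplifying assumption.

```python
from typing import NamedTuple


class PassThrough(NamedTuple):
    """One row of the proposed grid (hypothetical structure)."""
    dataset: str    # "Reference" or "Comparison"
    field: str      # field name in that stream
    new_name: str   # name to appear in the output


def merge_diff_with_passthrough(reference, comparison, key, compare_fields, pass_through):
    ref_by_key = {r[key]: r for r in reference}
    cmp_by_key = {c[key]: c for c in comparison}
    out = []
    for k in sorted(ref_by_key.keys() | cmp_by_key.keys()):
        ref, cmp_ = ref_by_key.get(k), cmp_by_key.get(k)
        # Assumption: the base row is the Comparison row unless the row is deleted.
        if ref is None:
            flag, row = "new", dict(cmp_)
        elif cmp_ is None:
            flag, row = "deleted", dict(ref)
        elif all(ref[f] == cmp_[f] for f in compare_fields):
            flag, row = "identical", dict(cmp_)
        else:
            flag, row = "changed", dict(cmp_)
        row["flag"] = flag
        # Pass-through fields are always added; null when the chosen side has no row.
        for pt in pass_through:
            src = ref if pt.dataset == "Reference" else cmp_
            row[pt.new_name] = src.get(pt.field) if src is not None else None
        out.append(row)
    return out


# Made-up data based on the example table, with one changed and one new row.
crm = [{"product_name": "sample", "price": 10.00, "crm_id": 123}]
erp = [
    {"product_name": "sample", "price": 12.00, "crm_id": None},
    {"product_name": "extra", "price": 5.00, "crm_id": None},
]
rows = merge_diff_with_passthrough(
    crm, erp, "product_name", ["price"],
    [PassThrough("Reference", "crm_id", "their_crm_id")],
)
for r in rows:
    print(r["product_name"], r["flag"], r["their_crm_id"])
```

With this data, the changed row carries the CRM identifier in `their_crm_id`, while the new row, which exists only on the ERP side, gets a null there.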
   
   **Why this matters**
   
   This unlocks Merge-Diff as a true CDC orchestration engine without forcing 
users to:
   - Re-join back to the reference system
   - Maintain shadow lookup tables
   
   It also significantly improves:
   - Performance
   - Reliability
   - Data correctness
   
   <img width="1667" height="1033" alt="Image" 
src="https://github.com/user-attachments/assets/9ebf4674-4f9c-4460-aa4b-a02509bd1a48"
 />
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: Hop Gui


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
