dmainou opened a new issue, #6369: URL: https://github.com/apache/hop/issues/6369
### What would you like to happen?

**Summary**

Add the ability to pass through selected fields from either the Reference or the Comparison dataset into the Merge-Diff output, even when those fields are not part of the key or comparison set.

This enables downstream systems (e.g. APIs) to receive required identifiers (like CRM IDs) without requiring slow, error-prone lookups after the diff.

**Problem**

In a typical ERP → CRM sync flow:

- ERP is the Comparison stream
- CRM is the Reference stream
- Merge-Diff correctly identifies identical, changed, deleted and new rows. It also outputs a JSON with the list of changes and passes through columns that are neither a key nor a comparison field.

However, when a row is deleted or changed, the data from the Reference side is partially lost because the Comparison data is kept. You can easily retrieve changed values from the changes JSON, but you cannot determine the original Reference value of fields that are just along for the ride, such as IDs (e.g. CRM IDs) that were created as nulls on the Comparison side (e.g. ERP) for merging purposes.

| Source | product_name (key) | price (value) | crm_id |
| -- | -- | -- | -- |
| ERP (Comparison) | sample | 10.00 | Null |
| CRM (Reference) | sample | 10.00 | 123 |

crm_id is required to form the CRM API update call, but:

- It is not a key
- It is not a compare field
- Therefore its Reference-side value is dropped from the Merge-Diff output

This forces users to perform a lookup back into the CRM data, which is slow and has at times caused issues (duplicate names, bad data, etc.).

**Proposed Solution**

Add a new optional tab to the Merge-Diff transform: “Pass-Through Fields”.

This tab allows users to select fields that should always be included in the Merge-Diff output, regardless of the diff outcome.
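To make the problem and the requested option concrete, here is a rough Python sketch. It is not Hop code: the function name, the `flagfield` column name and the rule that the Comparison row is emitted whenever one exists are simplifying assumptions made only for this illustration. The pass-through semantics follow the Behaviour section of this issue: the value comes from the chosen stream and is null when that stream has no row for the key.

```python
def merge_diff_with_passthrough(reference, comparison, key, compare_fields, passthrough):
    """Hypothetical model of Merge-Diff plus the proposed pass-through fields.

    reference / comparison: lists of dict rows.
    passthrough: list of (dataset, field, new_name) tuples, where dataset is
    "Reference" or "Comparison" (mirrors the proposed three-column grid).
    """
    ref_by_key = {row[key]: row for row in reference}
    comp_by_key = {row[key]: row for row in comparison}

    output = []
    for k in sorted(set(ref_by_key) | set(comp_by_key)):
        ref = ref_by_key.get(k)
        comp = comp_by_key.get(k)

        if ref is None:
            flag, base = "new", comp          # only on the Comparison side
        elif comp is None:
            flag, base = "deleted", ref       # only on the Reference side
        elif all(ref[f] == comp[f] for f in compare_fields):
            flag, base = "identical", comp    # assumption: Comparison row is kept
        else:
            flag, base = "changed", comp      # Comparison row is kept

        row = dict(base, flagfield=flag)

        # Proposed feature: copy the configured fields from the chosen stream,
        # or None when that stream has no row for this key.
        for dataset, field, new_name in passthrough:
            source = ref if dataset == "Reference" else comp
            row[new_name] = source.get(field) if source is not None else None

        output.append(row)
    return output


crm = [{"product_name": "sample", "price": 10.00, "crm_id": 123}]
erp = [{"product_name": "sample", "price": 12.00, "crm_id": None}]
rows = merge_diff_with_passthrough(
    crm, erp, "product_name", ["price"],
    [("Reference", "crm_id", "their_crm_id")],
)
```

In this sketch the changed row still carries the Comparison side's null `crm_id`, while `their_crm_id` holds the Reference value needed for the CRM API call.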
**UI Design**

A grid with 3 columns:

| Column | Description |
| -- | -- |
| Target dataset | Reference or Comparison |
| Target field | Field name from that stream |
| New field name | Name to appear in the output |

Example configuration:

| Target dataset | Target field | New field name |
| -- | -- | -- |
| Reference | crm_id | their_crm_id |

**Behaviour**

Given the example immediately above, the output mimics the current output and adds the configured pass-through fields. In this case, a new field called their_crm_id contains the value of the Reference-side crm_id at all times (i.e. regardless of whether the row is identical, deleted or changed). If the row is new (i.e. it only exists on the ERP side), the value will be null.

**Why this matters**

This unlocks Merge-Diff as a true CDC orchestration engine without forcing users to:

- Re-join back to the reference system
- Maintain shadow lookup tables

It also significantly improves:

- Performance
- Reliability
- Data correctness

<img width="1667" height="1033" alt="Image" src="https://github.com/user-attachments/assets/9ebf4674-4f9c-4460-aa4b-a02509bd1a48" />

### Issue Priority

Priority: 3

### Issue Component

Component: Hop Gui

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
