I have a little plugin (attached) that looks for transactions between my
accounts and marks similar ones as duplicates (and removes one). It can
handle some small differences and has done a surprisingly good job for me.

The plugin works better when you keep separate bean files per Asset account
(which I do).
To avoid re-importing transactions, my importers look for the last
transaction in the corresponding bean file and only add newer ones. This
assumes that transactions don't change after their date. Occasionally this
leads to errors when companies (e.g. car rental, hotel) block money on my
debit card, I run an import, and they later release the hold. But overall it
works well. For a while I tried writing heuristics to distinguish my manual
changes (which I want to keep) from upstream changes; even having unique IDs
didn't make that easy.
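
For anyone curious, here is a minimal sketch of that importer trick (not my
actual code; the ledger file name and downloaded_rows are placeholders):

    from datetime import date

    from beancount import loader
    from beancount.core import data

    def last_transaction_date(bean_file: str) -> date:
        """Date of the newest transaction already booked in this ledger."""
        entries, _errors, _options = loader.load_file(bean_file)
        dates = [e.date for e in entries if isinstance(e, data.Transaction)]
        return max(dates, default=date.min)

    # In the importer, keep only rows newer than what is already booked:
    # cutoff = last_transaction_date("assets-checking.bean")
    # new_rows = [row for row in downloaded_rows if row.date > cutoff]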


On Sun, Jul 30, 2023 at 5:40 PM Martin Blais <[email protected]> wrote:

> One of the things that was never done is to specify deduplicating by
> import source, which would make a lot of sense.
> Some institutions, e.g. Amex, have really nice unique ids on each
> transaction that can be used to dedup exactly (if preserved).  Some don't.
> The heuristics I'm using today are imperfect... this needs improvement.
>
>
> On Tue, Jul 18, 2023 at 6:22 AM Eric Altendorf <[email protected]>
> wrote:
>
>> I see hooks for dup detection on importers, but the doc comments don't
>> make it clear how those functions are used.  Poking around the code, it
>> appears that these are only run to dedup items within a single import.
>>
>> Is there any functionality to automatically match up legs of a
>> transaction that come from different importers, e.g., a transfer from one
>> account to another?
>>
>> thanks,
>> ericc
>>

"""Plugin to remove transactions that exists in ledgers of multiple asset accounts.

Assumptions:
- Internal accounts are assets, liabilities and equity.
- Transactions for each internal account are stored in separate files.
"""


from datetime import timedelta

from beancount.core import data


__plugins__ = ["deduplicate"]


_INTERNAL_ACCOUNT_KINDS = frozenset(["Assets", "Liabilities", "Equity"])


def _is_internal_transaction(entry) -> bool:
    """A transaction is considered internal if at least 2 postings are on internal accounts."""
    if not isinstance(entry, data.Transaction):
        return False
    postings_on_internal_accounts = [
        p for p in entry.postings if p.account.split(":")[0] in _INTERNAL_ACCOUNT_KINDS
    ]
    return len(postings_on_internal_accounts) > 1


def _extracted_from_same_file(t1: data.Transaction, t2: data.Transaction) -> bool:
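    # The importer-assigned "id" metadata (when present) is assumed to begin
    # with the source file name, matching the "<filename>:<lineno>" fallback
    # built in deduplicate() below, so the text before the first ":" identifies
    # the file. The "a:0"/"b:0" sentinels never compare equal, so entries
    # missing both keys count as coming from different files.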
    id1 = t1.meta.get("id", t1.meta.get("filename")) or "a:0"
    id2 = t2.meta.get("id", t2.meta.get("filename")) or "b:0"
    return id1.split(":")[0] == id2.split(":")[0]


def _is_subset_of(t1: data.Transaction, t2: data.Transaction) -> bool:
    """True if every posting of t1 (account, number, currency) also appears in t2."""
    # Compare numbers via a fixed-precision string: Decimal.normalize() did not
    # work here because Beancount sometimes keeps a different internal
    # representation of the same number.
    ps1 = {f"{p.account} {p.units.number:.9f} {p.units.currency}" for p in t1.postings}
    ps2 = {f"{p.account} {p.units.number:.9f} {p.units.currency}" for p in t2.postings}
    return ps1.issubset(ps2)


def _similar_date_but_different_file(
    index, transactions
) -> list[tuple[int, data.Transaction]]:
    """Return (index, transaction) pairs dated within 7 days of transactions[index]
    that were extracted from a different file.

    Assumes `transactions` is sorted by date; already-consumed slots are None.
    """
    result = []
    t1 = transactions[index]
    # Scan backwards until the date gap exceeds the 7-day window.
    for j in range(index - 1, -1, -1):
        t2 = transactions[j]
        if t2 is None or _extracted_from_same_file(t1, t2):
            continue
        if abs(t1.date - t2.date) > timedelta(days=7):
            break
        result.append((j, t2))
    # Scan forwards likewise.
    for j in range(index + 1, len(transactions)):
        t2 = transactions[j]
        if t2 is None or _extracted_from_same_file(t1, t2):
            continue
        if abs(t1.date - t2.date) > timedelta(days=7):
            break
        result.append((j, t2))
    return result


def deduplicate(entries, options_map):
    del options_map
    # `entries` without duplicates.
    unique_entries = []

    # We only look for duplicates among internal transactions. Expense and
    # income accounts are considered external; a transaction is internal if it
    # posts to more than one internal account (assets, liabilities, equity).
    internal_transactions = []
    for entry in entries:
        if _is_internal_transaction(entry):
            internal_transactions.append(entry)
        else:
            unique_entries.append(entry)

    # No explicit sort needed: the loader hands plugins the entries already
    # sorted by date, which the windowed scan below relies on.

    # For each internal transaction, look for similar transactions around it.
    # To be considered duplicates, two transactions must be dated no more than
    # 7 days apart, come from different files, and the postings of one must be
    # a subset of the postings of the other.
    for i, t1 in enumerate(internal_transactions):
        if t1 is None:
            continue
        candidates = _similar_date_but_different_file(i, internal_transactions)
        for j, t2 in candidates:
            if _is_subset_of(t1, t2):
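                # Keep t2, annotated with a pointer back to t1, and drop t1;
                # mark both slots consumed so neither is matched again.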
                filename, lineno = t1.meta["filename"], t1.meta["lineno"]
                t2.meta["duplicate_of"] = t1.meta.get("id", f"{filename}:{lineno}")
                unique_entries.append(t2)
                internal_transactions[i] = None
                internal_transactions[j] = None
                break
    unique_entries.extend([t for t in internal_transactions if t is not None])
    return unique_entries, []
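
To try the plugin, put the module on your PYTHONPATH and enable it from the
top-level ledger with a plugin directive, e.g. plugin "dedup_internal" (that
module name is only for illustration). A kept transaction carries a
duplicate_of metadata entry pointing at the removed one, which you can
inspect after loading ("main.bean" stands in for your top-level file):

    from beancount import loader
    from beancount.core import data

    # Loading the ledger runs all plugin directives; surviving duplicates
    # carry the "duplicate_of" metadata added by deduplicate().
    entries, errors, options_map = loader.load_file("main.bean")
    kept = [e for e in entries
            if isinstance(e, data.Transaction) and "duplicate_of" in e.meta]
    print(f"{len(kept)} transactions absorbed a duplicate")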
