I'm surprised this hasn't gotten more attention, because in my experience 
this completely broke smart_importer for me. I couldn't get any matches 
from the training at all, until I dug around and found the same line you 
did. I still have it "fixed" locally via a direct edit to the file in my 
venv, which is totally the proper way to patch things right? 

Great PR, you did more investigating into the intent of the original code 
than I was able to. Would love to see it merged.
On Wednesday, March 5, 2025 at 11:06:10 PM UTC-8 Justus Pendleton wrote:

> I'm increasingly convinced this is a bug and submitted a PR to address it.
>
> https://github.com/beancount/beangulp/pull/159
>
> The deduplicate logic merges the newly imported & deduped entries into the 
> existing entries because it looks at the existing entries to decide what to 
> deduplicate. But handling it this way pollutes the existing_entries for any 
> hooks that get run later. As one other example of the problem: if you run 
> ML training on "existing_data" it is wrong (with varying degrees of 
> wrongness) because your newly imported entries (which btw only have a 
> single leg so aren't fully valid beancount yet) have been merged into the 
> training data.
>
> I think the right way to handle this is just track existing_entries and 
> "newly imported and deduped entries" separately for purposes of 
> deduplication.
>
> Cheers,
> Justus
>
> On Monday, March 3, 2025 at 2:31:20 PM UTC+10:30 Justus Pendleton wrote:
>
>> According to examples/import.py hooks take two parameters
>>
>> Args:
>>     extracted_entries_list: A list of (filename, entries) pairs, where
>> 'entries' are the directives extract from 'filename'.
>>      ledger_entries: If provided, a list of directives from the existing
>> ledger of the user. This is non-None if the user provided their
>> ledger file as an option.
>>
>> Returns:
>> A possibly different version of extracted_entries_list, a list of
>> (filename, entries), to be printed.
>>
>> But "ledger_entries" is never None -- that is, it is non-None even if a 
>> user didn't provide their ledger file as an option.
>>
>> This is because of the deduplicate logic in __init__.py/_extract which 
>> extends existing entries before calling hooks
>>
>>     # Deduplicate.
>>     for filename, entries, account, importer in extracted:
>>         importer.deduplicate(entries, existing_entries)
>>         existing_entries.extend(entries)
>>
>>     # Invoke hooks.
>>     for func in ctx.hooks:
>>         extracted = func(extracted, existing_entries)
>>
>> This was somewhat surprising to me (especially since it was contrary to 
>> the quasi-documentation/comment I quoted) as I wouldn't expect (or want) to 
>> have the newly imported entries merged into the existing entries before 
>> hooks have run.
>>
>> Is this intended? Is there another easy way to get the pristine set of 
>> entries from within a hook short of just running beancount.loader myself?
>>
>> My actual use case: I want to know the most recent Balance statement of 
>> an account in the ledger, which I am using as a proxy for "last imported 
>> date". But the most recent Balance statement I actually find will the one 
>> auto-generated by beangulp and then merged into existing_entries.
>>
>> Cheers,
>> Justus
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/beancount/0aa5588b-97d4-49a1-95fc-dba596a0db7cn%40googlegroups.com.

Reply via email to