On Sun, Aug 7, 2016 at 12:36 PM, Simon Michael <[email protected]> wrote:

> Using a checksum for deduplication won't handle identical CSV records
> well, right ? Those are unlikely with our usual banks but I think quite
> possible if you consider CSV data generally.
>
> Here's my recent plan for hledger.


Why reinvent the wheel?
Write a custom HLedger printer (very simple) and reuse all the Beancount
tools for importing.
See also:
https://bitbucket.org/blais/beancount/src/093c1ca595409bd86782f48a9abd915945484fe6/src/python/beancount/reports/convert_reports.py


Let reading CSV files work as it currently does, but add a separate import
> command which does some additional things: append the converted entries to
> the main journal file, and save a position marker in CSVFILE.lastimport.
> Also when reading, if the marker file is found, skip any CSV records before
> the marked position.
>
> This much is generic and could be used with any data format, but I think
> it makes sense mainly for "import" data which you want to move into a main
> journal, and which is somewhat sequential, eg CSV, OFX, QIF. It would be
> harder and is probably not needed for eg journal or timeclock files.
>
> For CSV specifically, I'm thinking the position marker will be the last
> CSV record processed. It could be elaborated later to consider timestamps,
> checksums etc. if needed.
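The marker-file scheme described above could be sketched roughly as follows. This is an illustration only, not hledger code (hledger is Haskell); the `.lastimport` file name follows the plan, but the function names and the plain-text record handling are assumptions:

```python
# Sketch of marker-based import: append new CSV records to a journal,
# remembering the last record processed in CSVFILE.lastimport.
import os

def read_marker(csv_path):
    """Return the last-imported CSV record, or None if no marker exists."""
    marker_path = csv_path + ".lastimport"
    if os.path.exists(marker_path):
        with open(marker_path) as f:
            return f.read().rstrip("\n")
    return None

def import_new_records(csv_path, journal_path, convert):
    """Append converted entries for records after the marker, then advance it."""
    marker = read_marker(csv_path)
    with open(csv_path) as f:
        records = [line.rstrip("\n") for line in f if line.strip()]
    # Skip everything up to and including the marked record.
    start = 0
    if marker is not None and marker in records:
        start = records.index(marker) + 1
    new = records[start:]
    if new:
        with open(journal_path, "a") as j:
            for rec in new:
                j.write(convert(rec) + "\n")
        # The marker is simply the last CSV record processed.
        with open(csv_path + ".lastimport", "w") as m:
            m.write(new[-1] + "\n")
    return len(new)
```

Running this twice on the same file imports nothing the second time, which is the point: the marker makes the import idempotent as long as the bank only appends records.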
>
>
> On 8/6/16 10:19 PM, Erik Hetzner wrote:
>
>> Hi Martin,
>>
>> On Sat, 06 Aug 2016 21:16:40 -0700,
>> Martin Blais <[email protected]> wrote:
>>
>>> Storing a checksum for the imported row suffers from the problem that if
>>> the user does not immediately copy the result of our conversion, it will
>>> not be imported further, it could get lost.
>>>
>>> Beancount cross-checks extracted transactions against the contents of its
>>> destination ledger, but because the user often massages the transactions,
>>> it has to use heuristics to perform an approximate match and determine
>>> which transactions have already been seen. The heuristic I have in place
>>> doesn't work too well at the moment (but, to be honest, it could easily be
>>> improved).
>>>
>>> A better idea would be to store a unique tag computed from the checksum of
>>> the input row and to cross-check the imported transactions against that
>>> special tag. That uses both your insight around validating the input
>>> instead of the resulting transaction, and uses the ledger instead of a
>>> temporary cache. It's the best of both worlds.
>>>
>>
>> Thanks for the comment. I’m not sure I understand the distinction that you
>> are making here. What I do (and I admit I only thought it through for a few
>> minutes, as I don’t actually use Mint but just wanted a simple CSV format
>> for examples) is:
>>
>> 1. Take the input key-value pairs for the row, e.g. Date=2016/01/10.
>> 2. Sort by key.
>> 3. Generate a string from the key-value pairs and calculate its MD5
>>    checksum.
>> 4. Check the checksum against a metadata value in ledger:
>>    a. If the row has already been imported, do nothing.
>>    b. If the row is new (no match), import it.
>>
>> Here is an example of a generated ledger transaction:
>>
>> 2016/08/02 Amazon
>>     ; csvid: mint.a7c028a73d76956453dab634e8e5bdc1
>>     1234                                      $29.99
>>     Expenses:Shopping                        -$29.99
>>
>> As you can see, the csvid metadata field is what we query against, using
>> ledger, to see if the transaction is already present.
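Steps 1–3 of the scheme above amount to something like the following. The exact canonical string format (here `key=value` pairs joined by commas) is an assumption; only the sorted-keys-then-MD5 idea comes from the description:

```python
# Stable deduplication id for a CSV row: sort the keys so that column
# order in the file does not change the checksum, then hash.
import hashlib

def row_checksum(row):
    """Compute an MD5 hex digest over the row's sorted key-value pairs."""
    canonical = ",".join("%s=%s" % (k, row[k]) for k in sorted(row))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

A `csvid` like the one in the example transaction would then be the importer name plus this digest, e.g. `"mint." + row_checksum(row)`.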
>>
>>> Similarly, Beancount has a powerful but admittedly immature CSV importer
>>> growing:
>>> https://bitbucket.org/blais/beancount/src/9f3377eb58fe9ec8cfea8d9e3d56f2446d05592f/src/python/beancount/ingest/importers/csv.py
>>>
>>> I've switched to using this and CSV file formats whenever I have them
>>> available - banking, credit cards, 401k.
>>>
>>> I'd like to make a routine to try to auto-detect the columns eventually;
>>> at the moment, they must be configured when creating the importer
>>> configuration.
>>>
>>
>> Thanks for the pointer. It does look a lot more flexible than my
>> implementation.
>>
>> I decided it was simpler, for my needs, to require a new class for each
>> type of CSV file. It was too much trouble to try to make it configurable.
>> The core code handles reading the CSV file, deduplicating, and all of that.
>> The CSV class simply implements a `convert(row)` method which returns a
>> `Transaction` data structure. I hope that if others need to import a
>> particular type of CSV file, e.g. from their bank, they can contribute that
>> back to the project.
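The class-per-CSV-type design described above might look something like this. The `Transaction` shape, the class name, and the column names are all invented for illustration; only the idea of a per-bank class exposing `convert(row)` comes from the description:

```python
# One small class per CSV layout; the core framework handles reading
# the file and deduplicating, and calls convert() on each row dict.
from collections import namedtuple

Transaction = namedtuple("Transaction", ["date", "payee", "amount"])

class MyBankImporter:
    """Hypothetical importer for one bank's CSV layout."""

    def convert(self, row):
        # row is a dict mapping column names to string values for one record.
        return Transaction(
            date=row["Date"],
            payee=row["Description"],
            amount=float(row["Amount"]),
        )
```

Contributing support for a new bank then means adding one such class, with no changes to the core reading/dedup code.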
>>
>> best, Erik
>> --
>> Sent from my free software system <http://fsf.org/>.
>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Beancount" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/beancount/CAK21%2BhNZ7yPWpa00B7wQ7w03zPP2UKbYumV%3DsB93M_sNatHcZQ%40mail.gmail.com.