Hi Andy,
This thread has been sitting in my inbox for a while, waiting for me to
reply with the following. (I was hoping to get the project ready before
sending the link below, but I've been too busy to finish it yet.)

I'm in the process of forking my Beancount import code out into a new
project, which will be called "ledgerhub," and which will be free of
Beancount code dependencies, i.e. it should work for Beancount, Ledger and
any other similar implementation.

Here is the design doc for it:

Design Doc for LedgerHub
https://docs.google.com/document/d/11u1sWv7H7Ykbc7ayS4M9V3yKqcuTY7LJ3n1tgnEN2Hk/edit?usp=sharing

Please (anyone) feel free to comment in the margins (right-click ->
Comment...).

More comments below.


On Thu, Feb 27, 2014 at 11:16 AM, AMaffei <[email protected]> wrote:

> Martin,
>
> I really like the idea of a staged system, perhaps with a set of programs
> and drivers (see below).
>
> I'd be interested in helping with a project along these lines.
> Unfortunately my programming skills are rusty, but I work with a colleague
> who might help out.
>
> My own processing approach is similar to yours. Apologies for length and
> detail level. I have not looked at Reckon in detail yet so perhaps some of
> these ideas are already employed in other manners. My comments on each
> stage (and one of my own added) below...
>
> --Andy
>
>
> On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
>>
>> Thinking about this more, there's the potential for a nice big project
>> independent of all our Ledger implementations, to deal with external data.
>> Here's the idea, five components of a single project:
>>
> - thanks for dissecting things so nicely.
>

I've added more detail in the design doc.



>
>
>> - "Fetching": code that can automatically obtain the data by connecting
>> to various data sources. The ledger-autosync attempts to do this using
>> ofxclient for institutions that support OFX. This could include a scraping
>> component for other institutions.
>>
> - the output of this stage would be a number of files of different formats
> -- OFX, a spectrum of CSV file formats, and others.
>

Yes.



>> - "Recognition": given a filename and its contents, automatically guess
>> which institution and account it is for. Beancount's import package deals
>> with this by allowing the user to specify a list of regexps that the file
>> must match. I'm not entirely sure this can always be done irrespective of
>> the user, as the account-id is often a required part of a regexp, but it
>> might. This is used to automate "figuring out what to do" given a bunch of
>> downloaded files in a directory, a great convenience.  There is some code
>> in ledger-autosync and the beancount.sources Python package.
>>
> - I really like the approach CSV2Ledger takes with its FileMatches.yaml (
> https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml)
> file.  I think defining a spec for FileMatches.yaml that Perl, Python, or
> other code could employ in the following stages might be worthwhile.
> FileMatches.yaml (or the equivalent) would provide key information for
> future processing stages of files from different sources. For CSV files,
> information about field separators, field names, a regex for "real"
> records, etc. can be specified here. The result of "Recognition" would be
> to pass the file off to a customized driver (see my next comment).
>

My approach is similar to this, with the regexps; see the example code bits
in the Identification section of my document, or the example importer file
in my source code:
https://hg.furius.ca/public/beancount/file/a91a44f466a1/examples/importing/importing.import
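Roughly how I picture that identification step working (the importer names
and patterns below are invented for illustration, not taken from my actual
config):

```python
import re

# Sketch of regexp-based identification (hypothetical importer names and
# patterns). The user's import config associates each importer with
# regexps that the filename or contents must match.
IMPORT_CONFIG = [
    ('mybank-checking', [r'OFXHEADER', r'ACCTID.*1234']),
    ('amex-csv',        [r'Date,Description,Amount']),
]

def identify(filename, contents):
    """Return the name of the first importer whose regexps all match."""
    text = filename + '\n' + contents
    for name, patterns in IMPORT_CONFIG:
        if all(re.search(pattern, text) for pattern in patterns):
            return name
    return None
```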

One could imagine creating instances of a more generic "CSV importer" that
takes a configuration mapping source columns to fields. In my experience,
each source has peculiarities beyond this and requires custom code, so
that's the approach I've taken so far, but nothing would prevent the
inclusion of such an importer in the system I propose.
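For what it's worth, here is a minimal sketch of what such a configurable
CSV importer could look like (class name, field names and accounts are all
made up):

```python
import csv
import io

# Sketch of a generic, configurable CSV importer (hypothetical names):
# the field map says which source column feeds each normalized field,
# and the account name comes from the user's configuration.
class GenericCSVImporter:
    def __init__(self, account, field_map):
        self.account = account        # e.g. 'Assets:Checking'
        self.field_map = field_map    # e.g. {'date': 'Posted Date', ...}

    def extract(self, contents):
        """Yield one normalized dict per CSV row."""
        for row in csv.DictReader(io.StringIO(contents)):
            txn = {field: row[column]
                   for field, column in self.field_map.items()}
            txn['account'] = self.account
            yield txn

# One instance per institution, configured rather than hand-coded.
importer = GenericCSVImporter(
    'Assets:Checking',
    {'date': 'Posted Date', 'payee': 'Description', 'amount': 'Amount'})
```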



>> - "Extraction": parse the file, CSV or OFX or otherwise, and extract a list
>> of double-entry transactions data structures from it in some sort of
>> generic internal format, independent of Ledger / HLedger / Beancount /
>> other.  The Reckon project aims to do this for CSV files.
>>
> - I suggest employing small driver programs, written by others, that
> ingest custom formats. The path to the appropriate driver program would be
> included in the FileMatches.yaml file (or its equivalent).  These drivers
> would ingest files output by the "Fetching" stage and generate the
> "generic internal format" you mention. However, in support of flexibility
> I suggest that the result of this stage be a CSV file, in a strictly
> specified format, to be processed by the next stage.
>

Ha... why do you like CSV for an internal data format? I was thinking that
this data structure wouldn't even go to a file; it could just be some
Python tuples/namedtuples.  We could indeed define an intermediate format,
but I can't really see when that would be needed.  Ideally, no edits by
the user would be necessary at this stage, so I don't see a need to write
it out to a file.
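To make that concrete, here is the kind of in-memory structure I have in
mind (the field names are illustrative only, not the actual Beancount
ones):

```python
import datetime
from collections import namedtuple
from decimal import Decimal

# Sketch of an in-memory intermediate format: plain namedtuples handed
# from the extraction stage to the export stage, never written to disk.
# Field names here are hypothetical.
Posting = namedtuple('Posting', 'account amount currency')
Transaction = namedtuple('Transaction', 'date payee postings')

txn = Transaction(
    date=datetime.date(2014, 2, 27),
    payee='ACME Corp',
    postings=[Posting('Assets:Checking', Decimal('-25.00'), 'USD'),
              Posting('Expenses:Office', Decimal('25.00'), 'USD')])
```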

About programs: there are really only two separate programs/steps I use so
far: 1. "import", which generates text that I append to my ledger file; 2.
"file", which moves the files into a directory hierarchy. Each of these
programs combines many of the steps described in the document, and I
haven't found a need to separate them further so far, except for debugging.
 Do you need them all separated?  That could be done.



> - I add an additional stage here I'll call "AccountAssignment". I examine
> several fields of the imported record (things like employeeID, PONumber,
> etc. that are associated with the transaction) to determine which DEB
> account name to assign it to. Account names for all DEB systems should be
> hierarchical so that could still be done in a DEB-software-agnostic manner.
> A more sophisticated version of CSV2Ledger's PreProcess.yaml (
> https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml) could
> help drive this stage. The output of this stage is the same CSV as above
> with a "DEBAccount" field appended to each record.
>

I do the same thing :-)  The way I weave this into my importers is that
each importer in the configuration file defines a dictionary of required
configuration variables, almost all of which are account names. When the
importer creates the normalized transaction objects, it uses the account
names from its configuration.
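In code, the idea looks something like this (the class and configuration
key names are made up for illustration):

```python
# Sketch of an importer declaring its required configuration (all names
# hypothetical). The driver validates the user's config up front, and
# the importer then reads account names from it when building the
# normalized transactions.
class BankCSVImporter:
    REQUIRED_CONFIG = ('asset_account', 'fees_account')

    def __init__(self, config):
        missing = [key for key in self.REQUIRED_CONFIG
                   if key not in config]
        if missing:
            raise KeyError('Missing configuration: {}'.format(
                ', '.join(missing)))
        self.config = config

importer = BankCSVImporter({'asset_account': 'Assets:Checking',
                            'fees_account': 'Expenses:Fees'})
```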




>
>> - "Export": convert the internal transactions data structure to the
>> syntax of one particular double-entry language implementation, Ledger or
>> other. This spits out text.
>>
> - I once again like the approach of CSV2Ledger.pl (see source code at
> https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138).
> It allows for the FileMatches.yaml file to include a variable called
> TxnOutputTemplate that specifies how to setup the ledger-cli transaction in
> your journal file. A similar templating approach could be used for other
> double-entry language file formats.
>

That's an interesting idea, and it would be flexible, especially for
Ledger output, so that will be done.

For Beancount target output, the code base already has functions to render
transactions back into its current text input format, and I had planned to
use those. The new Beancount input syntax is a lot less flexible, so it's
not as necessary to provide options for it.  I'll add this to the design
doc.
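As a sketch of the templating idea, loosely in the spirit of CSV2Ledger's
TxnOutputTemplate (template and field names here are invented):

```python
# Sketch of template-driven export (hypothetical template and fields):
# the template string would live in user configuration, so the same
# internal transaction could target Ledger, HLedger, or other syntaxes.
LEDGER_TEMPLATE = ('{date} {payee}\n'
                   '    {debit_account:<40}{amount:>10} {currency}\n'
                   '    {credit_account}\n')

txn = {'date': '2014/02/27', 'payee': 'ACME Corp',
       'debit_account': 'Expenses:Office', 'amount': '25.00',
       'currency': 'USD', 'credit_account': 'Assets:Checking'}

print(LEDGER_TEMPLATE.format(**txn))
```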

Thanks for your comments.
Please leave more on the doc.

I'll try to fork out my current import code as soon as possible so others
can contribute.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Ledger" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
