On Mon, Mar 3, 2014 at 10:35 AM, Edwin van Leeuwen <[email protected]> wrote:
> Just a thought on the internal format of a library: I would be tempted
> to use OFX as an internal format and then from there to
> ledger/beancount format. This is because OFX is a well defined format,
> so should hold any kind of financial data without problems. This will
> also make it easier for other tools to adopt, because they might
> already have an OFX import function.

I don't think it's a good idea. OFX is really messy, and appears to
allow for a lot more interpretation than one would like. Very soon in
ledgerhub I'm going to check in examples of OFX statements from various
institutions and you can have a look for yourself. (I'm trying to
extract the code from Beancount at the moment, but I want to do this
right, so it'll take another week or two for the codebase to come up.)

> Kind regards,
>
> Edwin
>
>
> On 27 February 2014 19:00, AMaffei <[email protected]> wrote:
> > Thanks Martin.
> >
> > One thing I'll comment on here is my preference for a CSV file
> > (instead of the creation of an internal data structure) as the
> > output of the "Extraction" phase. My intent is to make the system
> > more scalable. Edwin is currently collecting lots of different CSV
> > files generated from many different sources and incorporating their
> > translation into Rekon. His efforts can only scale so far. I'm
> > constantly amazed at the format and content of the CSV exports I
> > run into.
> >
> > A company that generates a custom CSV, or a third party, might
> > someday provide a service and/or code (in whatever language they
> > prefer) to translate their custom-CSV format into a
> > Ledger-Hub-compatible CSV that would be ingested into the later
> > stages of Ledger-Hub.
> >
> > I'll see if I can come up with a draft spec for such a CSV after I
> > read your Google Doc and comment on it.

How nice to see that! Thanks.
> >
> > --
> > Andy
> >
> > On Thursday, February 27, 2014 11:45:50 AM UTC-5, Martin Blais wrote:
> >>
> >> Hi Andy,
> >> This thread has been sitting in my inbox for a while, waiting for
> >> me to reply with the following. (I was hoping to get the project
> >> ready before sending the link below, but I've been too busy to get
> >> it done by now.)
> >>
> >> I'm in the process of forking out my Beancount import code into a
> >> new project, which will be called "ledgerhub," and which will be
> >> free of Beancount code dependencies, i.e. it should work for it,
> >> Ledger and any other similar implementations.
> >>
> >> Here is the design doc for it:
> >>
> >> Design Doc for LedgerHub
> >> https://docs.google.com/document/d/11u1sWv7H7Ykbc7ayS4M9V3yKqcuTY7LJ3n1tgnEN2Hk/edit?usp=sharing
> >>
> >> Please (anyone) feel free to comment in the margins (right-click ->
> >> Comment...).
> >>
> >> More comments below.
> >>
> >>
> >> On Thu, Feb 27, 2014 at 11:16 AM, AMaffei <[email protected]> wrote:
> >>>
> >>> Martin,
> >>>
> >>> I really like the idea of a staged system, perhaps with a set of
> >>> programs and drivers (see below).
> >>>
> >>> I'd be interested in helping with a project along these lines.
> >>> Unfortunately my programming skills are rusty, but I work with a
> >>> colleague who might help out.
> >>>
> >>> My own processing approach is similar to yours. Apologies for the
> >>> length and level of detail. I have not looked at Rekon in detail
> >>> yet, so perhaps some of these ideas are already employed in other
> >>> manners. My comments on each stage (and one of my own added) are
> >>> below...
> >>>
> >>> --Andy
> >>>
> >>> On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
> >>>>
> >>>> Thinking about this more, there's the potential for a nice big
> >>>> project independent of all our Ledger implementations, to deal
> >>>> with external data.
> >>>> Here's the idea, five components of a single project:
> >>>
> >>> - Thanks for dissecting things so nicely.
> >>
> >> I've added more detail in the design doc.
> >>
> >>>> - "Fetching": code that can automatically obtain the data by
> >>>> connecting to various data sources. The ledger-autosync project
> >>>> attempts to do this using ofxclient for institutions that support
> >>>> OFX. This could include a scraping component for other
> >>>> institutions.
> >>>
> >>> - The output of this stage would be a number of files of different
> >>> formats -- OFX, a spectrum of CSV file formats, and others.
> >>
> >> Yes.
> >>
> >>>> - "Recognition": given a filename and its contents, automatically
> >>>> guess which institution and account it is for. Beancount's import
> >>>> package deals with this by allowing the user to specify a list of
> >>>> regexps that the file must match. I'm not entirely sure this can
> >>>> always be done irrespective of the user, as the account-id is
> >>>> often a required part of a regexp, but it might. This is used to
> >>>> automate "figuring out what to do" given a bunch of downloaded
> >>>> files in a directory, a great convenience. There is some code in
> >>>> ledger-autosync and the beancount.sources Python package.
> >>>
> >>> - I really like the approach CSV2Ledger takes with its
> >>> FileMatches.yaml file
> >>> (https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml).
> >>> I think defining a spec for FileMatches.yaml that Perl, Python, or
> >>> whatever other code could employ for the following stages might be
> >>> worthwhile. FileMatches.yaml (or the equivalent) would provide key
> >>> information for future processing stages of files from different
> >>> sources. For CSV files, information about field separators, field
> >>> names, a regex for "real" records, etc. could be specified here.
> >>> The result of "Recognition" would be to pass the file off to a
> >>> customized driver (see my next comment).
> >>
> >> My approach is similar to this, with the regexps; see the example
> >> code bits in the Identification section of my document, or the
> >> example importer file in my source code:
> >>
> >> https://hg.furius.ca/public/beancount/file/a91a44f466a1/examples/importing/importing.import
> >>
> >> One could imagine creating instances of a more generic "CSV
> >> importer" that could take as its configuration which field maps to
> >> what. In my experience, each source has peculiarities beyond this
> >> and requires custom code, so that's the approach I've taken so far,
> >> but nothing would prevent the inclusion of such an importer in the
> >> system I propose.
> >>
> >>>> - "Extraction": parse the file, CSV or OFX or otherwise, and
> >>>> extract a list of double-entry transaction data structures from
> >>>> it in some sort of generic internal format, independent of
> >>>> Ledger / HLedger / Beancount / other. The Reckon project aims to
> >>>> do this for CSV files.
> >>>
> >>> - I suggest employing small driver programs, written by others,
> >>> that ingest custom formats. The path to the appropriate driver
> >>> program would be included in the FileMatches.yaml file (or its
> >>> equivalent). These drivers would ingest files output by the
> >>> "Fetching" stage and generate the "generic internal format" you
> >>> mention. However, in support of flexibility, I suggest that the
> >>> result of this stage be a CSV file, whose format we strictly
> >>> specify, that would be processed by the next stage.
> >>
> >> Ha... why do you like CSV as an internal data format? I was
> >> thinking that this data structure wouldn't even go to a file; it
> >> could just be some Python tuples/namedtuples. We could indeed
> >> define an intermediate format, but I can't really see when that
> >> would be needed.
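[A sketch of the generic-CSV-importer idea discussed above: a factory that takes a per-institution field map and emits namedtuple records as the internal format. All names here (Txn, make_csv_importer, the column names) are hypothetical, not ledgerhub or CSV2Ledger code.]

```python
import csv
import io
from collections import namedtuple

# Hypothetical generic internal record, in the spirit of the
# "Python tuples/namedtuples" internal format mentioned above.
Txn = namedtuple("Txn", "date payee amount")

def make_csv_importer(field_map, **reader_opts):
    """Build an importer for one institution's CSV dialect.

    field_map maps that institution's column names to Txn fields,
    e.g. {"Posted Date": "date", "Description": "payee",
    "Amount": "amount"}.  reader_opts passes dialect options
    (delimiter, etc.) through to csv.DictReader.
    """
    def importer(fileobj):
        for row in csv.DictReader(fileobj, **reader_opts):
            yield Txn(**{dest: row[src] for src, dest in field_map.items()})
    return importer

# Usage with an invented bank export:
importer = make_csv_importer(
    {"Posted Date": "date", "Description": "payee", "Amount": "amount"})
data = io.StringIO(
    "Posted Date,Description,Amount\n2014-03-01,ACME CORP,-25.00\n")
print(list(importer(data)))
# [Txn(date='2014-03-01', payee='ACME CORP', amount='-25.00')]
```

[As noted in the thread, real sources need custom code beyond a field map (date parsing, sign flips, multi-line descriptions), which is why per-institution importers remain the default.]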
> >> Ideally, no edits at this stage would be necessary by the user, so
> >> I don't see a need to output that to a file.
> >>
> >> About programs: there are really only two separate programs/steps
> >> I use so far: 1. "import", which generates text that I append to
> >> my ledger file; 2. "file", which moves the files into a directory
> >> hierarchy. Both of these programs put together many of the steps
> >> described in the document, and I haven't found a need to separate
> >> them so much so far, except for debugging. Do you need them all
> >> separated? That could be done.
> >>
> >>> - I add an additional stage here that I'll call
> >>> "AccountAssignment". I examine several fields of the imported
> >>> record (things like employeeID, PONumber, etc. that are
> >>> associated with the transaction) to determine which DEB account
> >>> name to assign it to. Account names for all DEB systems should be
> >>> hierarchical, so that could still be done in a
> >>> DEB-software-agnostic manner. A more sophisticated version of
> >>> CSV2Ledger's PreProcess.yaml
> >>> (https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml)
> >>> could help drive this stage. The output of this stage is the same
> >>> CSV as above with a "DEBAccount" field appended to each record.
> >>
> >> I do the same thing :-) The way I weave this into my importers is
> >> that each importer in the configuration file defines a dictionary
> >> of required configuration variables, almost all of which are
> >> account names. When the importer creates the normalized
> >> transaction objects, it uses the account names from its
> >> configuration.
> >>
> >>>> - "Export": convert the internal transactions data structure to
> >>>> the syntax of one particular double-entry language
> >>>> implementation, Ledger or other. This spits out text.
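[The "AccountAssignment" stage described above could be sketched as an ordered list of (predicate, account) rules tried against the imported record's fields. The rule shapes, field values and account names below are all invented for illustration; a real version would load rules from something like PreProcess.yaml.]

```python
# Ordered rules: the first predicate that matches a record wins.
# Field values and account names here are hypothetical.
RULES = [
    (lambda r: r.get("PONumber", "").startswith("PO-LAB"),
     "Expenses:Lab:Supplies"),
    (lambda r: r.get("employeeID") == "E042",
     "Expenses:Travel:Andy"),
]

def assign_account(record, default="Expenses:Uncategorized"):
    """Return the first matching hierarchical account name for a record.

    Because the result is just a colon-separated hierarchical name,
    this stage stays agnostic of the downstream DEB software.
    """
    for predicate, account in RULES:
        if predicate(record):
            return account
    return default

print(assign_account({"PONumber": "PO-LAB-991", "employeeID": "E007"}))
# Expenses:Lab:Supplies
```

[This is essentially the same idea as the importer-configuration dictionaries of account names mentioned in the reply, just driven by per-record fields instead of per-source configuration.]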
> >>>
> >>> - I once again like the approach of CSV2Ledger.pl (see the source
> >>> code at
> >>> https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138).
> >>> It allows the FileMatches.yaml file to include a variable called
> >>> TxnOutputTemplate that specifies how to set up the ledger-cli
> >>> transaction in your journal file. A similar templating approach
> >>> could be used for other double-entry language file formats.
> >>
> >> That's an interesting idea. That will be done; it would be
> >> flexible, especially for Ledger output.
> >>
> >> For Beancount target output, the code base has functions to
> >> convert transactions back into text in its current text input
> >> format, and those can change; I had planned to use those. The new
> >> Beancount input syntax is a lot less flexible, so it's not as
> >> necessary to provide options for it. I'll add this to the design
> >> doc.
> >>
> >> Thanks for your comments.
> >> Please leave more on the doc.
> >>
> >> I'll try to fork out my current import code as soon as possible so
> >> others can contribute.

--

---
You received this message because you are subscribed to the Google Groups
"Ledger" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
