On Mon, Mar 3, 2014 at 10:35 AM, Edwin van Leeuwen <[email protected]> wrote:
> Just a thought on the internal format of a library: I would be tempted
> to use OFX as an internal format and then from there to
> ledger/beancount format. This is because OFX is a well defined format,
> so should hold any kind of financial data without problems. This will
> also make it easier for other tools to adopt, because they might
> already have an OFX import function.

I don't think it's a good idea. OFX is really messy, and appears to
allow for a lot more interpretation than one would like. Very soon in
ledgerhub I'm going to check in examples of OFX statements from various
institutions and you can have a look for yourself. (I'm trying to
extract the code from Beancount at the moment, but I want to do this
right, so it'll take another week or two for the codebase to come up.)

> Kind regards,
>
> Edwin
>
>
> On 27 February 2014 19:00, AMaffei <[email protected]> wrote:
> > Thanks Martin.
> >
> > One thing I'll comment on here is my preference for a CSV file
> > (instead of the creation of an internal data structure) as the
> > output of the "Extraction" phase. My intent is to make the system
> > more scalable. Edwin is currently collecting lots of different CSV
> > files generated from many different sources and incorporating their
> > translation into Rekon. His efforts can only scale so far. I'm
> > constantly amazed at the format and content of the CSV exports I
> > run into.
> >
> > A company that generates a custom CSV, or a third party, might
> > someday provide a service and/or code (in whatever language they
> > prefer) to translate their custom-CSV format into a
> > Ledger-Hub-compatible CSV that would be ingested into the later
> > stages of Ledger-Hub.
> >
> > I'll see if I can come up with a draft spec for such a CSV after I
> > read your Google Doc and comment on it.

How nice to see that! Thanks.
> >
> > --
> > Andy
> >
> > On Thursday, February 27, 2014 11:45:50 AM UTC-5, Martin Blais wrote:
> >>
> >> Hi Andy,
> >> This thread has been sitting in my inbox for a while, waiting for
> >> me to reply with the following. (I was hoping to get the project
> >> ready before sending the link below, but I've been too busy to get
> >> it done by now.)
> >>
> >> I'm in the process of forking out my Beancount import code into a
> >> new project, which will be called "ledgerhub," and which will be
> >> free of Beancount code dependencies, i.e. it should work for it,
> >> Ledger and any other similar implementations.
> >>
> >> Here is the design doc for it:
> >>
> >> Design Doc for LedgerHub
> >> https://docs.google.com/document/d/11u1sWv7H7Ykbc7ayS4M9V3yKqcuTY7LJ3n1tgnEN2Hk/edit?usp=sharing
> >>
> >> Please (anyone) feel free to comment in the margins (right-click ->
> >> Comment...).
> >>
> >> More comments below.
> >>
> >>
> >> On Thu, Feb 27, 2014 at 11:16 AM, AMaffei <[email protected]> wrote:
> >>>
> >>> Martin,
> >>>
> >>> I really like the idea of a staged system, perhaps with a set of
> >>> programs and drivers (see below).
> >>>
> >>> I'd be interested in helping with a project along these lines.
> >>> Unfortunately my programming skills are rusty, but I work with a
> >>> colleague who might help out.
> >>>
> >>> My own processing approach is similar to yours. Apologies for the
> >>> length and level of detail. I have not looked at Rekon in detail
> >>> yet, so perhaps some of these ideas are already employed in other
> >>> manners. My comments on each stage (and one of my own added) are
> >>> below...
> >>>
> >>> --Andy
> >>>
> >>> On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
> >>>>
> >>>> Thinking about this more, there's the potential for a nice big
> >>>> project independent of all our Ledger implementations, to deal
> >>>> with external data.
> >>>> Here's the idea, five components of a single project:
> >>>
> >>> - Thanks for dissecting things so nicely.
> >>
> >> I've added more detail in the design doc.
> >>
> >>>> - "Fetching": code that can automatically obtain the data by
> >>>> connecting to various data sources. The ledger-autosync project
> >>>> attempts to do this using ofxclient for institutions that support
> >>>> OFX. This could include a scraping component for other
> >>>> institutions.
> >>>
> >>> - The output of this stage would be a number of files of different
> >>> formats -- OFX, a spectrum of CSV file formats, and others.
> >>
> >> Yes.
> >>
> >>>> - "Recognition": given a filename and its contents, automatically
> >>>> guess which institution and account it is for. Beancount's import
> >>>> package deals with this by allowing the user to specify a list of
> >>>> regexps that the file must match. I'm not entirely sure this can
> >>>> always be done irrespective of the user, as the account-id is
> >>>> often a required part of a regexp, but it might. This is used to
> >>>> automate "figuring out what to do" given a bunch of downloaded
> >>>> files in a directory, a great convenience. There is some code in
> >>>> ledger-autosync and the beancount.sources Python package.
> >>>
> >>> - I really like the approach CSV2Ledger takes with its
> >>> FileMatches.yaml file
> >>> (https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml).
> >>> I think defining a spec for FileMatches.yaml that Perl, Python, or
> >>> whatever other code could employ for the following stages might be
> >>> worthwhile. FileMatches.yaml (or the equivalent) would provide key
> >>> information for future processing stages of files from different
> >>> sources. For CSV files, information about field separators, field
> >>> names, a regex for "real" records, etc. could be specified here.
> >>> The result of "Recognition" would be to pass the file off to a
> >>> customized driver (see my next comment).
> >>
> >> My approach is similar to this, with the regexps; see the example
> >> code bits in the Identification section of my document, or the
> >> example importer file in my source code:
> >>
> >> https://hg.furius.ca/public/beancount/file/a91a44f466a1/examples/importing/importing.import
> >>
> >> One could imagine creating instances of a more generic "CSV
> >> importer" that could take as its configuration which field maps to
> >> what. In my experience, each source has peculiarities beyond this
> >> and requires custom code, so that's the approach I've taken so far,
> >> but nothing would prevent the inclusion of such an importer in the
> >> system I propose.
> >>
> >>>> - "Extraction": parse the file, CSV or OFX or otherwise, and
> >>>> extract a list of double-entry transaction data structures from
> >>>> it in some sort of generic internal format, independent of
> >>>> Ledger / HLedger / Beancount / other. The Reckon project aims to
> >>>> do this for CSV files.
> >>>
> >>> - I suggest employing small driver programs, written by others,
> >>> that ingest custom formats. The path to the appropriate driver
> >>> program would be included in the FileMatches.yaml file (or its
> >>> equivalent). These drivers would ingest files output by the
> >>> "Fetching" stage and generate the "generic internal format" you
> >>> mention. However, in support of flexibility, I suggest that the
> >>> result of this stage be a CSV file, whose format we strictly
> >>> specify, that would be processed by the next stage.
> >>
> >> Ha... why do you like CSV as an internal data format? I was
> >> thinking that this data structure wouldn't even go to a file; it
> >> could just be some Python tuples/namedtuples. We could indeed
> >> define an intermediate format, but I can't really see when that
> >> would be needed.
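[A sketch of the generic-CSV-importer idea discussed above: a factory that takes a per-institution field map and emits namedtuple records as the internal format. All names here (Txn, make_csv_importer, the column names) are hypothetical, not ledgerhub or CSV2Ledger code.]

```python
import csv
import io
from collections import namedtuple

# Hypothetical generic internal record, in the spirit of the
# "Python tuples/namedtuples" internal format mentioned above.
Txn = namedtuple("Txn", "date payee amount")

def make_csv_importer(field_map, **reader_opts):
    """Build an importer for one institution's CSV dialect.

    field_map maps that institution's column names to Txn fields,
    e.g. {"Posted Date": "date", "Description": "payee",
    "Amount": "amount"}.  reader_opts passes dialect options
    (delimiter, etc.) through to csv.DictReader.
    """
    def importer(fileobj):
        for row in csv.DictReader(fileobj, **reader_opts):
            yield Txn(**{dest: row[src] for src, dest in field_map.items()})
    return importer

# Usage with an invented bank export:
importer = make_csv_importer(
    {"Posted Date": "date", "Description": "payee", "Amount": "amount"})
data = io.StringIO(
    "Posted Date,Description,Amount\n2014-03-01,ACME CORP,-25.00\n")
print(list(importer(data)))
# [Txn(date='2014-03-01', payee='ACME CORP', amount='-25.00')]
```

[As noted in the thread, real sources need custom code beyond a field map (date parsing, sign flips, multi-line descriptions), which is why per-institution importers remain the default.]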
> >> Ideally, no edits at this stage would be necessary by the user, so
> >> I don't see a need to output that to a file.
> >>
> >> About programs: there are really only two separate programs/steps
> >> I use so far: 1. "import", which generates text that I append to
> >> my ledger file; 2. "file", which moves the files into a directory
> >> hierarchy. Both of these programs put together many of the steps
> >> described in the document, and I haven't found a need to separate
> >> them so much so far, except for debugging. Do you need them all
> >> separated? That could be done.
> >>
> >>> - I add an additional stage here that I'll call
> >>> "AccountAssignment". I examine several fields of the imported
> >>> record (things like employeeID, PONumber, etc. that are
> >>> associated with the transaction) to determine which DEB account
> >>> name to assign it to. Account names for all DEB systems should be
> >>> hierarchical, so that could still be done in a
> >>> DEB-software-agnostic manner. A more sophisticated version of
> >>> CSV2Ledger's PreProcess.yaml
> >>> (https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml)
> >>> could help drive this stage. The output of this stage is the same
> >>> CSV as above with a "DEBAccount" field appended to each record.
> >>
> >> I do the same thing :-) The way I weave this into my importers is
> >> that each importer in the configuration file defines a dictionary
> >> of required configuration variables, almost all of which are
> >> account names. When the importer creates the normalized
> >> transaction objects, it uses the account names from its
> >> configuration.
> >>
> >>>> - "Export": convert the internal transactions data structure to
> >>>> the syntax of one particular double-entry language
> >>>> implementation, Ledger or other. This spits out text.
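[The "AccountAssignment" stage described above could be sketched as an ordered list of (predicate, account) rules tried against the imported record's fields. The rule shapes, field values and account names below are all invented for illustration; a real version would load rules from something like PreProcess.yaml.]

```python
# Ordered rules: the first predicate that matches a record wins.
# Field values and account names here are hypothetical.
RULES = [
    (lambda r: r.get("PONumber", "").startswith("PO-LAB"),
     "Expenses:Lab:Supplies"),
    (lambda r: r.get("employeeID") == "E042",
     "Expenses:Travel:Andy"),
]

def assign_account(record, default="Expenses:Uncategorized"):
    """Return the first matching hierarchical account name for a record.

    Because the result is just a colon-separated hierarchical name,
    this stage stays agnostic of the downstream DEB software.
    """
    for predicate, account in RULES:
        if predicate(record):
            return account
    return default

print(assign_account({"PONumber": "PO-LAB-991", "employeeID": "E007"}))
# Expenses:Lab:Supplies
```

[This is essentially the same idea as the importer-configuration dictionaries of account names mentioned in the reply, just driven by per-record fields instead of per-source configuration.]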
> >>>
> >>> - I once again like the approach of CSV2Ledger.pl (see the source
> >>> code at
> >>> https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138).
> >>> It allows the FileMatches.yaml file to include a variable called
> >>> TxnOutputTemplate that specifies how to set up the ledger-cli
> >>> transaction in your journal file. A similar templating approach
> >>> could be used for other double-entry language file formats.
> >>
> >> That's an interesting idea. That will be done; it would be
> >> flexible, especially for Ledger output.
> >>
> >> For Beancount target output, the code base has functions to
> >> convert transactions back into text in its current text input
> >> format, and those can change; I had planned to use those. The new
> >> Beancount input syntax is a lot less flexible, so it's not as
> >> necessary to provide options for it. I'll add this to the design
> >> doc.
> >>
> >> Thanks for your comments.
> >> Please leave more on the doc.
> >>
> >> I'll try to fork out my current import code as soon as possible so
> >> others can contribute.

--

---
You received this message because you are subscribed to the Google Groups
"Ledger" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
