Martin, I really like the idea of a staged system, perhaps with a set of programs and drivers (see below).
I'd be interested in helping with a project along these lines. Unfortunately my programming skills are rusty, but I work with a colleague who might help out. My own processing approach is similar to yours. Apologies for the length and level of detail. I have not yet looked at Reckon in detail, so perhaps some of these ideas are already employed in other forms. My comments on each stage (plus one stage of my own) are below...

--Andy

On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
>
> Thinking about this more, there's the potential for a nice big project
> independent of all our Ledger implementations, to deal with external data.
> Here's the idea, five components of a single project:

- Thanks for dissecting things so nicely.

> - "Fetching": code that can automatically obtain the data by connecting to
> various data sources. The ledger-autosync attempts to do this using
> ofxclient for institutions that support OFX. This could include a scraping
> component for other institutions.

- The output of this stage would be a number of files of different formats -- OFX, a spectrum of CSV file formats, and others.

> - "Recognition": given a filename and its contents, automatically guess
> which institution and account it is for. Beancount's import package deals
> with this by allowing the user to specify a list of regexps that the file
> must match. I'm not entirely sure this can always be done irrespective of
> the user, as the account-id is often a required part of a regexp, but it
> might. This is used to automate "figuring out what to do" given a bunch of
> downloaded files in a directory, a great convenience. There is some code
> in ledger-autosync and the beancount.sources Python package.

- I really like the approach CSV2Ledger takes with its FileMatches.yaml file (https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml).
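To make the "Recognition" idea concrete, here is a rough sketch of the regexp-matching approach described above. This is only an illustration, not Beancount's actual API: the rule structure, field names, and the sample institutions/accounts are my own hypothetical invention.

```python
import re

# Hypothetical recognition rules, loosely modeled on Beancount's
# regexp-based recognition and CSV2Ledger's FileMatches.yaml.
# Both the filename and the file contents must match.
RULES = [
    {
        "institution": "AcmeBank",                   # invented example
        "account": "Assets:AcmeBank:Checking",
        "filename_re": r"acmebank.*\.csv$",
        "content_re": r"Date,Description,Amount",    # CSV header line
    },
    {
        "institution": "ExampleCard",                # invented example
        "account": "Liabilities:ExampleCard",
        "filename_re": r"\.ofx$",
        "content_re": r"<OFX>",
    },
]

def recognize(filename, contents):
    """Return the first rule whose regexps match both the filename
    and the file contents, or None if the file is unrecognized."""
    for rule in RULES:
        if (re.search(rule["filename_re"], filename, re.IGNORECASE)
                and re.search(rule["content_re"], contents)):
            return rule
    return None  # unrecognized: leave the file for manual handling
```

Given a directory of downloaded files, one could loop over them, call recognize() on each, and dispatch matches to the appropriate later stage.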
I think defining a spec for FileMatches.yaml that Perl, Python, or other code could employ in the following stages might be worthwhile. FileMatches.yaml (or its equivalent) would provide key information for later processing stages of files from different sources. For CSV files, information such as field separators, field names, a regex for "real" records, etc. could be specified here. The result of "Recognition" would be to pass the file off to a customized driver (see my next comment).

> - "Extraction": parse the file, CSV or OFX or otherwise, and extract a
> list of double-entry transactions data structures from it in some sort of
> generic internal format, independent of Ledger / HLedger / Beancount /
> other. The Reckon project aims to do this for CSV files.

- I suggest employing small driver programs, written by others, that ingest custom formats. The path to the appropriate driver program would be included in the FileMatches.yaml file (or its equivalent). These drivers would ingest the files output by the "Fetching" stage and generate the "generic internal format" you mention. However, in support of flexibility, I suggest that the result of this stage be a CSV file, in a strictly specified format, to be processed by the next stage.

- I add an additional stage here that I'll call "AccountAssignment". I examine several fields of the imported record (things like employeeID, PONumber, etc. that are associated with the transaction) to determine which DEB account name to assign to it. Account names in all DEB systems should be hierarchical, so this can still be done in a DEB-software-agnostic manner. A more sophisticated version of CSV2Ledger's PreProcess.yaml (https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml) could help drive this stage. The output of this stage is the same CSV as above with a "DEBAccount" field appended to each record.
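The "AccountAssignment" stage above could be sketched roughly as follows. Everything here is a hypothetical placeholder, in the spirit of CSV2Ledger's PreProcess.yaml: the intermediate CSV columns, the rule format, and the account names are my own invention, and a real spec would need to pin these down.

```python
import csv
import io

# Hypothetical assignment rules: each maps a (field, value-prefix) test
# on the intermediate CSV record to a hierarchical DEB account name.
ASSIGNMENT_RULES = [
    ("PONumber", "PO-7", "Expenses:Office:Supplies"),
    ("EmployeeID", "E42", "Expenses:Payroll:Reimbursements"),
]
DEFAULT_ACCOUNT = "Expenses:Uncategorized"

def assign_accounts(intermediate_csv):
    """Append a DEBAccount column to each record of the strictly
    specified intermediate CSV, using the first matching rule."""
    reader = csv.DictReader(io.StringIO(intermediate_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["DEBAccount"])
    writer.writeheader()
    for row in reader:
        account = DEFAULT_ACCOUNT
        for field, prefix, acct in ASSIGNMENT_RULES:
            if row.get(field, "").startswith(prefix):
                account = acct
                break  # first matching rule wins
        row["DEBAccount"] = account
        writer.writerow(row)
    return out.getvalue()
```

Because account names are hierarchical strings, the output stays agnostic of any particular DEB implementation; the "Export" stage can translate them into Ledger, HLedger, or Beancount syntax.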
> - "Export": convert the internal transactions data structure to the syntax
> of one particular double-entry language implementation, Ledger or other.
> This spits out text.

- I once again like the approach of CSV2Ledger.pl (see the source code at https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138). It allows the FileMatches.yaml file to include a variable called TxnOutputTemplate that specifies how to set up the ledger-cli transaction in your journal file. A similar templating approach could be used for the file formats of other double-entry languages.

> - "Filing": given the same files as for step 4 / extraction, figure out
> which Ledger account they correspond to and automatically sanitize the
> filenames, extract and add the date into it, and move them in a directory
> hierarchy corresponding to each account.
>
> Beancount's import code deals with steps 2, 3, 4, 5, but frankly I would
> much rather that code live in an external project shared with others. I'm
> thinking about forking it out and starting a new codebase for it.
>
>
> On Fri, Jan 24, 2014 at 9:57 AM, Martin Blais <[email protected]> wrote:
>
>> These would be better done in two separate steps IMHO:
>>
>> 1. extract the data from whichever external source format (e.g. OFX) into
>> an internal transaction data structure
>> 2. "complete" incomplete imported transaction objects by adding missing
>> legs using the past Ledger history
>>
>> About (1): CSV files are pretty rare. The only ones I've come across (in
>> my own little bubble of a world) are PayPal, OANDA, and Ameritrade. Much
>> more common for banks, investment and credit card companies is OFX and
>> Quicken files. I also find it convenient to recognize at least *some* data
>> from PDF files, such as the date of a statement, for automatic
>> classification and filing into a folder (you could apply machine learning
>> to this problem, i.e.
>> give a whole bunch of disorganized words from what is
>> largely imperfect PDF-to-text conversion, classify which statement it is,
>> but crafting a few regexps by hand has proved to work quite well so far).
>> I'll add anonymized example input files to Beancount for automated testing
>> at some point; they'll be going here:
>>
>> https://hg.furius.ca/public/beancount/file/tip/src/python/beancount/sources
>>
>> I'm thinking... maybe it would make sense for importers (mine and/or
>> yours) to spit out some sort of XML/JSON format that could be converted
>> into either Ledger or Beancount syntax or whatever else? This way all those
>> importers could be farmed out to another project and reused by users of
>> various accounting software. Does this make sense?
>>
>> About (2): If Ledger supports inputting incomplete transactions, you
>> could do this without relying on CSV conversion; that would be much more
>> reusable. In Beancount, my importers are allowed to create invalid
>> transaction objects, and I plan to put in a simple little perceptron
>> function that should do a good enough job of adding missing legs
>> automatically (one might call this "automatic categorization"),
>> independently of the input data format.
>>
>> Just some ideas,
>>
>>
>> On Fri, Jan 24, 2014 at 4:55 AM, Edwin van Leeuwen <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Reckon needs your help :)
>>>
>>> Reckon automagically converts CSV files for use with the command-line
>>> accounting tool Ledger. It also helps you to select the correct
>>> accounts associated with the CSV data using Bayesian machine learning.
>>> For more information see:
>>>
>>> http://blog.andrewcantino.com/blog/2010/11/06/command-line-accounting-with-ledger-and-reckon/
>>>
>>> We would like to expand Reckon's ability to automagically convert CSV
>>> files. It already supports quite a few formats, but we are interested
>>> in taking this further.
>>> For that we need more CSV examples, so that we
>>> can make sure those are correctly detected and especially make sure no
>>> mistakes are made. You could really help us out by sending us
>>> (anonymized) CSV files as produced by your bank. We'd add those
>>> examples to our test suite and make sure it all works well. Ideally,
>>> we'd need a CSV file containing a minimum of 5 transactions.
>>>
>>> The formats currently in the test suite are here:
>>>
>>> https://github.com/cantino/reckon/blob/master/spec/reckon/csv_parser_spec.rb#L207
>>>
>>> Full disclosure: I am not the original author, but I have been
>>> contributing code to make it correctly convert my CSV files :)
>>>
>>> Cheers, Edwin
>>>
>>> --
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "Ledger" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
