Thinking about this more, there's the potential for a nice big project independent of all our Ledger implementations, to deal with external data. Here's the idea, five components of a single project:
- "Fetching": code that can automatically obtain the data by connecting to
  various data sources. ledger-autosync attempts to do this using ofxclient
  for institutions that support OFX. This could include a scraping component
  for other institutions.

- "Recognition": given a filename and its contents, automatically guess
  which institution and account it is for. Beancount's import package deals
  with this by letting the user specify a list of regexps that the file must
  match. I'm not entirely sure this can always be done independently of the
  user, since the account id is often a required part of a regexp, but it
  might be possible. This is used to automate "figuring out what to do"
  given a bunch of downloaded files in a directory, which is a great
  convenience. There is some code for this in ledger-autosync and in the
  beancount.sources Python package.

- "Extraction": parse the file (CSV, OFX, or otherwise) and extract from it
  a list of double-entry transaction data structures in some generic
  internal format, independent of Ledger / HLedger / Beancount / others.
  The Reckon project aims to do this for CSV files.

- "Export": convert the internal transaction data structure to the syntax
  of one particular double-entry language implementation, Ledger or other.
  This spits out text.

- "Filing": given the same files as in the extraction step, figure out
  which Ledger account they correspond to, sanitize the filenames
  (extracting the date and inserting it into the name), and move them into
  a directory hierarchy corresponding to each account.

Beancount's import code deals with steps 2, 3, 4 and 5, but frankly I would
much rather that code live in an external project shared with others. I'm
thinking about forking it out and starting a new codebase for it.

On Fri, Jan 24, 2014 at 9:57 AM, Martin Blais <[email protected]> wrote:

> These would be better done in two separate steps IMHO:
>
> 1. extract the data from whichever external source format (e.g. OFX) into
>    an internal transaction data structure
> 2. "complete" incomplete imported transaction objects by adding missing
>    legs using the past Ledger history
>
> About (1): CSV files are pretty rare. The only ones I've come across (in
> my own little bubble of a world) are PayPal, OANDA, and Ameritrade. Much
> more common for banks, investment and credit card companies are OFX and
> Quicken files. I also find it convenient to recognize at least *some* data
> from PDF files, such as the date of a statement, for automatic
> classification and filing into a folder (you could apply machine learning
> to this problem, i.e. given a whole bunch of disorganized words from what
> is largely imperfect PDF-to-text conversion, classify which statement it
> is, but crafting a few regexps by hand has proved to work quite well so
> far). I'll add anonymized example input files to Beancount for automated
> testing at some point; they'll be going here:
> https://hg.furius.ca/public/beancount/file/tip/src/python/beancount/sources
>
> I'm thinking... maybe it would make sense for importers (mine and/or
> yours) to spit out some sort of XML/JSON format that could be converted
> into either Ledger or Beancount syntax, or whatever else? This way all
> those importers could be farmed out to another project and reused by
> users of various accounting software. Does this make sense?
>
> About (2): If Ledger supports inputting incomplete transactions, you
> could do this without relying on CSV conversion, which would be much more
> reusable. In Beancount, my importers are allowed to create invalid
> transaction objects, and I plan to put in a simple little perceptron
> function that should do a good enough job of adding missing legs
> automatically (one might call this "automatic categorization"),
> independently of the input data format.
>
> Just some ideas,
>
>
> On Fri, Jan 24, 2014 at 4:55 AM, Edwin van Leeuwen <[email protected]> wrote:
>
>> Hi all,
>>
>> Reckon needs your help :)
>>
>> Reckon automagically converts CSV files for use with the command-line
>> accounting tool Ledger. It also helps you to select the correct
>> accounts associated with the CSV data using Bayesian machine learning.
>> For more information see:
>>
>> http://blog.andrewcantino.com/blog/2010/11/06/command-line-accounting-with-ledger-and-reckon/
>>
>> We would like to expand Reckon's ability to automagically convert CSV
>> files. It already supports quite a few formats, but we are interested
>> in taking this further. For that we need more CSV examples, so that we
>> can make sure those are correctly detected and, especially, that no
>> mistakes are made. You could really help us out by sending us
>> (anonymized) CSV files as produced by your bank. We'd add those
>> examples to our test suite and make sure it all works well. Ideally,
>> we'd need a CSV file containing a minimum of 5 transactions.
>>
>> The formats currently in the test suite are here:
>>
>> https://github.com/cantino/reckon/blob/master/spec/reckon/csv_parser_spec.rb#L207
>>
>> Full disclosure: I am not the original author, but have been
>> contributing code to make it correctly convert my CSV files :)
>>
>> Cheers,
>> Edwin
>>
>> --
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Ledger" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
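P.S. To make the interchange-format idea concrete, here is a rough sketch of what a tool-neutral transaction structure, its JSON serialization, and a Ledger-syntax export could look like. The Transaction/Posting classes and all field names below are invented for illustration; this is not an existing schema:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Posting:
    account: str   # target account; the "missing leg" would leave this unset
    amount: str    # decimal kept as a string to avoid float rounding
    currency: str = "USD"

@dataclass
class Transaction:
    date: str      # ISO 8601 date in the interchange format
    payee: str
    postings: list = field(default_factory=list)

def to_json(txns):
    """Serialize a list of transactions to the hypothetical JSON format."""
    return json.dumps([asdict(t) for t in txns], indent=2)

def to_ledger(txn):
    """Render one transaction in Ledger's journal syntax (the Export step)."""
    lines = ["{} {}".format(txn.date.replace("-", "/"), txn.payee)]
    for p in txn.postings:
        lines.append("    {:<40}{} {}".format(p.account, p.amount, p.currency))
    return "\n".join(lines)

txn = Transaction("2014-01-24", "ACME Groceries",
                  [Posting("Expenses:Food", "42.17"),
                   Posting("Assets:Checking", "-42.17")])
print(to_ledger(txn))
```

The point is only the separation: extraction produces the neutral structure, and a thin per-tool export function (one each for Ledger, HLedger, Beancount, ...) turns it into text.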
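The "Recognition" step could be driven by a user-supplied table of regexps, roughly in the spirit of Beancount's import configuration. Everything here (the MATCHERS table, its labels, and the patterns) is made up for illustration; real entries would include account ids, which is why this likely can't ship user-independent:

```python
import re

# Hypothetical recognition table: each entry pairs an institution/account
# label with a regexp the filename must match and one the contents must match.
MATCHERS = [
    ("oanda:spot",    r"oanda",          r"OANDA Corporation"),
    ("td:checking",   r"\.ofx$",         r"<ACCTID>1234"),
    ("paypal:main",   r"paypal.*\.csv$", r"Date,Name,Type"),
]

def recognize(filename, contents):
    """Return the first label whose filename and contents patterns both match."""
    for label, fname_re, body_re in MATCHERS:
        if re.search(fname_re, filename, re.I) and re.search(body_re, contents):
            return label
    return None

print(recognize("Downloads/paypal-2014-01.csv", "Date,Name,Type\n..."))
```

Given a directory of freshly downloaded files, looping recognize() over them is what automates "figuring out what to do", and the same label can drive the Filing step's directory hierarchy.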
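And the "automatic categorization" Martin describes could start out as simple as this majority-vote sketch over past history; it is a stand-in for the perceptron or Bayesian classifier mentioned in the thread, and the payees and accounts are made up:

```python
from collections import defaultdict

def train(history):
    """Count, per payee token, which account the past transactions used."""
    counts = defaultdict(lambda: defaultdict(int))
    for payee, account in history:
        for token in payee.lower().split():
            counts[token][account] += 1
    return counts

def predict(counts, payee, default="Expenses:Uncategorized"):
    """Guess the missing leg's account by summing token votes."""
    scores = defaultdict(int)
    for token in payee.lower().split():
        for account, n in counts[token].items():
            scores[account] += n
    return max(scores, key=scores.get) if scores else default

model = train([("ACME GROCERIES 042", "Expenses:Food"),
               ("ACME GROCERIES 017", "Expenses:Food"),
               ("SHELL GAS",          "Expenses:Auto:Fuel")])
print(predict(model, "ACME GROCERIES 099"))
```

Crucially, this operates on the already-extracted transaction objects, so it is independent of the input data format, exactly as argued above.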
