Thinking about this more, there's the potential for a nice big project
independent of all our Ledger implementations, to deal with external data.
Here's the idea: five components of a single project:

- "Fetching": code that automatically obtains the data by connecting to
various data sources. The ledger-autosync project attempts to do this using
ofxclient for institutions that support OFX. This could include a scraping
component for other institutions.
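To make the interface concrete, here is a minimal sketch of what a fetcher contract might look like; all class and method names here are hypothetical, and the canned fetcher merely stands in for a real ofxclient-based or scraping implementation:

```python
from abc import ABC, abstractmethod

class Fetcher(ABC):
    """One fetcher per institution (hypothetical interface)."""

    @abstractmethod
    def fetch(self, account_id: str) -> bytes:
        """Return the raw downloaded statement (OFX, CSV, ...)."""

class CannedFetcher(Fetcher):
    """Stand-in for an ofxclient-based or scraping fetcher: returns
    pre-recorded bytes, handy for testing the rest of the pipeline."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def fetch(self, account_id: str) -> bytes:
        return self.payload
```

The point of the abstraction is that everything downstream only ever sees raw bytes, regardless of how they were obtained.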

- "Recognition": given a filename and its contents, automatically guess
which institution and account it is for. Beancount's import package deals
with this by allowing the user to specify a list of regexps that the file
must match. I'm not entirely sure this can always be done independently of
the user, as the account id is often a required part of a regexp, but it
might be. This is used to automate "figuring out what to do" given a bunch
of downloaded files in a directory, a great convenience. There is some code
for this in ledger-autosync and in the beancount.sources Python package.
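The regexp scheme described above might look something like this; the rule table and label format are made up for illustration:

```python
import re

# Hypothetical rules: each pairs a (institution, account) label with
# regexps that the file contents must all match, in the spirit of
# Beancount's import package. Note the account id baked into a regexp.
RULES = [
    ("acme-bank:checking", [r"ACME BANK", r"ACCTID.?1234"]),
    ("oanda:fx", [r"OANDA"]),
]

def recognize(contents: str):
    """Return the first label whose regexps all match, else None."""
    for label, patterns in RULES:
        if all(re.search(p, contents) for p in patterns):
            return label
    return None
```

With something like this in place, a directory of freshly downloaded files can be dispatched to the right extractor without the user naming anything by hand.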

- "Extraction": parse the file, CSV or OFX or otherwise, and extract a list
of double-entry transaction data structures from it in some sort of
generic internal format, independent of Ledger / HLedger / Beancount /
other. The Reckon project aims to do this for CSV files.

- "Export": convert the internal transactions data structure to the syntax
of one particular double-entry language implementation, Ledger or other.
This spits out text.
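The export step is then just a renderer over the internal structure. A minimal sketch targeting Ledger's text syntax, with throwaway namedtuples standing in for whatever the internal format ends up being:

```python
from collections import namedtuple
import datetime

# Minimal stand-ins for the internal format (names are hypothetical).
Posting = namedtuple("Posting", "account amount currency")
Txn = namedtuple("Txn", "date payee postings")

def to_ledger(txn):
    """Emit one transaction in Ledger's plain-text journal syntax."""
    out = ["%s %s" % (txn.date.isoformat(), txn.payee)]
    for p in txn.postings:
        out.append("    %-40s %s %s" % (p.account, p.amount, p.currency))
    return "\n".join(out) + "\n"
```

An equivalent to_beancount() or to_hledger() would be a sibling function over the same structure, which is the whole point of keeping the internal format neutral.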

- "Filing": given the same files as for step 3 / extraction, figure out
which Ledger account they correspond to, automatically sanitize the
filenames, insert the statement date into them, and move them into a
directory hierarchy corresponding to each account.
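The filing step is mostly filename surgery. A sketch, assuming a particular naming and directory convention (date prefix, account components as directories) that is my invention here, not any project's actual layout:

```python
import os
import re
import shutil

def file_statement(path, account, date, dest_root):
    """Move a downloaded file into <dest_root>/<account components>/,
    prefixing the statement date and sanitizing the filename.
    The layout convention here is an assumption for illustration."""
    base = os.path.basename(path)
    base = re.sub(r"[^A-Za-z0-9._-]+", "_", base)
    dest_dir = os.path.join(dest_root, *account.split(":"))
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, "%s.%s" % (date.isoformat(), base))
    shutil.move(path, dest)
    return dest
```

For example, a messy download for Assets:Checking dated 2014-01-24 ends up at Assets/Checking/2014-01-24.<sanitized-name> under the destination root.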

Beancount's import code deals with steps 2, 3, 4, 5, but frankly I would
much rather that code live in an external project shared with others. I'm
thinking about forking it out and starting a new codebase for it.



On Fri, Jan 24, 2014 at 9:57 AM, Martin Blais <[email protected]> wrote:

> These would be better done in two separate steps IMHO:
>
> 1. extract the data from whichever external source format (e.g. OFX) into
> an internal transaction data structure
> 2. "complete" incomplete imported transaction objects by adding missing
> legs using the past Ledger history
>
> About (1): CSV files are pretty rare. The only ones I've come across (in
> my own little bubble of a world) are PayPal, OANDA, and Ameritrade. Much
> more common for banks, investment and credit card companies is OFX and
> Quicken files. I also find it convenient to recognize at least *some* data
> from PDF files, such as the date of a statement, for automatic
> classification and filing into a folder (you could apply machine learning
> to this problem, i.e. give a whole bunch of disorganized words from what is
> largely imperfect PDF to text conversion, classify which statement it is,
> but crafting a few regexps by hand has proved to work quite well so far).
>  I'll add anonymized example input files to Beancount for automated testing
> at some point, they'll be going here:
> https://hg.furius.ca/public/beancount/file/tip/src/python/beancount/sources
>
> I'm thinking.... maybe it would make sense for importers (mine and/or
> yours) to spit out some sort of XML/JSON format that could be converted
> into either Ledger or Beancount syntax or whatever else? This way all those
> importers could be farmed out to another project and reused by users of
> various accounting software. Does this make sense?
>
> About (2): If Ledger supports inputting incomplete transactions, you could
> do this without relying on CSV conversion, that would be much more
> reusable. In Beancount, my importers are allowed to create invalid
> transaction objects, and I plan to put in a simple little perceptron
> function that should do a good enough job of adding missing legs
> automatically (one might call this "automatic categorization"),
> independently of input data format.
>
> Just some ideas,
>
>
>
>
> On Fri, Jan 24, 2014 at 4:55 AM, Edwin van Leeuwen <[email protected]> wrote:
>
>> Hi all,
>>
>> Reckon needs your help :)
>>
>> Reckon automagically converts CSV files for use with the command-line
>> accounting tool Ledger. It also helps you to select the correct
>> accounts associated with the CSV data using Bayesian machine learning.
>> For more information see:
>>
>> http://blog.andrewcantino.com/blog/2010/11/06/command-line-accounting-with-ledger-and-reckon/
>>
>> We would like to expand reckon's ability to automagically convert csv
>> files. It already supports quite a few formats, but we are interested
>> in taking this further. For that we need more csv examples, so that we
>> can make sure those are correctly detected and especially make sure no
>> mistakes are made. You could really help us out by sending us
>> (anonymized) csv files as produced by your bank. We'd add those
>> examples to our test suite and make sure it all works well. Ideally,
>> we'd need a csv file containing a minimum of 5 transactions.
>>
>> The formats currently in the test suite are here:
>>
>> https://github.com/cantino/reckon/blob/master/spec/reckon/csv_parser_spec.rb#L207
>>
>> Full disclosure: I am not the original author, but have been
>> contributing code to make it correctly convert my csv files :)
>>
>> Cheers, Edwin
>>
>> --
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Ledger" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
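The "automatic categorization" of missing legs mentioned in the quoted message could start out even simpler than a perceptron: just reuse the account most often paired with the same payee in past history. A sketch (this frequency-count approach is a deliberately naive stand-in, not Beancount's planned implementation):

```python
from collections import Counter, defaultdict

def train(history):
    """history: iterable of (payee, category_account) pairs taken
    from past Ledger entries."""
    counts = defaultdict(Counter)
    for payee, account in history:
        counts[payee][account] += 1
    return counts

def categorize(counts, payee, default="Expenses:Unknown"):
    """Pick the account most often seen with this payee, or a
    placeholder account so the entry stays visibly incomplete."""
    if payee in counts:
        return counts[payee].most_common(1)[0][0]
    return default
```

A perceptron or naive Bayes over payee tokens (as Reckon does) would generalize to unseen payees, but exact-payee counting already covers the common recurring cases.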
