Martin,

I really like the idea of a staged system, perhaps with a set of programs 
and drivers (see below).

I'd be interested in helping with a project along these lines. 
Unfortunately my programming skills are rusty, but I work with a colleague 
who might help out. 

My own processing approach is similar to yours. Apologies for the length and 
level of detail. I have not looked at Reckon in detail yet, so perhaps some of 
these ideas are already employed there in other ways. My comments on each 
stage (plus one added stage of my own) are below...

--Andy

On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
>
> Thinking about this more, there's the potential for a nice big project 
> independent of all our Ledger implementations, to deal with external data. 
> Here's the idea, five components of a single project:
>
- Thanks for dissecting things so nicely.
 

> - "Fetching": code that can automatically obtain the data by connecting to 
> various data sources. The ledger-autosync attempts to do this using 
> ofxclient for institutions that support OFX. This could include a scraping 
> component for other institutions.
>
- The output of this stage would be a number of files in different formats 
-- OFX, a spectrum of CSV formats, and others.
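As a rough sketch of what I mean by this stage (everything here -- the 
function names, the directory layout -- is hypothetical, not any existing 
tool's API):

```python
import os

def run_fetchers(fetchers, download_dir):
    """Run every fetcher and write its payload into download_dir.

    Each fetcher is assumed to be a callable returning a
    (suggested_filename, raw_bytes) pair -- one per institution,
    producing OFX, CSV, or whatever that institution offers.
    """
    os.makedirs(download_dir, exist_ok=True)
    written = []
    for fetch in fetchers:
        filename, payload = fetch()
        path = os.path.join(download_dir, filename)
        with open(path, "wb") as f:
            f.write(payload)
        written.append(path)
    return written
```

The later stages would then only ever see a directory of files, regardless 
of how each one was obtained (ofxclient, scraping, manual download).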
 

> - "Recognition": given a filename and its contents, automatically guess 
> which institution and account it is for. Beancount's import package deals 
> with this by allowing the user to specify a list of regexps that the file 
> must match. I'm not entirely sure this can always be done irrespective of 
> the user, as the account-id is often a required part of a regexp, but it 
> might. This is used to automate "figuring out what to do" given a bunch of 
> downloaded files in a directory, a great convenience.  There is some code 
> in ledger-autosync and the beancount.sources Python package.
>
- I really like the approach CSV2Ledger takes with its FileMatches.yaml 
(https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml) file. 
 I think defining a spec for FileMatches.yaml that Perl, Python, or any 
other code could use in the following stages might be worthwhile. 
FileMatches.yaml (or the equivalent) would provide key information for 
future processing stages of files from different sources. For CSV files, 
information about field separators, field names, a regex for "real" 
records, etc. can be specified here. The result of "Recognition" would be 
to pass the file off to a customized driver (see my next comment).
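To make the idea concrete, here is a minimal sketch of what a recognizer 
driven by such a spec could look like. The field names below are my own 
invention, not the actual CSV2Ledger schema:

```python
import re

# Hypothetical entries mirroring what a FileMatches.yaml spec might parse
# to; the keys (name_regex, field_separator, driver, ...) are assumptions.
FILE_MATCHES = [
    {
        "name_regex": r"acmebank.*\.csv$",
        "institution": "Acme Bank",
        "field_separator": ",",
        "field_names": ["date", "payee", "amount"],
        # lines that are "real" records, as opposed to headers/footers
        "record_regex": r"^\d{4}-\d{2}-\d{2},",
        "driver": "drivers/acmebank_csv.py",
    },
]

def recognize(filename):
    """Return the first spec entry whose name_regex matches the filename,
    or None if the file is not recognized."""
    for entry in FILE_MATCHES:
        if re.search(entry["name_regex"], filename):
            return entry
    return None
```

The returned entry carries everything the extraction stage needs: which 
driver to invoke and how to parse the file's records.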
 

> - "Extraction": parse the file, CSV or OFX or otherwise, and extract a 
> list of double-entry transactions data structures from it in some sort of 
> generic internal format, independent of Ledger / HLedger / Beancount / 
> other.  The Reckon project aims to do this for CSV files.
>
- I suggest employing small driver programs, written by others, that ingest 
custom formats. The path to the appropriate driver program would be 
included in the FileMatches.yaml file (or its equivalent).  These drivers 
would ingest files output by the "Fetching" stage and generate the "generic 
internal format" you mention. However, in support of flexibility, I suggest 
that the result of this stage be a CSV file, in a strictly specified 
format, that would be processed by the next stage.
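A driver for one institution might look something like this. The 
intermediate column set (date, payee, amount, currency, source) and the 
source field names are assumptions for illustration only -- the actual 
strict format would be whatever the spec settles on:

```python
import csv
import io

# One possible strictly specified intermediate format.
NORMALIZED_FIELDS = ["date", "payee", "amount", "currency", "source"]

def acmebank_driver(raw_rows):
    """Hypothetical driver: ingest one institution's parsed rows and emit
    the strictly specified intermediate CSV as a string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=NORMALIZED_FIELDS)
    writer.writeheader()
    for row in raw_rows:
        writer.writerow({
            "date": row["Transaction Date"],
            "payee": row["Description"].strip(),
            "amount": row["Amount"],
            "currency": "USD",      # assumed fixed for this institution
            "source": "acmebank",
        })
    return out.getvalue()
```

Because every driver emits the same columns, the stages after this one 
never need to know which institution a record came from.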

- I would add an additional stage here that I'll call "AccountAssignment". 
I examine several fields of the imported record (things like employeeID, 
PONumber, etc. that are associated with the transaction) to determine which 
DEB account name to assign to it. Account names in all DEB systems should 
be hierarchical, so this could still be done in a DEB-software-agnostic 
manner. A more sophisticated version of CSV2Ledger's PreProcess.yaml 
(https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml) could 
help drive this stage. The output of this stage is the same CSV as above 
with a "DEBAccount" field appended to each record.
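A simple rule-driven version of this stage could look like the sketch 
below. The rules and the default account are made-up examples in the 
spirit of PreProcess.yaml, not anything from that file:

```python
import re

# Hypothetical rules: each one tests a record field against a regex and,
# on a match, yields a hierarchical, DEB-software-agnostic account name.
RULES = [
    ("payee", r"(?i)grocer|supermarket", "Expenses:Food:Groceries"),
    ("PONumber", r"^PO-7\d+", "Expenses:Office:Supplies"),
]

DEFAULT_ACCOUNT = "Expenses:Uncategorized"

def assign_account(record):
    """Return the record with a DEBAccount field appended."""
    for field, pattern, account in RULES:
        if re.search(pattern, record.get(field, "")):
            record["DEBAccount"] = account
            return record
    record["DEBAccount"] = DEFAULT_ACCOUNT
    return record
```

Unmatched records fall through to a catch-all account, so nothing is 
silently dropped and the user can review the leftovers by hand.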

>
> - "Export": convert the internal transactions data structure to the syntax 
> of one particular double-entry language implementation, Ledger or other. 
> This spits out text.
>
- I once again like the approach of CSV2Ledger.pl (see source code at 
https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138). It 
allows the FileMatches.yaml file to include a variable called 
TxnOutputTemplate that specifies how to set up the ledger-cli transaction 
in your journal file. A similar templating approach could be used for 
other double-entry language file formats.
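Something like the following, say. The template text and placeholder names 
here are assumptions of mine, not CSV2Ledger's actual TxnOutputTemplate 
variables; the point is only that each target syntax (Ledger, HLedger, 
Beancount, ...) becomes one more template, not one more code path:

```python
from string import Template

# Hypothetical per-target template in the spirit of TxnOutputTemplate.
LEDGER_TEMPLATE = Template(
    "$date $payee\n"
    "    $account    $amount $currency\n"
    "    $source_account\n"
)

def export_transaction(rec, source_account="Assets:Checking"):
    """Render one normalized record (with its DEBAccount field) into
    ledger-cli journal syntax."""
    return LEDGER_TEMPLATE.substitute(
        date=rec["date"],
        payee=rec["payee"],
        account=rec["DEBAccount"],
        amount=rec["amount"],
        currency=rec["currency"],
        source_account=source_account,
    )
```

Swapping in a Beancount or HLedger template would change only the string, 
leaving all the earlier stages untouched.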
 

> - "Filing": given the same files as for step 4 / extraction, figure out 
> which Ledger account they correspond to and automatically sanitize the 
> filenames, extract and add the date into it, and move them in a directory 
> hierarchy corresponding to each account.
>
> Beancount's import code deals with steps 2, 3, 4, 5, but frankly I would 
> much rather that code live in an external project shared with others. I'm 
> thinking about forking it out and starting a new codebase for it.
>
>
>
> On Fri, Jan 24, 2014 at 9:57 AM, Martin Blais <[email protected]> 
> wrote:
>
>> These would be better done in two separate steps IMHO:
>>
>> 1. extract the data from whichever external source format (e.g. OFX) into 
>> an internal transaction data structure
>> 2. "complete" incomplete imported transaction objects by adding missing 
>> legs using the past Ledger history
>>
>> About (1): CSV files are pretty rare. The only ones I've come across (in 
>> my own little bubble of a world) are PayPal, OANDA, and Ameritrade. Much 
>> more common for banks, investment and credit card companies is OFX and 
>> Quicken files. I also find it convenient to recognize at least *some* data 
>> from PDF files, such as the date of a statement, for automatic 
>> classification and filing into a folder (you could apply machine learning 
>> to this problem, i.e. give a whole bunch of disorganized words from what is 
>> largely imperfect PDF to text conversion, classify which statement it is, 
>> but crafting a few regexps by hand has proved to work quite well so far). 
>>  I'll add anonymized example input files to Beancount for automated testing 
>> at some point, they'll be going here:
>>
>> https://hg.furius.ca/public/beancount/file/tip/src/python/beancount/sources
>>
>> I'm thinking.... maybe it would make sense for importers (mine and/or 
>> yours) to spit out some sort of XML/JSON format that could be converted 
>> into either Ledger or Beancount syntax or whatever else? This way all those 
>> importers could be farmed out to another project and reused by users of 
>> various accounting software. Does this make sense?
>>
>> About (2): If Ledger supports inputting incomplete transactions, you 
>> could do this without relying on CSV conversion, that would be much more 
>> reusable. In Beancount, my importers are allowed to create invalid 
>> transaction objects, and I plan to put in a simple little perceptron 
>> function that should do a good enough job of adding missing legs 
>> automatically (one might call this "automatic categorization"), 
>> independently of input data format.
>>
>> Just some ideas,
>>
>>
>>
>>
>> On Fri, Jan 24, 2014 at 4:55 AM, Edwin van Leeuwen 
>> <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> Reckon needs your help :)
>>>
>>> Reckon automagically converts CSV files for use with the command-line
>>> accounting tool Ledger. It also helps you to select the correct
>>> accounts associated with the CSV data using Bayesian machine learning.
>>> For more information see:
>>>
>>> http://blog.andrewcantino.com/blog/2010/11/06/command-line-accounting-with-ledger-and-reckon/
>>>
>>> We would like to expand reckon's ability to automagically convert csv
>>> files. It already supports quite a few formats, but we are interested
>>> in taking this further. For that we need more csv examples, so that we
>>> can make sure those are correctly detected and especially make sure no
>>> mistakes are made. You could really help us out by sending us
>>> (anonymized) csv files as produced by your bank. We'd add those
>>> examples to our test suite and make sure it all works well. Ideally,
>>> we'd need a csv file containing a minimum of 5 transactions.
>>>
>>> The formats currently in the test suite are here:
>>>
>>> https://github.com/cantino/reckon/blob/master/spec/reckon/csv_parser_spec.rb#L207
>>>
>>> Full disclosure: I am not the original author, but have been
>>> contributing code to make it correctly convert my csv files :)
>>>
>>> Cheers, Edwin
>>>
>>> --
>>>
>>> ---
>>> You received this message because you are subscribed to the Google 
>>> Groups "Ledger" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>
