Re: A Query Language for Beancount

Martin Blais Sun, 19 Oct 2014 13:23:49 -0700

On Sun, Oct 19, 2014 at 4:38 AM, Wm... <[email protected]> wrote:

> Sat, 18 Oct 2014 13:30:15 <CAK21+hOon=Aph+bBTSMwxw-5p08-
> [email protected]>
> Martin Blais <[email protected]> wrote...
>
>  On Sat, Oct 18, 2014 at 12:45 PM, Wm... <[email protected]>
>> wrote:
>>
>>  Thu, 2 Oct 2014 23:37:09 <CAK21+hMx7wgt0imEtHp_WTFuYaqLTPJ+OwFHM+wjPS_
>>> [email protected]>
>>> Martin Blais <[email protected]> wrote...
>>>
>>> [not sub'd to hledger so dropped from ng list]
>>>
>>>  It's always important to me to state the reasons why this is not
>>> possible
>>>
>>>> as a regular table: filtering has to be operable on complete
>>>> transactions
>>>> (with all legs), and operations on inventories of commodities have to be
>>>> possible. These are the reasons that we cannot just spit out the data to
>>>> some SQLite table and then use that (I'd like to see someone argue the
>>>> opposite).
>>>>
>>>>
>>> Given that you seem to write your thoughts out I've missed the doc where
>>> you say what the reasons are it can't be done in SQLite (or postgres or
>>> mysql) tables.  Bear in mind that GnuCash uses any of the 3 mentioned dbs
>>> as storage methods alongside XML (cf plain text file).
>>>
>>>
>> I'm not so much interested in the storage method; a relational database of
>> XML can represent the internals of Beancount alright, that's easy,
>>
>
> ok

BTW I meant OR, not OF. A relational database, an XML file, a JSON file,
and many more are suitable formats for this simple dataset. Representing
the data is not the problem. Specifying outputs suitable to produce common
accounting reports (in the face of cost basis, multiple currencies) is the
minor challenge at hand.

>
>
>  the
>> issue is making simple queries on tree-structured data that produces
>> reports suitable for accounting, with operations on inventories of
>> positions.
>>
>
> I'm not sure I'm getting this.  Surely most of the objects (point in time,
> something, amount of that something, who had it before and who had it
> afterwards) are easily defined and represented with a bit of glue saying
> that-one-over-there and this-one-here belong together, let's call it a
> transaction ?
>

This is a simple and common database-related problem.

You have at least two tables:
One table contains information attached to the transaction.
The other table contains information attached to the posting.

Let's illustrate: let us simplify things grossly and say that transactions
have only

  (entry-id, date, flag, payee)

fields and that postings have only

  (posting-id, entry-id, account, amount, currency)

fields.

You can join the two tables on some id (easily doable) to obtain a table of
postings plus their corresponding entry data.
If so, you get rows of

  (entry-id, date, flag, payee, posting-id, account, amount, currency)
  ...

The rub here is that you have duplication of the entry information. For
example, given the following transaction:

  2014-09-12 * "WALMART"
    Assets:Checking    -45.00 USD
    Expenses:Toys    45.00 USD

You would obtain

  (1000, 2014-09-12, "*", "WALMART", 20000, "Assets:Checking", -45.00,
"USD")
  (1000, 2014-09-12, "*", "WALMART", 20001, "Expenses:Toys", 45.00, "USD")

Do you not see that the entry fields are duplicated?
Ok, so you might think, this is not much of a problem.

Well, you _can_ generate many meaningful reports this way, but not all
(read on). This is conceptually how Ledger works AFAIK: the filter applies
against the list of postings, as they appear in the file (Ledgerfolk:
please correct me if I'm wrong, this is how I understand it to be).

Deduplication

The problem I referred to as deduplication is that of combining rows of the
same transaction. This is how you would normally want to render a journal's
entries. For example:

  2014-09-12 * "WALMART"
    Assets:Checking    -45.00 USD
    Expenses:Toys    15.00 USD
    Expenses:Toys    30.00 USD

This should ideally render as a single entry in a journal for
Expenses:Toys. Beancount's journal rendering routines do this
automatically. A GROUP BY for all entry columns for a result set on a list
of postings would be awkward.

Operations on Inventories

A more difficult problem is that the Ledgerverse systems admit any number
and type of units to be stored and accumulated in association with an
account. Accounts don't have associated currencies. This means that summing
the changes defined by postings is not a simple arithmetic operation.
Inventory objects are more like "mappings" that contains associations of
(currency, cost, acquisition-date) to a number of units. Adding a new
position has one of two effects: it either creates a new lot, or if there
is already a lot for that (currency, cost, acquisition-date), it adds units
to that lot. Reducing a position is handled differently by Ledger and
Beancount: in Ledger, the operation works the same as with the addition of
new lots, except with a negative number. It allows positive and negative
lots in the same inventory. This naturally supports other methods of
booking (you can average all the cost bases, for example) but has the
unfortunate effect of admitting impossible situations, such as an inventory
with a negative cost basis. Reducing a position in Beancount is more
strict: it requires that an existing lot exist to be removed from.
Therefore, it never allows for both positive and negative numbers for the
same lot. This creates some complications of its own, in particular when
attempting to book positions at average cost (the lots of an inventory need
to be reaggregated) but has the advantage of preventing impossible
situations and resulting only in correct inventories. Both methods have
benefits and difficulties.

Such operations on inventories are difficult to model in SQL. I don't think
it can be modeled easily. I think it's preferable to just provide a query
engine that natively supports such an inventory object upon which positions
can be aggregated.

My new syntax will allow the user to join on the resulting inventory
positions to flatten them out to detailed lists of holdings (these are
useful to compute attributes of portfolios). For example, if the balance of
an account (or a group of accounts) is 20 GOOG, 25 AAPL and 30 MSFT, by
default a result would output a single row like this:

    ....     20 GOOG, 25 AAPL, 30 MSFT

So the user should be able to provide a special option ("FLATTEN") like
this:

  SELECT ... FROM ... WHERE ... FLATTEN;

to produce one result row for each lot, like this:

   .... 20 GOOG
   .... 25 AAPL
   .... 30 MSFT

Filtering on Entries

Another issue is that it's quite common to perform filtering on
transactions separately from filtering on postings. This has implicitly
been the default in Beancount. For instance, the most common thing I do is
to bring up the web interface, then drill down in the particular year I'm
interested in (this is a filter on all transactions for that year), and
then to bring up a journal/register listing for a particular account (this
is a filter on the postings). There are many use cases. Here's another one:
filter to the list of transactions for a particular tag, e.g., transactions
associated with a travel/trip, and then show all postings from accounts
under "Expenses:Transportation:...".

Now, it is true that both the previous examples are selectable on a single
joined table of postings & transactions (IF you have set of operations on
the list of tags, for the last one). Here's one that is slightly more
difficult:

  "filter all the transactions that have at least one posting to an account
which begins with "Assets:US:Vanguard" and report all postings associated
with these transactions with an account beginning with 'Income'"

This is also doable, like this:

  sqlite> select account, sum(number), currency from postings where id in
(select distinct t.id  from transactions as t join postings where account
glob 'Assets:US:Vanguard:*') and account glob "Income:*" group by account,
currency;
but it's slightly awkward. Since this is so common, I'd rather do something
like this:

  SELECT account, sum(change) FROM has_account('Assets:US:Vanguard:*')
WHERE account ~ "Income:*" group by account;

Finally, and more importantly, the ability to perform transformations on
the list of transactions, such as the "close" operation I describe later in
this email is impossible in SQL. This operation will be available via
syntax like this:

  SELECT .... FROM .... CLOSE ON 2014-01-01  WHERE ...

In this example, "CLOSE ON 2014-01-01" means "summarize all the entries
before Jan 1st 2014".

> Did you know that on a gnucash stored db
> ===
> Select
>   Sum(splits.value_num),
>   Sum(splits.quantity_num)
> From
>   splits
> ===
> gives me 0, 0 ?
>
> [1] see below
>
> that *is* the simplest balance sheet
>

How useful is that balance sheet?
Are you able to deal with multiple currencies?
How about rendering the values at cost? At market value?
How about rendering of each lot independently?

I'm going to try to fiddle with GnuCash to understand more clearly how it
tracks cost basis and what the resulting value and quantity numbers are and
summarize it.

what is it you are struggling to represent ?
>
> surely, in spite of what you say, it is complexity you want rather than
> simplicity ?

The minimal amount of complexity that will provide adequate reports.

(I suspect it's possible that your ledger is very simple and you might not
have encountered any of the situations we're attempting to deal with.)

 Your own recently mentioned implementation (bean-sql) creates multiple
>>> tables but it could actually be done in one table if you simply regard
>>> SQLite3 (the overhead and fuss to new users mitigates against the others)
>>> as a very efficient storage method and just take it from there.
>>>
>>>
>> No you can't.
>> How would you represent the two-level tree structure of "transactions <-
>> 1:N -> postings" with
>> - a single relational table, and
>> - no duplication (otherwise it's a derived form and not suitable for
>> representation)
>> ?
>>
>
> exactly as beancount and ledger-cli do now with line numbers as a kludge
> to represent a transaction if you want to be perverse, otherwise you put
> them next to each other or blob them, your choice, but not a problem that
> says only a plain text file makes this better because I haven't thought
> about it enough
>

I don't understand what you're trying to say.

contemporary dbs can easily split out stuff from a blob better than you
> spending years refining some transactional thing that only you care about,
> promise

Years? Yesterday alone I wrote a pretty complete SQL parser and today I'm
writing an interpreter for it.

Here:
https://bitbucket.org/blais/beancount/src/tip/src/python/beancount/query/query_parser.py?at=shell

You need to understand that despite its ubiquity and familiarity, SQL isn't
a silver bullet for everything. It is by no means perfect. Its does not
express hierarchical data formats well, which is partly why object
databases like MongoDB are enjoying a fair amount of popularity (not so
long ago all the rage was XML-based datastores with better defined
schemas). You can also see traces of this in all the ways that people
writing an access language for hierarchical formats have had to get
creative in order to provide access to nested and repeated fields, e.g.
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf
Also see how the different SQL database implementations each have their own
warts and limitations.

It is entirely reasonable to extend SQL and propose specializations for
particular use cases--everyone does it. You seem to think that SQL is some
sort of superior technology that has solved all data access specification
problems. SQL is a way to specify access to your data, but it is still the
case that application writers need to occasionally combine the output of
multiple queries with some code to perform some operations. I think that
with minimal extensions I can define a variant that works well for our
little accounting database.

>  If you have multiple tables, you end up having to do joins and
>> deduplication.
>>
>
> joins yes, deduplication no
>
> why do you think deduplication would be necessary?  genuinely puzzled

(See detailed explanation above with example about what I called
"deduplication".)

 With a single table, you have duplication of information.
>>
>
> nope, a single table could be just as simple as a plain text file
>
>  At the moment I output separate tables for transactions and postings,
>> which
>> can be joined, but it still doesn't allow you to make great queries that
>> involve the kind of filtering that is convenient for generating, e.g.
>> balance sheets.
>>
>
> as above [1] , you are incorrect

(I think I expressed myself clearly enough with the above.)

 The Beancount Query Language will specifically address these issues by:
>> - allowing filtering to occur at two levels in a single query:
>>  * filtering at the entries level (with predicates on postings allowed),
>> plus
>>
>
> free with SQL
>
>   * filtering at the postings level
>> - support inventory arithmetic and aggregation functions on inventory
>> objects and rendering functions (e.g., COST()).
>> - support flattening of multiple lots (to produce lists of holdings each
>> with their cost basis)
>>
>
> free with SQL if you know what you are doing

I explained the problems above.

If you prefer to use straight-up SQL, bean-sql creates the corresponding
database for you and you can use that if you wish. Ledger has an equivalent
conversion to SQL (I haven't looked at the tables it produces but it's fair
to assume there's more than one table). You're welcome to copy (within the
constraints of GPL) and change the script to make it output its tables
differently if you wish, this should be simple. In fact, we welcome this
kind of experimentation. Show us how you can generate all the same reports
using SQL only.

 - explicitly support the "close" operation as syntax, that inserts
>> transactions to clear income statement accounts to equity and removes past
>> history
>>
>
> Ah, now you see, I disagree with that from an accounting and mathematical
> POV.  I explicitly don't want that.  Transactions are a continuum over
> time, reporting on them may be necessary because one or other government
> says so but does what actually happened change because I have to state a
> position on a date to an authority ?  I say that is a report not a
> transaction, my accounts close when I die.

I think I might need to explain more clearly what I mean by "close" here.
You read this wrong. I'm referring to previous discussions that involved
"closing the year." The process of "closing" in Beancount refers to

- the insertion of transactions at a particular date that transfer the
balances of the Income and Expenses accounts to an Equity account (previous
earnings) to zero them out

- the insertion of a conversion entry to account for changes due to
currency conversion (to address the same problem you've been referring to
as "Trading accounts")

- the removal of previous transactions before the given date and
simultaneous insertion of "opening balances" transactions that maintains
the initial balance of those accounts

The idea is that after closing at a particular date, you have (1) removed
history and replaced it by summarizing transactions that initialize the
balances correctly and (2) income statement accounts are zero'ed out. This
results in a list of entries prepared to report on an exercise period that
begins on the given date.

You can find the source code in this file:
https://bitbucket.org/blais/beancount/src/tip/src/python/beancount/ops/summarize.py?at=default

I'm going to explain this operation with some simple diagrams in a
presentation I'm preparing. This is not experimental work... I've been
generating balance sheets and income statements with this process on my
entries for years now. This works well.

>  I claim that with the features above you can explicitly model all the
>> queries being done in Ledger, HLedger and Beancount.
>>
>
> I've read you say things like that before. Ho hum.

So the claim here is an interesting one actually, it's not just me "saying
things like that": so far, Ledger, Beancount and HLedger all have produced
reports generated by separate custom procedures for producing "balances"
and "journals" (or "registers" as Ledger calls them). What I'm claiming is
that we can _unify_ the generation of both types of reports from a single
syntax, simply by

(1) supporting aggregation functions that work on the inventory container
objects that we use internally to sum up position changes from postings, and
(2) by providing access to a *special* column (I call it "balance"), whose
value is dynamically computed to be the running sum of all the postings
rendered so far.

A "journal" just iterates over its postings, optionally rendering the
running balance by selecting the "balance" column:

  SELECT date, account, change, balance FROM ...

A "balance" report is simply an aggregation on the position of each posting:

  SELECT account, sum(change) FROM ... GROUP BY account

So the end result should be that instead of having multiple "custom" report
codes, we should be able to define a single query language that addresses
both types of outputs ... and possibly many more. I think there's
potentially great power hiding in that generalization!

 Within a few minutes you will start seeing your data and wanting to make
>>> some impression on it like "I want to put the transaction and transaction
>>> lines seperately" ... which is what you did :)
>>>
>>
>>
>> I don't understand this sentence.
>>
>
> What is unclear?
>
> you wrote bean-sql to represent a plain text beancount file as more than
> one table, why ? because it represented the data well is my suggestion.
>

I still don't understand what you intend to say with this, but see my
example above. It should be clear that two tables with a join is the
minimum required to represent the data correctly.

So here's the thing: we're interested in solving problems like these:
- Computing returns on portfolios in the face of contributions and
distributions, including commissions and other costs;
- Automatically handling cost basis for washed sales;
- Tracking of invoices and their related payments;
- Tracking investments with average cost basis;
and much, much more.

Do you have any solutions for problems like these?
If so, can we see some code?
How about some ideas in a sufficient amount of detail we can discuss them?

You're welcome to complain, but unless you provide sufficient and specific
detail so we can have a technical conversation about the problems (and not
the people), it's not very productive. Coded solutions are best.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Ledger" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: A Query Language for Beancount

Reply via email to