On Sun, Oct 19, 2014 at 4:38 AM, Wm... <[email protected]> wrote:
> Sat, 18 Oct 2014 13:30:15 <CAK21+hOon=Aph+bBTSMwxw-5p08- > [email protected]> > Martin Blais <[email protected]> wrote... > > On Sat, Oct 18, 2014 at 12:45 PM, Wm... <[email protected]> >> wrote: >> >> Thu, 2 Oct 2014 23:37:09 <CAK21+hMx7wgt0imEtHp_WTFuYaqLTPJ+OwFHM+wjPS_ >>> [email protected]> >>> Martin Blais <[email protected]> wrote... >>> >>> [not sub'd to hledger so dropped from ng list] >>> >>> It's always important to me to state the reasons why this is not >>> possible >>> >>>> as a regular table: filtering has to be operable on complete >>>> transactions >>>> (with all legs), and operations on inventories of commodities have to be >>>> possible. These are the reasons that we cannot just spit out the data to >>>> some SQLite table and then use that (I'd like to see someone argue the >>>> opposite). >>>> >>>> >>> Given that you seem to write your thoughts out I've missed the doc where >>> you say what the reasons are it can't be done in SQLite (or postgres or >>> mysql) tables. Bear in mind that GnuCash uses any of the 3 mentioned dbs >>> as storage methods alongside XML (cf plain text file). >>> >>> >> I'm not so much interested in the storage method; a relational database of >> XML can represent the internals of Beancount alright, that's easy, >> > > ok BTW I meant OR, not OF. A relational database, an XML file, a JSON file, and many more are suitable formats for this simple dataset. Representing the data is not the problem. Specifying outputs suitable to produce common accounting reports (in the face of cost basis, multiple currencies) is the minor challenge at hand. > > > the >> issue is making simple queries on tree-structured data that produces >> reports suitable for accounting, with operations on inventories of >> positions. >> > > I'm not sure I'm getting this. Surely most of the objects (point in time, > something, amount of that something, who had it before and who had it > afterwards) are easily defined and represented with a bit of glue saying > that-one-over-there and this-one-here belong together, let's call it a > transaction ? > This is a simple and common database-related problem. You have at least two tables: One table contains information attached to the transaction. The other table contains information attached to the posting. Let's illustrate: let us simplify things grossly and say that transactions have only (entry-id, date, flag, payee) fields and that postings have only (posting-id, entry-id, account, amount, currency) fields. You can join the two tables on some id (easily doable) to obtain a table of postings plus their corresponding entry data. If so, you get rows of (entry-id, date, flag, payee, posting-id, account, amount, currency) ... The rub here is that you have duplication of the entry information. For example, given the following transaction: 2014-09-12 * "WALMART" Assets:Checking -45.00 USD Expenses:Toys 45.00 USD You would obtain (1000, 2014-09-12, "*", "WALMART", 20000, "Assets:Checking", -45.00, "USD") (1000, 2014-09-12, "*", "WALMART", 20001, "Expenses:Toys", 45.00, "USD") Do you not see that the entry fields are duplicated? Ok, so you might think, this is not much of a problem. Well, you _can_ generate many meaningful reports this way, but not all (read on). This is conceptually how Ledger works AFAIK: the filter applies against the list of postings, as they appear in the file (Ledgerfolk: please correct me if I'm wrong, this is how I understand it to be). Deduplication The problem I referred to as deduplication is that of combining rows of the same transaction. This is how you would normally want to render a journal's entries. For example: 2014-09-12 * "WALMART" Assets:Checking -45.00 USD Expenses:Toys 15.00 USD Expenses:Toys 30.00 USD This should ideally render as a single entry in a journal for Expenses:Toys. Beancount's journal rendering routines do this automatically. A GROUP BY for all entry columns for a result set on a list of postings would be awkward. Operations on Inventories A more difficult problem is that the Ledgerverse systems admit any number and type of units to be stored and accumulated in association with an account. Accounts don't have associated currencies. This means that summing the changes defined by postings is not a simple arithmetic operation. Inventory objects are more like "mappings" that contains associations of (currency, cost, acquisition-date) to a number of units. Adding a new position has one of two effects: it either creates a new lot, or if there is already a lot for that (currency, cost, acquisition-date), it adds units to that lot. Reducing a position is handled differently by Ledger and Beancount: in Ledger, the operation works the same as with the addition of new lots, except with a negative number. It allows positive and negative lots in the same inventory. This naturally supports other methods of booking (you can average all the cost bases, for example) but has the unfortunate effect of admitting impossible situations, such as an inventory with a negative cost basis. Reducing a position in Beancount is more strict: it requires that an existing lot exist to be removed from. Therefore, it never allows for both positive and negative numbers for the same lot. This creates some complications of its own, in particular when attempting to book positions at average cost (the lots of an inventory need to be reaggregated) but has the advantage of preventing impossible situations and resulting only in correct inventories. Both methods have benefits and difficulties. Such operations on inventories are difficult to model in SQL. I don't think it can be modeled easily. I think it's preferable to just provide a query engine that natively supports such an inventory object upon which positions can be aggregated. My new syntax will allow the user to join on the resulting inventory positions to flatten them out to detailed lists of holdings (these are useful to compute attributes of portfolios). For example, if the balance of an account (or a group of accounts) is 20 GOOG, 25 AAPL and 30 MSFT, by default a result would output a single row like this: .... 20 GOOG, 25 AAPL, 30 MSFT So the user should be able to provide a special option ("FLATTEN") like this: SELECT ... FROM ... WHERE ... FLATTEN; to produce one result row for each lot, like this: .... 20 GOOG .... 25 AAPL .... 30 MSFT Filtering on Entries Another issue is that it's quite common to perform filtering on transactions separately from filtering on postings. This has implicitly been the default in Beancount. For instance, the most common thing I do is to bring up the web interface, then drill down in the particular year I'm interested in (this is a filter on all transactions for that year), and then to bring up a journal/register listing for a particular account (this is a filter on the postings). There are many use cases. Here's another one: filter to the list of transactions for a particular tag, e.g., transactions associated with a travel/trip, and then show all postings from accounts under "Expenses:Transportation:...". Now, it is true that both the previous examples are selectable on a single joined table of postings & transactions (IF you have set of operations on the list of tags, for the last one). Here's one that is slightly more difficult: "filter all the transactions that have at least one posting to an account which begins with "Assets:US:Vanguard" and report all postings associated with these transactions with an account beginning with 'Income'" This is also doable, like this: sqlite> select account, sum(number), currency from postings where id in (select distinct t.id from transactions as t join postings where account glob 'Assets:US:Vanguard:*') and account glob "Income:*" group by account, currency; but it's slightly awkward. Since this is so common, I'd rather do something like this: SELECT account, sum(change) FROM has_account('Assets:US:Vanguard:*') WHERE account ~ "Income:*" group by account; Finally, and more importantly, the ability to perform transformations on the list of transactions, such as the "close" operation I describe later in this email is impossible in SQL. This operation will be available via syntax like this: SELECT .... FROM .... CLOSE ON 2014-01-01 WHERE ... In this example, "CLOSE ON 2014-01-01" means "summarize all the entries before Jan 1st 2014". > Did you know that on a gnucash stored db > === > Select > Sum(splits.value_num), > Sum(splits.quantity_num) > From > splits > === > gives me 0, 0 ? > > [1] see below > > that *is* the simplest balance sheet > How useful is that balance sheet? Are you able to deal with multiple currencies? How about rendering the values at cost? At market value? How about rendering of each lot independently? I'm going to try to fiddle with GnuCash to understand more clearly how it tracks cost basis and what the resulting value and quantity numbers are and summarize it. what is it you are struggling to represent ? > > surely, in spite of what you say, it is complexity you want rather than > simplicity ? The minimal amount of complexity that will provide adequate reports. (I suspect it's possible that your ledger is very simple and you might not have encountered any of the situations we're attempting to deal with.) Your own recently mentioned implementation (bean-sql) creates multiple >>> tables but it could actually be done in one table if you simply regard >>> SQLite3 (the overhead and fuss to new users mitigates against the others) >>> as a very efficient storage method and just take it from there. >>> >>> >> No you can't. >> How would you represent the two-level tree structure of "transactions <- >> 1:N -> postings" with >> - a single relational table, and >> - no duplication (otherwise it's a derived form and not suitable for >> representation) >> ? >> > > exactly as beancount and ledger-cli do now with line numbers as a kludge > to represent a transaction if you want to be perverse, otherwise you put > them next to each other or blob them, your choice, but not a problem that > says only a plain text file makes this better because I haven't thought > about it enough > I don't understand what you're trying to say. contemporary dbs can easily split out stuff from a blob better than you > spending years refining some transactional thing that only you care about, > promise Years? Yesterday alone I wrote a pretty complete SQL parser and today I'm writing an interpreter for it. Here: https://bitbucket.org/blais/beancount/src/tip/src/python/beancount/query/query_parser.py?at=shell You need to understand that despite its ubiquity and familiarity, SQL isn't a silver bullet for everything. It is by no means perfect. Its does not express hierarchical data formats well, which is partly why object databases like MongoDB are enjoying a fair amount of popularity (not so long ago all the rage was XML-based datastores with better defined schemas). You can also see traces of this in all the ways that people writing an access language for hierarchical formats have had to get creative in order to provide access to nested and repeated fields, e.g. http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf Also see how the different SQL database implementations each have their own warts and limitations. It is entirely reasonable to extend SQL and propose specializations for particular use cases--everyone does it. You seem to think that SQL is some sort of superior technology that has solved all data access specification problems. SQL is a way to specify access to your data, but it is still the case that application writers need to occasionally combine the output of multiple queries with some code to perform some operations. I think that with minimal extensions I can define a variant that works well for our little accounting database. > If you have multiple tables, you end up having to do joins and >> deduplication. >> > > joins yes, deduplication no > > why do you think deduplication would be necessary? genuinely puzzled (See detailed explanation above with example about what I called "deduplication".) With a single table, you have duplication of information. >> > > nope, a single table could be just as simple as a plain text file > > At the moment I output separate tables for transactions and postings, >> which >> can be joined, but it still doesn't allow you to make great queries that >> involve the kind of filtering that is convenient for generating, e.g. >> balance sheets. >> > > as above [1] , you are incorrect (I think I expressed myself clearly enough with the above.) The Beancount Query Language will specifically address these issues by: >> - allowing filtering to occur at two levels in a single query: >> * filtering at the entries level (with predicates on postings allowed), >> plus >> > > free with SQL > > * filtering at the postings level >> - support inventory arithmetic and aggregation functions on inventory >> objects and rendering functions (e.g., COST()). >> - support flattening of multiple lots (to produce lists of holdings each >> with their cost basis) >> > > free with SQL if you know what you are doing I explained the problems above. If you prefer to use straight-up SQL, bean-sql creates the corresponding database for you and you can use that if you wish. Ledger has an equivalent conversion to SQL (I haven't looked at the tables it produces but it's fair to assume there's more than one table). You're welcome to copy (within the constraints of GPL) and change the script to make it output its tables differently if you wish, this should be simple. In fact, we welcome this kind of experimentation. Show us how you can generate all the same reports using SQL only. - explicitly support the "close" operation as syntax, that inserts >> transactions to clear income statement accounts to equity and removes past >> history >> > > Ah, now you see, I disagree with that from an accounting and mathematical > POV. I explicitly don't want that. Transactions are a continuum over > time, reporting on them may be necessary because one or other government > says so but does what actually happened change because I have to state a > position on a date to an authority ? I say that is a report not a > transaction, my accounts close when I die. I think I might need to explain more clearly what I mean by "close" here. You read this wrong. I'm referring to previous discussions that involved "closing the year." The process of "closing" in Beancount refers to - the insertion of transactions at a particular date that transfer the balances of the Income and Expenses accounts to an Equity account (previous earnings) to zero them out - the insertion of a conversion entry to account for changes due to currency conversion (to address the same problem you've been referring to as "Trading accounts") - the removal of previous transactions before the given date and simultaneous insertion of "opening balances" transactions that maintains the initial balance of those accounts The idea is that after closing at a particular date, you have (1) removed history and replaced it by summarizing transactions that initialize the balances correctly and (2) income statement accounts are zero'ed out. This results in a list of entries prepared to report on an exercise period that begins on the given date. You can find the source code in this file: https://bitbucket.org/blais/beancount/src/tip/src/python/beancount/ops/summarize.py?at=default I'm going to explain this operation with some simple diagrams in a presentation I'm preparing. This is not experimental work... I've been generating balance sheets and income statements with this process on my entries for years now. This works well. > I claim that with the features above you can explicitly model all the >> queries being done in Ledger, HLedger and Beancount. >> > > I've read you say things like that before. Ho hum. So the claim here is an interesting one actually, it's not just me "saying things like that": so far, Ledger, Beancount and HLedger all have produced reports generated by separate custom procedures for producing "balances" and "journals" (or "registers" as Ledger calls them). What I'm claiming is that we can _unify_ the generation of both types of reports from a single syntax, simply by (1) supporting aggregation functions that work on the inventory container objects that we use internally to sum up position changes from postings, and (2) by providing access to a *special* column (I call it "balance"), whose value is dynamically computed to be the running sum of all the postings rendered so far. A "journal" just iterates over its postings, optionally rendering the running balance by selecting the "balance" column: SELECT date, account, change, balance FROM ... A "balance" report is simply an aggregation on the position of each posting: SELECT account, sum(change) FROM ... GROUP BY account So the end result should be that instead of having multiple "custom" report codes, we should be able to define a single query language that addresses both types of outputs ... and possibly many more. I think there's potentially great power hiding in that generalization! Within a few minutes you will start seeing your data and wanting to make >>> some impression on it like "I want to put the transaction and transaction >>> lines seperately" ... which is what you did :) >>> >> >> >> I don't understand this sentence. >> > > What is unclear? > > you wrote bean-sql to represent a plain text beancount file as more than > one table, why ? because it represented the data well is my suggestion. > I still don't understand what you intend to say with this, but see my example above. It should be clear that two tables with a join is the minimum required to represent the data correctly. So here's the thing: we're interested in solving problems like these: - Computing returns on portfolios in the face of contributions and distributions, including commissions and other costs; - Automatically handling cost basis for washed sales; - Tracking of invoices and their related payments; - Tracking investments with average cost basis; and much, much more. Do you have any solutions for problems like these? If so, can we see some code? How about some ideas in a sufficient amount of detail we can discuss them? You're welcome to complain, but unless you provide sufficient and specific detail so we can have a technical conversation about the problems (and not the people), it's not very productive. Coded solutions are best. -- --- You received this message because you are subscribed to the Google Groups "Ledger" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
