Thank you, Samuel. I am trying some of this out right now...

This is great information. Thanks so much. This is a huge help.

Bob

On Tue, Oct 2, 2012 at 4:09 PM, Samuel Gendler <sgend...@ideasculptor.com> wrote:

> One last comment - an alternative solution to this is to use array_agg(),
> which will simply add each value that matches a group by clause to an
> array. If you are careful to structure your query with left joins so that
> every grouping will have the same number of rows, then you can get a
> quick and dirty crosstab like this:
>
>     select row_name, array_agg(value)
>     from test t
>     left join value_table v on t.id = v.id
>     group by 1;
>
> Obviously, you'll want to deal with your potential for duplicate rows via
> max() or avg() in a subquery, rather than joining directly to the table,
> but you should get the idea from that example. You can also use coalesce
> to convert nulls to some other value, if required.
>
> Since the crosstab functions already require you to do all that work with
> regard to determining column names and building up the record structure,
> using array_agg can be every bit as effective, since it basically
> requires the same process: first query for all possible names, then issue
> a query that will cause the values to be processed by array_agg in column
> order, then iterate over results, getting each array value and
> associating it with a particular name.
>
> Your result will look like this:
>
>     id_name, start_time, end_time, array_of_values
>
> That may or may not be convenient for you, depending upon how you are
> using the resultset you get back. You'll still need to play all the same
> games with regard to unioning multiple queries together to pivot data
> from multiple tables into the same row.
>
> On Tue, Oct 2, 2012 at 12:57 PM, Samuel Gendler <sgend...@ideasculptor.com> wrote:
>
>> On Tue, Oct 2, 2012 at 2:45 AM, Robert Buck <buck.rober...@gmail.com> wrote:
>>
>>> Hi Samuel,
>>>
>>> Thank you. This may be a bit of a stretch for you, but would it be
>>> possible for me to peek at a sanitized version of your crosstab query,
>>> as a good example of how to do this for this noob?
>>>
>>> This will be pretty common in my case. The biggest tables will get
>>> much larger, as they are raw metric feeds which at some point need to
>>> be fed through reporting engines to analyze and spot regressions.
>>>
>>> Lastly, am I simply using the wrong tech for data feeds and analytics?
>>> The first cut of this used flat files and R, and though it scoured
>>> thousands of files it was much faster than the SQL I wrote here. The
>>> big goal was to get this off disk and into a database, but as it's
>>> highly variable, very sparse metric data, this is why I chose k-v. SQL
>>> databases are internally more politically acceptable, though I am
>>> personally agnostic on the matter. In the end it would be nice to
>>> report directly off a database, but so long as I can transform to csv
>>> I can always perform reporting and analytics in R, and optionally map
>>> and reduce natively in Ruby. Sane? Ideas? This is early on, and I'm
>>> willing to adjust course and find a better way if suggestions indicate
>>> such. I've heard a couple of options so far.
>>
>> OK, you owe me. I think I just spent more than an hour writing this up
>> ;-)
>>
>> Given the numbers of rows you are talking about, I can't think of any
>> good reason why the database shouldn't do what you need it to do pretty
>> effectively/quickly.
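
Trying the array_agg() idea from Sam's last mail out for myself - below is
the minimal sketch I ended up with. The throwaway tables and names here just
stand in for my real schema, and with duplicate rows I'd collapse them with
a max() subquery first, as Sam suggests:

    -- throwaway stand-ins for the real schema
    create temp table test (id int, id_name text);
    create temp table value_table (id int, key text, value numeric);

    insert into test values (1, 'run_1'), (2, 'run_2');
    insert into value_table values
        (1, 'cpu_time', 41.2), (1, 'mem_peak', 512), (2, 'cpu_time', 39.8);

    -- pair every test row with every key so each group has the same number
    -- of array slots; coalesce() turns the missing slots into 0
    select t.id_name as row_name,
           array_agg(coalesce(v.value, 0) order by k.key) as vals
    from test t
    cross join (select distinct key from value_table) k
    left join value_table v on v.id = t.id and v.key = k.key
    group by t.id_name
    order by t.id_name;

    -- run_1 => {41.2,512}, run_2 => {39.8,0}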
>> Some questions before I provide crosstab samples - your example query
>> has a hardcoded set of keys that it is looking for, but maybe that was
>> code-generated. There are multiple forms of the crosstab function, some
>> of which rely on a constant column count and others which can generate
>> the set of columns based on another query. There are complications in
>> that second form relating to race conditions with regard to new keys
>> showing up between the query for the columns and the query for the data,
>> so it is important to understand that and either structure queries
>> accordingly or make sure you execute in a transaction with sufficient
>> transaction isolation to prevent the race condition from appearing.
>>
>> For crosstab queries, you generally want a query that returns results in
>> the following form:
>>
>>     row_name | key | value
>>
>> where row_name would be a date or some other label that all values that
>> should share a row will have in common, key is the field that defines a
>> column in the final row, and value is the value to go in that column. If
>> it is the case that you always want ALL keys, then you can simply do a
>> left join to ensure that you get a row for every key, regardless of
>> whether there is an actual value. You can use COALESCE to turn nulls
>> resulting from left joins that don't match into '0' values. If there is
>> the potential for multiple rows with the same key and row_name, you need
>> to use an aggregate function (like max or avg) to compress all rows into
>> a single row.
>>
>> The crosstab(text sql) form of the function requires that it gets a
>> value for every possible column in every single row. It makes some naive
>> assumptions about the presence of a value for every spot in the matrix.
>> It also cannot include extra columns as plain payload in the final
>> crosstab result. For greater flexibility, you really want to use the
>> crosstab(text sql, text sql) form of the function. This allows you to
>> specify the set of columns that should appear via the 2nd query. When
>> processing the results, it makes fewer naive assumptions about the
>> presence of data for every column, so it correctly handles missing data
>> (and, I assume, data for columns that didn't appear in the 2nd query).
>> This eliminates the need for left joins to get empty rows, though you do
>> still need to use aggregates to combine multiple rows into one row.
>>
>> Your 2nd query would be a query that simply returns the distinct set of
>> key names from both of your key/value tables. You want them in a
>> consistent and repeatable order, so order by id_name:
>>
>>     select id_name from (
>>         select distinct m.id_name from metadata_key m
>>         union
>>         select distinct d.id_name from metric_def d
>>     ) q order by id_name
>>
>> The nested union query is required based on what I can glean of your
>> schema from your examples, since you are getting keys from multiple
>> tables and we need them all in a single column in a consistent order.
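
Here is that tall row_name | key | value shape end-to-end against a made-up
metrics table, so I can see the two-query form handle duplicates and missing
data (requires the tablefunc extension; all the names are invented):

    create extension if not exists tablefunc;

    create temp table metrics (row_name text, key text, value numeric);
    insert into metrics values
        ('run_1', 'cpu_time', 41.2),
        ('run_1', 'cpu_time', 43.0),   -- duplicate: collapsed by max()
        ('run_1', 'mem_peak', 512),
        ('run_2', 'cpu_time', 39.8);   -- no mem_peak row for run_2

    select * from crosstab(
        $$ select row_name, key, max(value)
           from metrics group by 1, 2 order by 1, 2 $$,
        $$ select distinct key from metrics order by 1 $$
    ) as ct(row_name text, cpu_time numeric, mem_peak numeric);

    -- run_2's mem_peak comes back null; wrap in coalesce() if you need 0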
>> Now you just need a query that returns:
>>
>>     row_name | extra_col1 | extra_col2 | key | max(value)
>>
>> extra_col1 and extra_col2 are going to be the begin_time and end_time
>> from your test table, but you are going to have to do the union thing
>> again in order to extract results from multiple tables into a single
>> column, and you'll have to do the subquery thing in order to get the
>> rows in consistent order again:
>>
>>     select row_name, begin_time, end_time, category, value from
>>     (
>>         select t.id_name as row_name, max(t.begin_time) as begin_time,
>>                max(t.end_time) as end_time,
>>                m.id_name as category, max(v.value_end) as value
>>         from test t, metadata_key m, test_variables v
>>         where v.test_id = t.id and m.id = v.metadata_key_id
>>         group by 1, 4
>>
>>         union all  -- use 'union all' to prevent the database from
>>                    -- trying to de-dupe rows
>>
>>         select t.id_name as row_name, max(t.begin_time) as begin_time,
>>                max(t.end_time) as end_time,
>>                d.id_name as category, max(r.value_end) as value
>>         from test t, test_results r, metric_def d
>>         where r.test_id = t.id and d.id = r.metric_def_id
>>         group by 1, 4
>>     ) q
>>     order by row_name, category
>>
>> order by is important, as the crosstab processor is naive - it processes
>> a row until a new row_name is found, then moves on. If it encounters the
>> same row_name later on, it will create a new row with the same name
>> rather than going back and adding columns to the existing row. I don't
>> think category order is important in this case, since it is doing
>> category lookup by name rather than position, but all of my example code
>> does order categories too, probably just to be safe.
>>
>> Now there is one final bit of weirdness related to the crosstab
>> functions. You must specify the structure of the returned results when
>> you call it, because it returns a setof record. So you either call it
>> like this:
>>
>>     select * from crosstab(sql1, sql2) as myresult(rowname text,
>>         begin_time timestamp, end_time timestamp, col1 int, col2 int,
>>         col3 int, col4 int, col5 int, ...)
>>
>> or, if you know that you will have exactly 35 columns every single time,
>> you can declare an alias to the function which specifies the structure
>> of the returned records, which will allow you to call it simply as:
>>
>>     select * from crosstab35(sql, sql);
>>
>> The documentation for the tablefunc module
>> ( http://www.postgresql.org/docs/9.1/static/tablefunc.html ) explains
>> how to set up the aliases you might want if you have a consistent column
>> set. If you don't have a consistent column set, you have to first query
>> for the names of all possible columns, then construct your two queries
>> for the crosstab function, along with the 'as' clause, using the names
>> that were returned from your initial call to limit the results in the
>> eventual crosstab execution. In other words, you need to make sure that
>> the 2nd sql statement in the crosstab call cannot possibly return more
>> columns than the set of columns you define in the 'as' clause, so you
>> must either execute both queries in a transaction with at least
>> REPEATABLE READ isolation semantics, or else include a where clause in
>> your column query that limits the columns returned to the same set of
>> columns you got in the first query - select column_name from table where
>> column_name in (...).
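
The alias idea sounded most useful to me, so here is a sketch following the
pattern in the tablefunc docs, with an invented 3-column row type (a
35-column version would be the same, just longer). The docs name
'crosstab_hash' as the C function behind the two-query form:

    -- define the output row shape once...
    create type pivot3 as (
        row_name text,
        a1 numeric,
        a2 numeric,
        a3 numeric
    );

    -- ...then bind an alias to the underlying C implementation ('3x' to
    -- avoid colliding with the built-in single-query crosstab3)
    create or replace function crosstab3x(text, text)
        returns setof pivot3
        as '$libdir/tablefunc', 'crosstab_hash'
        language c stable strict;

    -- now callable without the long 'as' clause:
    -- select * from crosstab3x(data_sql, category_sql);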
>> Basically, the flow looks like this:
>>
>>   1. select all possible column names from the database.
>>   2. build the data query for the crosstab func.
>>   3. build the column query for the crosstab func (should be the same or
>>      similar to the earlier query for column names).
>>   4. build the 'as' clause for the crosstab func - I just name my
>>      columns a1, a2, a3, a4, etc., knowing that they will be in the same
>>      order as my column names, since my column names are often not
>>      strings that would be legal column names in a query.
>>   5. execute the query: "select * from crosstab(" + data_query + "," +
>>      column_query + ")" + as_clause
>>   6. process the crosstab result.
>>
>> I'll admit, it's a little difficult to set up correctly the first time
>> you do it, but execution is actually pretty efficient, though it is only
>> doing pretty much exactly what you'd do if you were to pivot the data
>> manually while iterating over the resultset for the data query. If
>> you're working in a dynamically typed language that lets you easily
>> write code with efficient hash maps and the like, and you only have this
>> one use case, it might actually be easier just to query for the raw data
>> and pivot it yourself, where you'll have much more flexibility in
>> dealing with column names and such. But if you have a wide variety of
>> uses, it's probably worthwhile to become familiar with the crosstab
>> functionality, since it is pretty simple to use once you get the hang of
>> it. I've got fairly boilerplate code that deals with things like
>> generating the record structure, so I can usually add a new bit of
>> crosstab functionality to an app pretty darn quickly these days.
>>
>> One thing that's a drag is that the queries are just strings as far as
>> the database is concerned, so depending upon your client library, you
>> may not be able to use variable-escaping best practices. Basically, jdbc
>> won't allow you to use prepared statement variables within a string, so
>> you have to manually escape any user-provided input while constructing
>> the query as a raw string, rather than using '?' in your query. The same
>> goes for trying to slip a crosstab query through hibernate's SQLQuery
>> interface (I thought that might be a workaround, but it wasn't), so you
>> are forced to construct the entire query string manually.
>>
>> --sam
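
To make that flow concrete for myself, I tried a server-side version in
PL/pgSQL against the same made-up metrics table from above. The function
name, the a1..aN column naming, and the use of format(%L) for quoting are
all just my guesses at one reasonable way to wire this up:

    create or replace function pivot_metrics() returns refcursor
    language plpgsql as $fn$
    declare
        n         int;
        as_clause text := 'row_name text';
        cur       refcursor := 'pivot_cur';
    begin
        -- step 1: count the distinct keys, i.e. the eventual value columns
        select count(distinct key) into n from metrics;

        -- step 4: name the value columns a1..aN, since real metric names
        -- are often not legal SQL column names
        for i in 1..n loop
            as_clause := as_clause || format(', a%s numeric', i);
        end loop;

        -- steps 2, 3 and 5: build and run the crosstab; format's %L embeds
        -- the two inner queries as properly quoted string literals
        open cur for execute format(
            'select * from crosstab(%L, %L) as ct(%s)',
            'select row_name, key, max(value)
             from metrics group by 1, 2 order by 1, 2',
            'select distinct key from metrics order by 1',
            as_clause);
        return cur;
    end
    $fn$;

    -- usage; repeatable read keeps the column count and the column query
    -- from racing new keys, per Sam's isolation note
    begin isolation level repeatable read;
    select pivot_metrics();
    fetch all in pivot_cur;
    commit;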