On 1/16/17 10:09 PM, Haribabu Kommi wrote:
Yes, that's correct. Currently, with this approach, it is not possible to
eliminate the heap completely. This approach is useful for cases where the user
wants to store only some columns as part of the clustered index.
Ahh, that's unfortunate. Billion-row+ tables are becoming rather common,
and that 24GB of overhead (roughly 24 bytes of tuple header per row) starts
becoming very painful. It's actually a lot worse considering there will be
at least one index on the table, so 100GB+ of overhead isn't that uncommon.
Another complication is that one of the big advantages of a CSTORE
is allowing analysis to be done efficiently on a column-by-column
(as opposed to row-by-row) basis. Does your patch by chance provide that?
Not the base patch that I shared, but the further patches provide
column-by-column access using the custom plan methods.
Great, that's something else that a column store really needs to be
successful. Something else I suspect is necessary is a faster/better way
to eliminate chunks of rows from scans.
Just as an example, with my simple array-based approach, you can store a
range type along with each array that contains the min and max values
for the array. That means any query that wants values between 50 and 100
can include a clause that filters on range types that overlap with
[50,100]. That can be indexed very efficiently and is fast to run checks
against.
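To make that concrete, here's a minimal sketch (all table and column names
are hypothetical, not from any posted patch):

-- Each row stores one chunk of values plus a range covering its min/max.
CREATE TABLE chunks (
    chunk_id  bigserial PRIMARY KEY,
    vals      int[] NOT NULL,
    val_range int4range NOT NULL  -- int4range(min(vals), max(vals), '[]')
);

CREATE INDEX ON chunks USING gist (val_range);

-- Chunks whose range doesn't overlap [50,100] never get unnested;
-- the GiST index answers the && (overlaps) test cheaply.
SELECT v
FROM chunks, LATERAL unnest(vals) AS v
WHERE val_range && int4range(50, 100, '[]')
  AND v BETWEEN 50 AND 100;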
Generally speaking, I do think the idea of adding support for this
as an "index" is a really good starting point, since that part of
... as discussed elsewhere in the thread, adding a bunch of hooks is
probably not a good way to do this. :/
That would be a great way to gain knowledge on what users would want
to see in a column store, something else I suspect we need. It would
also be far less code than what you or Alvaro are proposing. When it
comes to large changes that don't have crystal-clear requirements, I
think that's really important.
The main use case of this patch is to support mixed-workload environments,
where both OLTP and OLAP queries are possible. The advantage of the
proposed patch design is that it provides good performance for OLAP
queries without affecting OLTP.
Yeah, that's a big part of what I was envisioning with my array-based
approach. In simple terms, there would be a regular row-based table, and
an array-based table, with a view that allows seamless querying into
both (re-presenting the array-storage on a per-row basis). There would
be a periodic process that moves entire sets of rows from the row
storage into the array storage.
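A rough sketch of that layout (again, all names are hypothetical):

CREATE TABLE events_row (
    id  bigint NOT NULL,
    ts  timestamptz NOT NULL,
    val int NOT NULL
);

-- Cold data packed into parallel arrays, one chunk per row, with a
-- range column summarizing each chunk's timestamps.
CREATE TABLE events_array (
    ids      bigint[] NOT NULL,
    tss      timestamptz[] NOT NULL,
    vals     int[] NOT NULL,
    ts_range tstzrange NOT NULL
);

-- The view re-presents the array storage on a per-row basis, so queries
-- see one seamless relation.
CREATE VIEW events AS
    SELECT id, ts, val FROM events_row
    UNION ALL
    SELECT u.id, u.ts, u.val
    FROM events_array,
         LATERAL unnest(ids, tss, vals) AS u(id, ts, val);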
If you updated or deleted a row that was part of an array, the contents
of the entire array could be moved back into row-based storage. After a
period of time, rows would get moved back into array storage. Or the
array could be modified in place, but you need to be very careful about
bloating the array storage if you do that.
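The freeze/thaw steps could be plain DML; a sketch building on the
hypothetical tables above:

-- "Freeze": move a batch of old rows into one array chunk.
WITH moved AS (
    DELETE FROM events_row
    WHERE ts < now() - interval '1 day'
    RETURNING id, ts, val
)
INSERT INTO events_array (ids, tss, vals, ts_range)
SELECT array_agg(id ORDER BY ts),
       array_agg(ts ORDER BY ts),
       array_agg(val ORDER BY ts),
       tstzrange(min(ts), max(ts), '[]')
FROM moved
HAVING count(*) > 0;

-- "Thaw": when an UPDATE/DELETE touches a chunk, explode the whole
-- chunk back into row storage first (here, the chunk holding id 42).
WITH thawed AS (
    DELETE FROM events_array
    WHERE ids && ARRAY[42::bigint]
    RETURNING ids, tss, vals
)
INSERT INTO events_row (id, ts, val)
SELECT u.id, u.ts, u.val
FROM thawed, LATERAL unnest(ids, tss, vals) AS u(id, ts, val);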
The big missing piece here is getting the planner to intelligently
handle a mixed row/column store. As I mentioned, you can easily add
range type fields to greatly increase performance, but they won't do any
good unless the appropriate filters get added. It's not THAT hard to do
that by hand, but it'd be great if there was a more automated method.
Such a method might also be very useful for transforming expressions
like date_part('quarter', ...) into something that could use existing
indexes.
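For instance, adding the filter by hand might look like this (using the
hypothetical ts_range column from the sketch above):

-- Naive form; the per-chunk ranges can't help:
--   SELECT count(*) FROM events
--   WHERE date_part('quarter', ts) = 2 AND date_part('year', ts) = 2017;

-- Hand-rewritten form; non-overlapping chunks are eliminated via the
-- && test before anything is unnested:
SELECT count(*)
FROM events_array, LATERAL unnest(tss) AS ts
WHERE ts_range && tstzrange('2017-04-01', '2017-07-01')
  AND ts >= '2017-04-01' AND ts < '2017-07-01';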
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com