Re: [GENERAL] looking for a globally unique row ID

2017-09-14 Thread Karl Czajkowski
On Sep 14, Rafal Pietrak modulated:

> My main objective is the "guarantee". Which I've tried to indicate
> referring to "future UPDATEs".
> 

With a well-behaved application, it is sufficient to define each
ID column as:

   id int8 UNIQUE NOT NULL DEFAULT nextval('sharedsequence')

and ensure that you always use defaults for INSERT and never
set the column during UPDATE.

You can also use a BEFORE/FOR EACH ROW trigger to enforce these
conventions, if you are worried about accidental violations in your
SQL.  In that case, leave the DEFAULT as NULL in the table definition,
but make the trigger do this for INSERT:

   NEW.id := nextval('sharedsequence');

and this enforcement check for UPDATE:

   IF OLD.id != NEW.id THEN RAISE EXCEPTION 'id is immutable'; END IF;

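Putting those pieces together, a minimal sketch of such a trigger
might look like this (the table name "mytable" is just a placeholder):

   CREATE FUNCTION enforce_shared_id() RETURNS trigger AS $$
   BEGIN
     IF TG_OP = 'INSERT' THEN
       NEW.id := nextval('sharedsequence');
     ELSIF OLD.id != NEW.id THEN
       RAISE EXCEPTION 'id is immutable';
     END IF;
     RETURN NEW;
   END;
   $$ LANGUAGE plpgsql;

   CREATE TRIGGER enforce_shared_id
     BEFORE INSERT OR UPDATE ON mytable
     FOR EACH ROW EXECUTE PROCEDURE enforce_shared_id();
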
If that's not a strong enough guarantee, I'm not sure I understand
your problem statement...


Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] looking for a globally unique row ID

2017-09-14 Thread Karl Czajkowski
On Sep 14, vinny modulated:

> If it is only one database, on one server, then couldn't you just
> use one sequence?
> If you prefix the value with some identifier of the current table
> then you cannot get duplicates
> across tables even if you reset the sequence.
> 

I didn't follow the whole thread, so I apologize if I'm repeating
earlier suggestions.

We use a shared sequence to issue the new identifiers, and in fact
limited the sequence to 43 bits so they can serialize as JSON numbers
for the benefit of our clients.  We disabled wrapping, so it will fail
if we exhaust the range.
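
For example, the shared sequence could be declared along these lines
(the name and the bound are illustrative):

   CREATE SEQUENCE sharedsequence
     MINVALUE 1
     MAXVALUE 8796093022207  -- 2^43 - 1
     NO CYCLE;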

If you rapidly churn through identifiers and could envision exhausting
64 bits in your database's lifetime, you should probably just use
UUIDs instead of a sequence.  A timestamp-based UUID still has
reasonable sorting and indexing properties.
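
If you go the UUID route, one possibility (assuming the uuid-ossp
extension is available) is roughly:

   CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

   CREATE TABLE example (
     id uuid PRIMARY KEY DEFAULT uuid_generate_v1(),  -- time-based UUID
     payload text
   );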

To "guarantee" uniqueness with a shared sequence or UUID generator,
you can use a trigger to prevent override of identifiers from SQL. As
long as you always use the correct value generator during INSERT and
disallow mutation of identifiers during UPDATE, the rows will not
share identifiers.


Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] PostgreSQL with PowerBuilder, and Identity keys (serials)

2017-08-05 Thread Karl Czajkowski

>   Select currval('GEN_')
>

From the above, I am assuming you did something like:

   CREATE SEQUENCE "GEN_" ...;

and are trying to access this sequence?  If so, you actually have to
include the SQL quoted identifier syntax within the text argument to
currval() or nextval(), e.g.

   SELECT nextval('"GEN_"');

With these sorts of identifier-as-argument parameters in Postgres, you
can also include schema-qualification syntax:

   SELECT nextval('"My Schema"."GEN_"');


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] jsonb case insensitive search

2017-06-01 Thread Karl Czajkowski
On Jun 01, armand pirvu modulated:

> Overall could it be that the optimizer blatantly ignores a scan index which 
> is cheaper than a table scan, or jsonb implementation still has a long way to 
> come up  or the way it is used in my case is not the one designed for ?
> 

If I remember correctly, isn't a compound index always just using
btree?  In general, I have found better luck using several smaller
btree indices than one large compound one.  Unless your entire query
can be answered from an index-only lookup, the extra columns just
bloat the btree index.

So, you might as well use a simpler compound index for the regular
scalar row keys, and this index will be much smaller without the
baggage of the jsonb values at its leaves.  The planner can use the
jsonb from the actual candidate rows if it is going to have to visit
them anyway for other WHERE or SELECT clauses.

If the sparseness of your query is due to the content within the jsonb
values rather than the other scalar row keys, I think you'd need some
kind of GIN index over the contents of the jsonb documents to find the
small subset of candidate rows by these sparse criteria.  Trigram is
just one example of a GIN indexing scheme.

If your jsonb documents are "flat", i.e. just a bag of key value pairs
and not arbitrary nested jsonb structures, you might also explode them
into arrays of keys or values as separate indexed expressions?  Then,
you could GIN index the arrays and quickly find the subset of rows with
certain unusual keys or unusual values, but would still have to follow
up with a more exact check for the combination of key and value.
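
A rough sketch of that last idea, with hypothetical table, column,
and function names:

   -- expose the top-level jsonb keys as an indexable text array
   CREATE FUNCTION jsonb_top_keys(j jsonb) RETURNS text[]
   IMMUTABLE LANGUAGE sql AS $$
     SELECT array_agg(k) FROM jsonb_object_keys(j) AS t(k)
   $$;

   CREATE INDEX docs_keys_gin ON docs USING gin (jsonb_top_keys(doc));

   -- find candidate rows by a rare key, then re-check the exact pair
   SELECT * FROM docs
   WHERE jsonb_top_keys(doc) @> ARRAY['rare_key']
     AND doc ->> 'rare_key' = 'some value';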


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] jsonb case insensitive search

2017-06-01 Thread Karl Czajkowski
On May 31, armand pirvu modulated:

> The idea is that I would like to avoid having an index for each key
> possibly wanted to search, but rather say have the whole record then
> search by the key and get the key value, thus having one index serving
> multiple purposes so to speak
> 

First, benchmarking would be important to figure out if any proposed
indexing actually speeds up the kinds of queries you want to perform.
With the recently added parallel query features, a simpler indexing
scheme with some brute-force search might be adequate?

But, you could use a search idiom like this:

 (lower(json_column::text)::jsonb) -> lower('key') = '"value"'::jsonb

This will down-convert the case on all values and keys.  The left-hand
parenthetic expression could be precomputed in an expression index to
avoid repeated case conversion. But, typical searches will still have
to scan the whole index to perform the projection and match the final
value tests on the right-hand side.

If you want to do things like substring matching on field values, you
might stick with text and using regexp matches:

 (lower(json_column::text)) ~ 'valuepattern'

or more structural searches:

 (lower(json_column::text)) ~ '"key": "[^"]*substring[^"]*"'

Here, the left-hand expression could be trigram indexed to help with
sparse, substring matching without a full index scan.  We've had good
luck using trigram indexing with regexp matching, though I've honestly
never used it for the purpose sketched above...
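
For example, a trigram expression index over the down-cased document
text could be built like this (assuming the pg_trgm extension is
available; table and column names are hypothetical):

   CREATE EXTENSION IF NOT EXISTS pg_trgm;

   CREATE INDEX docs_doc_lower_trgm ON docs
     USING gin (lower(doc::text) gin_trgm_ops);

   -- regexp matches over the same expression can then use the index
   SELECT * FROM docs
   WHERE lower(doc::text) ~ '"key": "[^"]*substring[^"]*"';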

Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] import CSV file to a table

2017-03-08 Thread Karl Czajkowski
On Mar 08, Rob Sargent modulated:

> Yes Karl, I agree.  I admitted as much.  But if it's clean, as in
> free of quoted commas, life is much more simple.  I've lost sight of
> whether or not the OP knows his situation w.r.t. to this.  The awk
> line will tell him and for a one-off load this can make a world of
> difference in complexity - two bash lines and a COPY.
> 

Maybe I didn't understand your awk? I thought it was counting commas
in lines.  This isn't the same as counting commas in records.

this,is,record,one
"this,,","is
,,record","two
,,,"

this has three commas on each line and definitely is not suitable
for naive CSV handling.


Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] import CSV file to a table

2017-03-08 Thread Karl Czajkowski
I believe that in its full glory, you cannot reliably locate CSV
record boundaries except by parsing each field in order including
quote processing.  Individual records may have arbitrary numbers of
field and record separator characters within the values.

Karl


On Mar 08, Rob Sargent modulated:
> Since bash has been bandied about in this thread I presume awk is
> available.  Here's how I would check just how 'csv'ish the incoming
> file is.
> ...


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] import CSV file to a table

2017-03-08 Thread Karl Czajkowski
On Mar 08, John McKown modulated:
...
> ​I agree. I went with a "pure BASH" approach because it is what the
> user asked for & I wasn't sure what language she might be comfortable
> with. I use PERL a lot. Or maybe I should say that I abuse PERL a lot.
> Such as a PERL script which writes out another PERL script, based on
> some input files & parameters, then runs the just written PERL script,
> which does the load into a PostgreSQL database (multiple tables). Ya, a
> bit perverted.​
> 

Well, you could follow a similar meta-programming/code-generating
pattern to have the BASH script output a single SQL file to run with
psql. You could even generate PL/pgsql code to defer more data
processing to the database itself.

I think the only robust "pure BASH" approach is to use a temporary
table, so you aren't trying to parse CSV content in BASH. Using csvkit
sounds good if you can introduce these third-party dependencies.

With the temporary table, you can use SQL for most validation or data
interrogation, but you need to know at least enough schema information
in advance to form the COPY statement. Parsing the CSV header row to
plan your work puts you right back to requiring a robust CSV parser
unless you can constrain your input scenarios to only handle very
trivial headers.

If you play games with a defaulting serial column and fixed column
names like "id, c1, c2, ..., cN" for the temporary table, you might
use the id column as a DB-assigned "row number" during COPY and
validation. In this case, you could even tell Postgres there is no
header, and then let it parse the header as another data record so you
can use SQL statements to determine the actual header names and
ordering in the input.  But this still requires knowing the column
count in advance of the COPY.
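
A rough sketch of that game, assuming a hypothetical three-column
input file and psql's \copy:

   CREATE TEMPORARY TABLE csv_load (
     id serial,   -- DB-assigned row number
     c1 text,
     c2 text,
     c3 text
   );

   -- claim there is no header, so the header row is loaded as data too
   \copy csv_load (c1, c2, c3) FROM 'input.csv' WITH (FORMAT csv, HEADER false)

   -- the header row is then the record with the smallest id
   SELECT c1, c2, c3 FROM csv_load ORDER BY id LIMIT 1;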

I also think using something like Python with structured data
processing would be wiser, unless you know enough about the schema in
advance to avoid any CSV parsing on the client side.


Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Re: Are new connection/security features in order, given connection pooling?

2017-01-14 Thread Karl Czajkowski
On Jan 14, Tomas Vondra modulated:
...
> Sure, a lot of systems generate queries on the fly. Also, if the
> main problem is poor plan choice due to RLS, I'm not sure how
> prepared statements could help with that.
> 

Sorry for being unclear, I was trying to respond to too many
sub-topics at once.  The performance comment was just a tangential
aside from my experiments using RLS for web client
authorization. Indeed, I use the current_setting() function in RLS
policy statements to access ephemeral web client context stored via
set_config().
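
Roughly, that pattern looks like this (the parameter, table, and
column names here are hypothetical):

   -- the web layer stamps per-request client context onto the pooled connection
   SELECT set_config('ourapp.client_id', 'some-web-user', false);

   -- an RLS policy then reads it back
   CREATE POLICY client_rows ON orders
     FOR SELECT
     USING (owner = current_setting('ourapp.client_id'));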

In the rest of my message, I was trying to respond to the topic of
being able to "set role" and limit the connection privileges to
reflect a web client's identity for RLS or other fine-grained
authorization. I thought this was the main topic of the thread---the
desire to push down web client authorization decisions into the normal
postgres mechanisms, and all the problems arising from this sort of
approach.

The concern was raised about the risk of the subsequent queries being
able to reverse the "set role" to subvert such authorization. Others
in the thread provided the common advice of parametric queries to
prevent query injection, but I think you need more than that.  I think
one would like a stronger isolation between the connection
setup/security config layer and the fancier (and possibly flawed)
application layer.  We struggle with this because all these mechanisms
are multiplexed as commands in one SQL connection, instead of having
an in-band versus out-of-band protocol layering available to the web
application, so the more risky data-dependent SQL can flow in-band
while the basic request-handler lifecycle could configure the security
context out-of-band.


> That might be an interesting feature, but it's also significantly
> more complex than the topic of implementing a safe context for
> secrets, making RLS less problematic with connection pools.
> 

I was thinking that this mechanism would support the usual shared
connection pool to support all web users.  The web server would
repeatedly redefine the restricted security context and execute
application queries under that context.  The pooling problems come
from the suggestion to open connections w/ different per-user
authentication and then not have any user-switching that can easily
reconfigure the connection, right?


karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Re: Are new connection/security features in order, given connection pooling?

2017-01-11 Thread Karl Czajkowski
I can relate to the original plea from my own exploration of this
topic.

Before I get into that, I will mention as an aside that to date we
have found RLS to be really slow for web client authorization, given
that we have to use session parameters to store web client context and
there doesn't seem to be a way to get the query planner to treat these
settings as constants nor to lift static parts of RLS policy out of
per-row loops.  It's so slow that I wonder whether future releases
will improve it before we are forced to move authorization back into
the client-side application logic, making our generated SQL much more
complex but better tuned to the query planner's capabilities.

As background, our web applications essentially compile new SQL
queries on the fly and send them through to PostgreSQL.  There is a
lot of metaprogramming involved in generating SQL statements to embody
the work we want done on behalf of our clients, and this doesn't fit
well into traditional prepared statements etc.

The web service needs to assert the client context and the
client-level statements we generate and execute should not be able to
override this. The service logic to determine and assert client
context (once per web request) is dramatically simpler than the
service logic producing all the client-level SQL statements, and it
would be nice to have a restricted execution context to isolate the
two kinds.  We also like the idea that policy enforcement mechanisms
could be applied orthogonally to the generated client-level SQL
statements, as it feels like a better defense-in-depth architecture
and is also easier to reason about. To do so in our service logic
would mean replicating more and more of the query parsing and engine
to do general query rewriting.

So, it would be nice to have a tunneling mechanism where I can
distinguish the control-plane operations I am performing from the
application operations I am translating and pushing down to PostgreSQL
on behalf of the web client. In my case, I might want to perform a
mixture of service-level and client-level statements within the same
transaction. The client-level statements cannot control transactions.

I could imagine something like a two-level feature set. At the
top-level in the connection, we can statefully manipulate our security
contexts, set configurable privilege masks for the second-level
execution context, set effective roles (reversibly, as today), manage
transactions, etc.  With some kind of nested execution block, we could
submit less trusted statements to run within the second-level
execution context:

   EXECUTE RESTRICTED $guard$ app_query... $guard$ ;

This would take the guarded query string, restart a safe parser on it,
and only on successful parse go forward with planning and executing it
in the restricted mode that has been configured for the connection.

Eventually, I could see wanting more resource management controls on
this restricted context too, i.e. setting limits of CPU/IO/RAM
consumption or execution time. Both planning time limits (reject based
on estimates) and runtime (abort query if limit is reached).


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] resolution order for foreign key actions?

2016-11-08 Thread Karl Czajkowski
On Nov 08, David G. Johnston modulated:
...
> ON DELETE starts with the "one" side of a one-to-many relationship​. 
> When deleting the one row the question is what should be done with the
> many rows which refer to it.  If ALL of the many rows agree to be
> deleted then the one row in question can go away and no error is
> raised.  If ANY of the many rows refuse to die then the one row in
> question must remain in order to maintain referential integrity - thus
> an error will be raised.
> 

I think I had the same intuition going into this.

However, I am testing with an artificial scenario to focus on the
ordering/precedence behavior. I was even hoping PostgreSQL might raise
an error when I created apparently conflicting constraints, but
unfortunately  it does something much more confusing...

You might consider this to explore what happens if someone
accidentally redefines constraints with conflicting actions.  I just
redefine the same constraint with only varying constraint name and ON
DELETE clause.

Here, I have assigned constraint names to demonstrate that the rules
are NOT applied based on a lexicographic sort of constraint names but
rather on order of definition (perhaps there is another
internally-generated name that sorts in order of definition?):

==
ALTER TABLE refs ADD CONSTRAINT z FOREIGN KEY (t_id) REFERENCES targets (id) ON DELETE NO ACTION;
ALTER TABLE
ALTER TABLE refs ADD CONSTRAINT y FOREIGN KEY (t_id) REFERENCES targets (id) ON DELETE SET NULL;
ALTER TABLE
ALTER TABLE refs ADD CONSTRAINT x FOREIGN KEY (t_id) REFERENCES targets (id) ON DELETE CASCADE;
ALTER TABLE

This test will show that ON DELETE NO ACTION is in effect due to constraint z.

 Table "public.refs"
 Column |  Type   | Modifiers
--------+---------+----------------------------------------------------
 id     | integer | not null default nextval('refs_id_seq'::regclass)
 t_id   | integer | not null
Indexes:
"refs_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"x" FOREIGN KEY (t_id) REFERENCES targets(id) ON DELETE CASCADE
"y" FOREIGN KEY (t_id) REFERENCES targets(id) ON DELETE SET NULL
"z" FOREIGN KEY (t_id) REFERENCES targets(id)

-
Content of refs table before deletion of target:
SELECT * FROM refs;
 id | t_id 
----+------
  1 |    1
  2 |    2
  3 |    3
(3 rows)

-
Attempting to delete a target:
DELETE FROM targets WHERE name = 'foo1' RETURNING *;
ERROR:  update or delete on table "targets" violates foreign key constraint "z" on table "refs"
DETAIL:  Key (id)=(2) is still referenced from table "refs".


The attached BASH script will perform a sequence of tests defining the
constraints in different orders and showing the results.  The excerpt
above is from the first test scenario.

It accepts optional arguments which are passed to 'psql' and can run
with no arguments if you can talk to your default DB with 'psql'
absent of any arguments, i.e. with Unix domain socket authentication.
It only creates and destroys tables public.targets and public.refs...


Thanks,

Karl



fkr-test.sh
Description: Bourne shell script

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] resolution order for foreign key actions?

2016-11-08 Thread Karl Czajkowski
Hi,

Is there a formal definition for the order in which constraint actions
(i.e. the ON DELETE or ON UPDATE rules) are applied when there are
multiple overlapping/relevant constraints?

I have struggled to find an answer in the manual, but my experiments
suggest that they are interpreted in the order in which the
constraints were defined and the first rule in this order is applied
while subsequent rules are ignored.  This can be very confusing if one
rule says CASCADE and another NO ACTION, and you need to understand
this order of definition to know whether a delete will cascade or
raise an error.

Is there a definitive way to introspect the information_schema or
pg_catalog to determine which behaviors will effectively apply to a
given "DELETE FROM ..." or "UPDATE ..." statement?

Thanks,


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] parallel query and row-level security?

2016-10-28 Thread Karl Czajkowski
On Oct 27, Karl Czajkowski modulated:
> Hi, I am struggling to find information on how the new parallel query
> features intersect with row-level security.  I am wondering whether
> this is expected to work at this time?
> 

Further experimentation shows that I can get parallel plans with
row-level security, if I remove our ugly policy with a scalar subquery
and instead denormalize the decision into another column of the same
table (i.e. one managed by a trigger).

However, our policies are still adding huge overhead.  If I lie and
say our stored procedure is IMMUTABLE rather than STABLE, the
optimizer lifts our first (row-independent) policy check out of the
per-row filter and the query is 10x faster. But it seems to execute it
per-row when it is marked as STABLE, even though the function takes no
arguments and so does not vary by row.

But, our procedure calls current_setting which has STABLE volatility.
Does the session/connection cache plans?  Would declaring our function
as IMMUTABLE accidentally allow reuse of (stale) plans across
boundaries where we have modified the session parameter? My
conservative reading of the docs suggests it might, which would make
this cheat unsafe for us...


Thanks,

Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] parallel query and row-level security?

2016-10-27 Thread Karl Czajkowski
Hi, I am struggling to find information on how the new parallel query
features intersect with row-level security.  I am wondering whether
this is expected to work at this time?

My cursory experiments indicate that we can get a parallel plan when
querying as the table owner or superuser but not when querying as a
role that is subject to the row-level security policies on the same
table.

Our policies on the table are something like:
   
   CREATE POLICY "p1" ON schema1.table1
  FOR SELECT
  USING (ARRAY['x', 'y'] && schema1.our_procedure());

   CREATE POLICY "p2" ON schema1.table1
  FOR SELECT
  USING ((SELECT True FROM schema2.foo_authz p 
  WHERE p.sec_class = table1.sec_class
  LIMIT 1));

We've defined our_procedure as STABLE PARALLEL SAFE using plpgsql and
it makes a call to current_setting().  Should we ever expect to see
parallel query with policies such as these?  Are there any documented
limits where we might adjust our policy and gain parallel plans?

Thanks,


Karl

p.s. In case you are wondering about these odd policies, we are using
them to inject web client authorization decisions into our data access
layer...

We are running under a single application postgresql role and our app
passes web client authentication attributes as a session parameter
which we look up with current_setting() and convert back into an array
to intersect with the ACL content ['x', 'y'] in the first policy
above.
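
For reference, our_procedure is roughly of this shape (simplified;
the session parameter name is illustrative):

   CREATE FUNCTION schema1.our_procedure() RETURNS text[]
   STABLE PARALLEL SAFE LANGUAGE plpgsql AS $$
   BEGIN
     RETURN (
       SELECT array_agg(value)
       FROM json_array_elements_text(current_setting('ourapp.attributes')::json)
     );
   EXCEPTION WHEN OTHERS THEN
     RETURN NULL::text[];
   END;
   $$;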

The second policy permits access to rows in a certain security class
managed by dynamic content in another mapping table (which of course
has limits as to how the web application will modify it).  This allows
some subset of rows to be exposed to less trusted clients, while the
first policy reveals all rows to a more trusted group of clients.

We also have some tables where the ACL is stored in a column and
intersected for each row, but that doesn't exist in this first case
where I tried to get a parallel plan.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Is there a way to fix this ugliness

2016-09-09 Thread Karl Czajkowski
On Sep 10, Tim Uckun modulated:
> I am trying to get the child elements of a one to many table to be
> rolled up into a json field in the parent table. The query I am running
> is...

The problem is aggregating over the results of the left-outer join,
which introduces NULLs. You can try pushing that down into a sub-query
to create one image row per observation prior to joining:

  SELECT 
ob.id, 
im.images
  FROM observations ob
  LEFT OUTER JOIN (
SELECT 
  observation_id, 
  json_agg(row_to_json(im.*)) AS images
FROM images im
GROUP BY observation_id
  ) im ON (ob.id = im.observation_id) ;

you might use COALESCE in the top-level SELECT if you want to replace
any NULL im.images with a different empty value constant...
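
For example, to substitute an empty JSON array:

  SELECT 
    ob.id, 
    COALESCE(im.images, '[]'::json) AS images
  FROM observations ob
  LEFT OUTER JOIN (
    SELECT 
      observation_id, 
      json_agg(row_to_json(im.*)) AS images
    FROM images im
    GROUP BY observation_id
  ) im ON (ob.id = im.observation_id);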


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Memory usage per session

2016-07-08 Thread Karl Czajkowski
On Jul 08, John McKown modulated:
...
> I think the "problem" that he is having is fixable only by changing how
> PostgreSQL itself works. His problem is a PL/pgSQL function which is
> 11K lines in length. When invoked, this function is "compiled" into a
> large tokenized parse tree. This parse tree is only usable in the
> session which invoked the the function. Apparently this parse tree
> takes a lot of memory.

Right.  I'm not entirely sure the original poster wants to hear
practical solutions, but I think there are three ways that someone
familiar with Postgres would avoid this type of problem:

1. Connection pooling. Set the max connection limit and other tuning
   parameters appropriately for your workload and available
   resources. Don't get into a state as was described (thousands of
   open connections and only hundreds "active"). Make your client
   applications more disciplined.

2. Buy more RAM.  You can easily grow to 512GB in just basic
   dual-socket servers these days.  This hardware cost may well 
   be worth it to avoid human labor costs.

3. Rewrite or refactor such complex stored procedures in a different
   programming language such as C or Python, so your PL/pgsql stored
   procedures remain small glue around libraries of code.  Postgres
   makes it very trivial to extend the system with such procedural
   libraries.
   

Karl


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow query when the select list is big

2016-05-09 Thread Karl Czajkowski
On May 09, Sterpu Victor modulated:
> I tested it now, EXPLAIN ANALYZE is not showing LEFT JOIN at all if
> I don't select from the joined tables.
> Now is clear why the query is so mutch more efficient when I select
> less data.
> 
> Thank you
> 

With so many joins, you may want to experiment with postgresql
parameter tuning.  These parameters in particular can have a
significant impact on the plan choice and execution time:

   work_mem
   effective_cache_size

   from_collapse_limit
   join_collapse_limit

   geqo_threshold
   geqo_effort

Setting these to appropriately large values can make analytic queries
run much faster.  Of course, setting them too high can also make for
very bad plans which cause the DB server to over subscribe its memory
and start swapping...  it requires a bit of reading and a bit of
experimentation to find ideal settings for your environment.
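
For session-level experimentation, something along these lines (the
values are purely illustrative, not recommendations):

   SET work_mem = '256MB';
   SET effective_cache_size = '8GB';
   SET from_collapse_limit = 20;
   SET join_collapse_limit = 20;
   -- then re-run EXPLAIN ANALYZE on the slow query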


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Karl Czajkowski
On Apr 27, Tim van der Linden modulated:
...
> I'm joining three fairly large tables together, and it is slow. The tables 
> are:
> 
> - "reports": 6 million rows
> - "report_drugs": 20 million rows
> - "report_adverses": 20 million rows
> 
...
> All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" 
> columns.
> 
> The query:
> 
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id 
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> pain', 'back pain']) 
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;
> 

I would suggest a few experiments to see how you can modify the plans
available to the optimizer:

1. CREATE INDEX ON report_drugs (drug, rid)
2. CREATE INDEX ON report_adverses (adverse, rid)
3. CREATE INDEX ON reports (id, created)

Re-run EXPLAIN ANALYZE of your query after each of these steps to see
how each one affects planning.  You might also try two variants of the
query at each step, with and without the ORDER BY.

Note, the index column order in (1) and (2) above is very important.


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] RLS policy dump/restore failure due to elided type-casts

2016-04-20 Thread Karl Czajkowski
On Apr 20, Tom Lane modulated:
> Karl Czajkowski <kar...@isi.edu> writes:
> >   CREATE POLICY delete_stuff ON stuff 
> >   FOR DELETE USING ('example attribute value' = ANY ( ((SELECT 
> > current_attributes()))::text[] ));
> 
> Just out of curiosity, why are you doing it like that, and not simply
> 
> USING ('example attribute value' = ANY (current_attributes()))
> 
> It seems like you're going out of your way to complicate matters.
> 
>   regards, tom lane
> 

Going out of my way to complicate matters is my specialty. :-)

I spend more of my time writing programs that generate and transform
SQL from domain-specific languages, rather than writing SQL by
hand. In this case, I think I composed the subquery reflexively
without thinking that there is a more direct idiom for this case...


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] RLS policy dump/restore failure due to elided type-casts

2016-04-20 Thread Karl Czajkowski
Our row level policies require very explicit casts in order to be
accepted by the DB, but those casts are discarded in the dumped policy
statements. Thus, an attempt to reload the dump file fails.  I'm not
sure if the underlying problem is that the cast shouldn't be required
in the first place, or if the normalization applied to the policy
expression is just incorrect.

Below is a trivialized example that shows the problem while removing
complexity found in our real policies.  We are doing this to implement
differentiated policies in a web application, where the web server
stores authenticated web client context into session parameters and
our policies check against those attributes rather than DB roles.

To work around this, we have to maintain our policies in a separate
SQL file, manually edit the dumps, and reapply our working policies.
This is obviously undesirable in the long run, where dumps taken as
periodic backups are not directly usable to reconstruct the DB...


Karl


Create a test database with these contents:

  CREATE FUNCTION current_attributes() RETURNS text[] 
  STABLE AS $$
  BEGIN
RETURN (
  SELECT array_agg(value) 
  FROM json_array_elements_text(current_setting('ourapp.attributes')::json)
);
  EXCEPTION WHEN OTHERS THEN
RETURN NULL::text[];
  END;
  $$ LANGUAGE plpgsql;

  CREATE TABLE stuff (
value text PRIMARY KEY
  );

  CREATE POLICY delete_stuff ON stuff 
  FOR DELETE USING ('example attribute value' = ANY ( ((SELECT current_attributes()))::text[] ));


The output of pg_dump (and similarly the '\d' command in psql) drops
the cast:

  CREATE POLICY delete_stuff ON stuff 
  FOR DELETE TO PUBLIC USING (('example attribute value'::text = ANY (( SELECT 
current_attributes() AS current_attributes;


And this causes an error when executing the dump file on a new database:

  ERROR:  operator does not exist: text = text[]
  HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] row-level security, views and materialized views?

2016-01-11 Thread Karl Czajkowski
Hi, has there been any discussion as to supporting row-level security
policies on materialized views?  I didn't have much luck searching on
this topic. I see that PostgreSQL 9.5 refuses to create a policy on
a materialized view, consistent with the documentation focusing only on
tables.

First off, am I correct in understanding that access to regular views
will experience the same row-level security as if the client posed the
view-definition SQL directly in their query?  I.e. the view is
expanded like a macro, and the row-level security on constituent
tables applied during the query evaluation with respect to the
consuming client security context?

In an ideal but impractical world, I would use materialized views as
strictly a performance optimization, while getting the same row-level
security as described above.  However, I suspect in practice that the
security context of the client who defines or refreshes the view will
be applied instead, while the view is being materialized.  Is that
also correct?

So, in a more practical but secure world, I suppose I would want the
materialized view to work the same as "CREATE TABLE AS ..." so that at
least I could retain security properties from source tables in my
transformed output, and define moral equivalent row-level security
policies on my materialized view. Then, other clients querying my
extract could have the same security applied as if they queried the
source tables, so I don't necessarily leak rows to clients less
trusted than myself.

Thanks for any insights,


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Check old and new tuple in row-level policy?

2015-12-18 Thread Karl Czajkowski
On Dec 18, Stephen Frost modulated:

> I agree that it would be a nice addition, as I've said before.

OK, sorry I misunderstood earlier and thought you were dismissing the
idea as redundant with triggers.  Thanks for your patience!


> ... We certainly won't be adding it into 9.5 and it's getting pretty
> late for 9.6 too, but I'm anxious to see just how RLS is used in the
> field

If we want to start testing with 9.5 and emulate an environment
supporting NEW and OLD in the CHECK policy for UPDATE, can you
recommend how we should do that?  E.g., is there a particular trigger
idiom that would most closely replicate the RLS extension we've been
discussing (so that we'd know that a working policy set + triggers
could be translated to just RLS policies if that feature is added)?

We'd be happy to experiment with these sorts of policies in our
applications.  My intuition is that with this extension, we'd be able
to push down nearly all of our important policy enforcement into
PostgreSQL and guard against a slew of potential application
programming errors (compared to handling all this enforcement in our
application data access code)!


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Check old and new tuple in row-level policy?

2015-12-18 Thread Karl Czajkowski
On Dec 18, Stephen Frost modulated:

> Any UPDATE which requires SELECT rights on the table will require expr1
> to pass AND expr4 (the UPDATE's USING clause) to pass.  This is modeled
> directly off of our existing GRANT/ACL system.  The same is true for the
> other commands.  Note that we explicitly want an independent USING
> clause for expr4 as you may wish to reduce the set of rows which may be
> UPDATE'd to be less than the set which are visible via SELECT.
> 
> I anticipate adding the ability to have "restrictive" policies also, in
> the future.
> 

Right, I think I understand that. However, my argument was (and
remains) that I think the update conditions need to be able to access
OLD and NEW row values to decide whether the existing row security
context allows the new update content.  This update decision is
inherently different from select, insert, and delete decisions.

I don't think that two separate decisions are sufficient:

   1. OLD row can be updated  (the USING condition?)
   2. NEW row is acceptable (the CHECK condition?)

when considering the row lifecycle as having continuity of identity
and security context.  I think that the second decision needs to have
the option to compare OLD and NEW to decide that the new row is an
acceptable transform of the existing row, preserving whatever identity
and/or security context is important in a particular system of
policies.

As I understand this discussion, you are telling me to just use an
update trigger for this.  I can understand that as a workaround given
the current row-security options, but I do think that this kind of
decision is essential to row-security in realistically complex
applications that allow row mutation.  I don't think it is
particularly esoteric to suggest that the existing row security
context should limit possible future configurations of the row by
different parties.  That effect spans security context, row
identities, and regular non-identifying content.

Right now, it seems like the row-security model assumes the only
security context is an "owner" field carrying a user name and that is
never mutated (perhaps by a column-level privilege).  I think security
context can be richer than that, including more abstract roles,
classes, or attributes and having users be able to mutate that context
but only under the control of row-security policy.  It is not as
simple as the context being system managed only, or open to arbitrary
changes by admins and no change by others. Rather, the current context
may grant certain classes of user the ability to make only certain
coherent changes to that context.

Where mutation is different here is that a set of collaborating
parties can interpret one mutable row as a long-lived resource that
has an ongoing identity (or set of identities) and a coherent series
of row states protected by policy.  Without the ability for the OLD
row to constrain how the NEW row develops in an update policy, we lose
that coherence and have a fugue state for the application, with no
ability to trust the identity of content over time.


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Check old and new tuple in row-level policy?

2015-12-17 Thread Karl Czajkowski
I feel that the fundamental question here is about the semantics of
"row" in row security.  The core question is whether UPDATE has a
different semantics with respect to row identity and security context
than a sequence of DELETE, INSERT.

At the moment, the documented solution feels like "tuple security" and
supports an idea of security context encoded into a tuple such that a
policy condition can decide whether its context and data configuration
is consistent with the session context.  This makes sense in a pure
relational calculus, but ignores the practical deviation of SQL where
there are mutable records.

I am hoping for "record security" that supports an idea of security
context encoded into a record such that the context in the existing
record can constrain how the update of the record is formed.  This
includes policies that might differentiate updates to key versus
non-key fields in the record, deciding in application terms whether
the record has an identity and context that carries forward through
the UPDATE or whether it is merely sugar for a DELETE followed by
INSERT.

Ideally, I'd be able to write a policy that has conditions for each
category of operation:

   POLICY FOR SELECT WITH expr1 
   POLICY FOR INSERT WITH expr2
   POLICY FOR DELETE WITH expr3
   POLICY FOR UPDATE WITH expr4

where expr1 would always be used to decide whether the current tuple
is visible to the query engine, expr2 would always be used to validate
new rows, expr3 would always be used to authorize row deletion, and
expr4 would be able to authorize row replacement using NEW and OLD
value comparisons.

An actual SQL UPDATE for a row visible according to expr1 could be
authorized if expr4 allows it *or* if expr3 and expr2 would allow a
DELETE followed by INSERT.  Where this becomes interesting is when a
session context with insufficient privilege to do the DELETE and
INSERT sequence is still allowed to do the UPDATE because their
request fits the narrower confines of the expr4 policy.


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Check old and new tuple in row-level policy?

2015-12-16 Thread Karl Czajkowski
I've been trying to learn more about the row-security policies but
coming up short in my searches.  Was there any consideration to
allowing access to both old and new row tuples in a POLICY ... CHECK
expression?  This idiom I've seen in the lower level rule and trigger
systems seems like it would allow for much more powerful row-security
policies.

The simple illustrations of row-security policy always seem to
consider an 'owner' field in the tuple compared to current_user.  If
you could consult both old and new values, you could generalize to
storing application ACLs in rows and using those ACLs to decide row
access while also ensuring that ACLs cannot be changed in ways
inconsistent with the privilege level of the current user.

For example, if the current user is in the old ACL value, allow them
to modify the ACL otherwise require that the new ACL value be equal to
the old ACL value.  This would allow a user to be given write access
to some columns while restricting others, but on a row-by-row basis.

Right now, as I understand it, you can only compare the old values to
session state in the WITH condition and new values to the session
state in the CHECK condition, but never consider old and new values
simultaneously.  This excludes a wide and useful gray area between
no trust and full trust to amend row content.


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Check old and new tuple in row-level policy?

2015-12-16 Thread Karl Czajkowski
On Dec 16, Stephen Frost modulated:

> There is still a need to refer back to some kind of state that is
> external to the table under consideration to determine what the session
> level access is, no?  Even if the ACLs are in a table somewhere, how do
> you know who the current user is?
> 

Yes, I was assuming session variables of some sort.

Right now, I am looking to use the row-security policies for a web
application. We would just have the application server authenticate as
a service role and set session variables holding the web client's
authentication context.  The WITH and CHECK conditions would compare
row content against these session variables to enforce web client
authorization decisions underneath our application queries.

The row content includes ownership and ACL-like content as well as
other application content subject to special access rules expressed in
terms of the ownership and ACLs. If we authenticated users to the
database, we would want to consult current_user and something like
current_roles (an array of all roles granted to the
current_user). Instead, we'll be using analogous session context
asserted to us by the web service.

We've done previous systems where we compile all the application
policy checks into the SQL queries generated by the application, but I
think it would be more appropriate to split these out and have
generalized enforcement at the database level. It feels like a close
but not perfect fit already.  It's fraught with perile to handle all
the data visibility rules while generating any application-level data
filtering expressions, joins, etc.!

I'm trying to boil out some simple illustrations. Discussing an entire
cohesive system is difficult and probably counter-productive...

The technical idea is to have policies that consider the relationship
between old data, new data, and session context to only allow
particular state transitions for row UPDATE. For INSERT, SELECT, and
DELETE, I think the current policy model is already sufficient.

Just a few possible use cases to illustrate mixed tests of old and new
row values:

  1. Interlocks between record states and supplemental access
 rights. A community might not allow records to be marked readable
 until they have been given a passing QA grade. A subsequent
 consumer might revise it to a failing grade, but not revoke the
 current access rights due to transparency rules.

  2. State-transition rules for specific values. Enforce that regular
 users can only move a workflow state along normal edges,
 while an admin user may be able to intervene and make abnormal
 transitions.  Or, allow users to fill in "missing" data such as
 replacing NULL or other defaults with better values, but only
 administrators can erase data back to NULL states.

  3. Classification systems or other quasi-monotonic permissions
 models where a user may advance the access class of a record in
 one direction, but only special administrators can reverse the
 direction.

 A. A publishing system might make it easy to draft data in
smaller, private groups but once published it is hard to
retract things from the public record.

 B. Confidentiality systems might do the opposite, allowing things
to be flagged as sensitive and locked down more easily than
relaxing access restrictions.

 C. Community-based delegation systems might make it easy to
"share" records with additional consumers by adding to an ACL
but only the more privileged owner of the row can remove
entries from the ACL to "unshare".

  4. Custody or provenance records.  Certain unusual state-transitions
 of data values may only be allowed if an explanatory record is
 appended to a small log array stored in the row.

I think that there is significant overlap between authorization, state
transition models, and data integrity constraints once you start
considering collaborative applications with mutable records.

The next big leap beyond considering NEW and OLD values during
condition checks would be to use scalar subqueries to examine the row
within the context of other existing rows in the same or different
tables.  I have not looked to see if this is possible in the current
policy system, but I imagine we would try hard to avoid doing this due
to performance implications, even if it is allowed...


Karl



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general