Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Josh Elser Wed, 12 Oct 2016 12:59:05 -0700

I was envisioning a public API protected by a system permission (implying some Thrift RPC as well) if that is an important distinction for those with concerns. I am hoping to get more info from Mike/Marc about why they feel this is insufficient WRT Accumulo's security model.


Keith Turner wrote:

We did discuss making this info available through the public API (and
adding thrift calls to gather it).   We discussed the possibility of
adding a new permission.


On Wed, Oct 12, 2016 at 2:35 PM, ivan bella<[email protected]>  wrote:

I do not see how this invalidates any security of the system unless you are 
summarizing these counters and making them available through a thrift or other 
call; don't do that unless other security is put in place.  To get a summary I 
would think you would have to use a separate utility to scrape the rfiles.  
This metadata should only be accessible to a system administrator.  The BIG 
presumption here is that it is significantly faster to grab this metadata
out than it is to scan all of the keys in the rfile.

On October 12, 2016 at 1:41 PM Josh Elser<[email protected]>  wrote:

Thanks, Marc. Follow-on question(s) for you:

Do you think _any_ such approach should never be pursued by Accumulo
(reading into your other replies about doing it outside of Accumulo)?
Are the permissions that we have in place not sufficient to protect such
"metadata"?

Or, would such a feature be "OK" to you if it required some degree of
additional manual steps by the administrator? (if so, what steps do you
think make this acceptable)

In a similar vein, how do you see this broadening the scope of the
Accumulo security model in an invalid manner? e.g. Administrators should
never be able to see such information. Someone with sufficient access to
a system would already be able to bypass Accumulo's security mechanisms.
There are a number of vectors already where a sufficiently-credentialed
individual could figure out this information (and more).

Ultimately, I see Accumulo's main security tenet as "users should never
be allowed to see more data than they are authorized to see". Maybe it's
my interpretation of that or the scope of how you think the proposed
feature would function, but I'd be very interested in hearing more about
what you think.

Marc P. wrote:

My point for discussing implementation outside of accumulo is because I
think it does invalidate a core tenet.

On Wed, Oct 12, 2016, 12:26 PM Josh Elser<[email protected]>  wrote:

Again, can we please bring this discussion back from discussions of
implementations to security?

Does the fact that you three were discussing implementations imply that
you do not think this invalidates one of the core tenets (security
first) of Accumulo?

Christopher wrote:

Keith, Russ, and I (and possibly others) were discussing this at the
hackathon after the Accumulo Summit, and I think our consensus was
basically this:

We need a generic pluggable mechanism for injecting arbitrary user
counters into the RFiles. We can then use these counters in custom
compaction strategies, or other analysis. We can aggregate these
counters at the tablet, and table levels, and expose them in the API.

These counters could store information about visibility frequencies,
number of delete entries, etc.

The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.
In the discussion, there were lots of variations on the theme, though.
So, the actual implementation could vary. But, having something like
this could support a large number of use cases beyond just the
histogram case.
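
To make the Function<Entry<Key,Value>,Map<String, Long>> idea above concrete, here is a minimal sketch of how such a counter function might be applied to the entries of one file and summed. It is not Accumulo code: the nested Key class is a hypothetical stand-in for org.apache.accumulo.core.data.Key, the value is modeled as a plain String, and the names CounterSketch, COUNTERS, and summarize are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

class CounterSketch {
    // Hypothetical stand-in for Accumulo's Key: only the fields this sketch needs.
    static final class Key {
        final String row;
        final String visibility;
        final boolean deleted;

        Key(String row, String visibility, boolean deleted) {
            this.row = row;
            this.visibility = visibility;
            this.deleted = deleted;
        }
    }

    // Hypothetical counter plugin in the shape floated in the thread.
    // For each entry it emits one count for the visibility label and,
    // for delete entries, one count under "deletes".
    static final Function<Entry<Key, String>, Map<String, Long>> COUNTERS = entry -> {
        Map<String, Long> counts = new TreeMap<>();
        counts.put("vis:" + entry.getKey().visibility, 1L);
        if (entry.getKey().deleted) {
            counts.put("deletes", 1L);
        }
        return counts;
    };

    // Apply the counter function to every entry and sum the results by counter name.
    static Map<String, Long> summarize(List<Entry<Key, String>> entries,
            Function<Entry<Key, String>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<Key, String> entry : entries) {
            counter.apply(entry).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Entry<Key, String>> entries = List.of(
            Map.entry(new Key("r1", "PRIVATE", false), "v1"),
            Map.entry(new Key("r2", "PUBLIC", false), "v2"),
            Map.entry(new Key("r3", "PRIVATE", true), ""));
        System.out.println(summarize(entries, COUNTERS));
    }
}
```

Because the function returns a map of named counters rather than a single number, one plugin can feed several use cases (visibility frequencies, delete counts) from a single pass over the entries.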

On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<[email protected]>
wrote:

Trivially. We could do something more intelligent like also cache it in
metadata (updating with compactions). Don't read too much into the
implementation at this point; it was just the first idea I had about
how we could do it :). I'm more concerned with the idea and its
security implications right now.

In general, it seems like people are ok with it protected by a new
permission role. Do you have more to add, Mike? Was your comment based
on your interpretation of how Accumulo works or more a concern about
implementing such a feature?

On Oct 11, 2016 21:29,<[email protected]>  wrote:

So, to get the set of visibilities used in a table, we would have to
open all of the rfiles?

-----Original Message-----
From: Dylan Hutchison [mailto:[email protected]]
Sent: Tuesday, October 11, 2016 3:43 PM
To: Accumulo Dev List
Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Interesting idea. It raises the question: should we allow any custom
index at the RFile level? If RFile indexes were user-extensible, then a
visibility index would be something any developer could write. That
said, we can still include such an index as an example, and if we did
it could be used by the Accumulo monitor.

The RFile-level sampling followed this path. I would support further
work similar to it, though I admit I don't know how difficult a job it
entails. Bonus points if the index information could be accessed from
iterators the same way that sampled data can.

I can't speak to the appropriateness of visibility histograms on the
monitor *by default*, but it would be a strictly useful feature if it
could be enabled via a conf option.

On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<[email protected]>
wrote:

Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
mentioned was the lack of insight into the distribution of data marked
with certain visibilities in a table. He presented an example similar
to this:

Imagine a hypothetical system backed by Accumulo which stores medical
information. There are three labels in the system: PRIVATE,
ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
considered to identify the individual. ANONYMIZED data is some altered
version of the attribute that retains some portion of the original
value, but is missing enough context to not identify the individual
(e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
attributes which cannot identify the individual.

Doctors would be able to read the PRIVATE data, while researchers
could only read the ANONYMIZED and PUBLIC data. This leads to a
question: how much of each kind of data is in the system? Without
knowing how much data is in the system, how can some application
developer (who does not have the ability to read all of the PRIVATE
data) know that their application is returning a reasonably correct
amount of data? (There are many examples of questions which could be
answered on this data alone.)

Concretely, this histogram would look like (50 records with PRIVATE,
50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):

PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20

Technically, I think this would actually be relatively simple to
implement. Inside of each RFile, we could maintain some histogram of
the visibilities observed in that file. This would allow us to very
easily report how much data in each table has each visibility label.
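
As a rough illustration of the reporting step, the sketch below merges per-file histograms into one table-level histogram. It assumes each RFile's histogram is already available as a map from visibility label to entry count; how that map would be stored in or read from an RFile is not specified in this thread. The class and method names, and the per-file numbers, are hypothetical; the numbers were chosen only so that they sum to the 50/50/20 example above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class VisibilityHistogramSketch {
    // Merge per-file histograms (visibility label -> entry count)
    // into a single table-level histogram by summing counts per label.
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> table = new TreeMap<>();
        for (Map<String, Long> histogram : perFile) {
            histogram.forEach((label, count) -> table.merge(label, count, Long::sum));
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical histograms for two files of the same table.
        List<Map<String, Long>> perFile = List.of(
            Map.of("PRIVATE", 30L, "ANONYMIZED", 20L),
            Map.of("PRIVATE", 20L, "ANONYMIZED", 30L, "PUBLIC", 20L));
        mergeHistograms(perFile)
            .forEach((label, count) -> System.out.println(label + ": " + count));
    }
}
```

Since merging is a plain per-label sum, the same routine could in principle be reused when a compaction combines several files into one.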

However, would this feature be harmful to one of the core tenets of
Accumulo? Or, is acknowledging the existence of data in Accumulo with
a certain visibility acceptable? Would a new permission to use such an
API to access this information be sufficient to protect the data?

- Josh
