Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Josh Elser Sun, 16 Oct 2016 21:04:11 -0700

A nice round number to track this work: https://issues.apache.org/jira/browse/ACCUMULO-4500


Josh Elser wrote:

Thanks for the reply, Mike.


Mike Drob wrote:

Hiding this behind the SystemPermission.SYSTEM permission might be
sufficient.


Superb. Personally, I wouldn't want to piggy-back on SYSTEM.SYSTEM
(because that permission implies a lot of other things too), but that's
an implementation detail we can hash out later.

In a situation where Accumulo data is on an encrypted volume, or the
rfiles themselves are encrypted, then a root user wouldn't be able to
read the rfiles to generate the histograms. This matches my initial
mental model of an admin user that doesn't necessarily need access to
data and data users that don't have access to admin commands. There is
no all powerful root user that can do everything and read everything.


I agree with you that we should not assume an admin has the ability to
read all data in all cases. In some cases they might, but encrypted
files are one good example where that cannot happen. I do draw
a distinction between being able to read all data and generating a count
of the unique visibility labels. I think that, in most cases, such a
sketch on the visibilities in the system does not leak any sensitive
data; however, hiding that access behind a system permission is a good
compromise for those whose use-cases I haven't considered :)

Have we ever discussed an "emergency access, give me all the permissions"
model? I feel like I've heard John Vines mention this before, I think.
This would be a reasonable extension of that.


I don't recall hearing of that one before, and I don't think I agree
that this proposal is an extension of it. The number of records in the
system and the visibility of them are purely "metadata" which do not
expose identifying information about the actual data.

Mike

On Fri, Oct 14, 2016 at 11:06 AM, Josh Elser<[email protected]> wrote:

Ping Marc/Mike D.


Josh Elser wrote:

Thanks, Marc. Follow-on question(s) for you:

Do you think _any_ such approach should never be pursued by Accumulo
(reading into your other replies about doing it outside of Accumulo)?
Are the permissions that we have in place not sufficient to protect
such "metadata"?

Or, would such a feature be "OK" to you if it required some degree of
additional manual steps by the administrator? (if so, what steps do you
think make this acceptable)

In a similar vein, how do you see this broadening the scope of the
Accumulo security model in an invalid manner? e.g. Administrators
should never be able to see such information. Someone with sufficient
access to a system would already be able to bypass Accumulo's security
mechanisms. There are a number of vectors already where a
sufficiently-credentialed individual could figure out this information
(and more).

Ultimately, I see Accumulo's main security tenet as "users should never
be allowed to see more data than they are authorized to see". Maybe
it's my interpretation of that or the scope of how you think the
proposed feature would function, but I'd be very interested in hearing
more about what you think.

Marc P. wrote:

My point for discussing implementation outside of accumulo is because I
think it does invalidate a core tenet

On Wed, Oct 12, 2016, 12:26 PM Josh Elser<[email protected]> wrote:

Again, can we please bring this discussion back from discussions of
implementations to security?

Does the fact that you three were discussing implementations imply
that you do not think this invalidates one of the core tenets (security
first) of Accumulo?

Christopher wrote:

Keith, Russ, myself (and possibly others) were discussing this at the
hackathon after the Accumulo Summit, and I think our consensus was
basically this:

We need a generic pluggable mechanism for injecting arbitrary user
counters into the RFiles. We can then use these counters in custom
compaction strategies, or other analysis. We can aggregate these
counters at the tablet, and table levels, and expose them in the API.

These counters could store information about visibility frequencies,
number of delete entries, etc.

The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.

In the discussion, there were lots of variations on the theme, though.
So, the actual implementation could vary. But, having something like
this could support a large number of use cases beyond just the
histogram case.
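
A minimal sketch of what such a pluggable counter could look like, assuming nothing beyond the Function<Entry<Key,Value>,Map<String,Long>> shape mentioned above. SketchKey is a hypothetical stand-in for Accumulo's Key, and UserCounterSketch, COUNTER, aggregate and entry are illustrative names only, not existing Accumulo API:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch: SketchKey stands in for Accumulo's Key; nothing here is real Accumulo API.
final class UserCounterSketch {

    /** Minimal stand-in for a key: a row, a visibility label, and a delete flag. */
    static final class SketchKey {
        final String row;
        final String visibility;
        final boolean deleted;

        SketchKey(String row, String visibility, boolean deleted) {
            this.row = row;
            this.visibility = visibility;
            this.deleted = deleted;
        }
    }

    /** One possible counter: +1 for the entry's visibility, plus +1 under "deletes" for delete entries. */
    static final Function<Entry<SketchKey, byte[]>, Map<String, Long>> COUNTER = entry -> {
        Map<String, Long> out = new TreeMap<>();
        out.put("vis:" + entry.getKey().visibility, 1L);
        if (entry.getKey().deleted) {
            out.put("deletes", 1L);
        }
        return out;
    };

    /** Applies a counter function to every entry and sums the results by counter name. */
    static <K, V> Map<String, Long> aggregate(
            List<Entry<K, V>> entries, Function<Entry<K, V>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<K, V> e : entries) {
            counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    /** Convenience factory for a sketch entry with an empty value. */
    static Entry<SketchKey, byte[]> entry(String row, String vis, boolean deleted) {
        return new SimpleImmutableEntry<>(new SketchKey(row, vis, deleted), new byte[0]);
    }

    public static void main(String[] args) {
        List<Entry<SketchKey, byte[]>> entries = Arrays.asList(
                entry("r1", "PRIVATE", false),
                entry("r2", "PUBLIC", false),
                entry("r3", "PRIVATE", true));
        // prints {deletes=1, vis:PRIVATE=2, vis:PUBLIC=1}
        System.out.println(aggregate(entries, COUNTER));
    }
}
```

Because each counter result is just a Map<String,Long>, the same summing step could in principle be reused at the file, tablet and table levels; where and when it would actually run is left open here, as it was in the discussion.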

On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <[email protected]> wrote:

Trivially. We could do something more intelligent like also cache it in
metadata (updating with compactions). Don't read too much into the
implementation at this point; it was just the first idea I had about
how we could do it :). I'm more concerned with the idea and its
security implications right now.

In general, it seems like people are ok with it protected by a new
permission role. Do you have more to add, Mike? Was your comment based
on your interpretation of how Accumulo works or more a concern about
implementing such a feature?

On Oct 11, 2016 21:29,<[email protected]> wrote:

So, to get the set of visibilities used in a table, we would have to
open all of the rfiles?

-----Original Message-----

From: Dylan Hutchison [mailto:[email protected]]
Sent: Tuesday, October 11, 2016 3:43 PM
To: Accumulo Dev List
Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Interesting idea. It begs the question: should we allow any custom
index at the RFile level? If RFile indexes were user-extensible, then a
visibility index would be something any developer could write. That
said, we can still include such an index as an example, and if we did
it could be used by the Accumulo monitor.

The RFile-level sampling followed this path. I would support further
work similar to it, though I admit I don't know how difficult a job it
entails. Bonus points if the index information could be accessed from
iterators the same way that sampled data can.

I can't speak to the appropriateness of visibility histograms on the
monitor *by default*, but it would be a strictly useful feature if it
could be enabled via a conf option.


On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <[email protected]> wrote:

Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
mentioned was the lack of insight into the distribution of data marked
with certain visibilities in a table. He presented an example similar
to this:

Imagine a hypothetical system backed by Accumulo which stores medical
information. There are three labels in the system: PRIVATE,
ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
considered to identify the individual. ANONYMIZED data is some altered
version of the attribute that retains some portion of the original
value, but is missing enough context to not identify the individual
(e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
attributes which cannot identify the individual.

Doctors would be able to read the PRIVATE data, while researchers
could only read the ANONYMIZED and PUBLIC data. This leads to a
question: how much of each kind of data is in the system? Without
knowing how much data is in the system, how can some application
developer (who does not have the ability to read all of the PRIVATE
data) know that their application is returning a reasonably correct
amount of data? (there are many examples of questions which could be
answered on this data alone)

Concretely, this histogram would look like (50 records with PRIVATE,
50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):

```
PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20
```

Technically, I think this would actually be relatively simple to
implement. Inside of each RFile, we could maintain some histogram of
the visibilities observed in that file. This would allow us to very
easily report how much data in each table has each visibility label.
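
As a rough sketch of the per-file idea, assuming each file's entries can be reduced to a list of visibility strings: one histogram is built per file, and the per-file histograms are summed into a table-level view. VisibilityHistogramSketch, histogramForFile and mergeHistograms are hypothetical names for illustration and do not exist in Accumulo:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are Accumulo API, and plain strings
// stand in for the visibility of each entry in an RFile.
final class VisibilityHistogramSketch {

    /** Counts how often each visibility label appears among the entries of one file. */
    static Map<String, Long> histogramForFile(List<String> visibilitiesInFile) {
        Map<String, Long> counts = new TreeMap<>();
        for (String visibility : visibilitiesInFile) {
            counts.merge(visibility, 1L, Long::sum);
        }
        return counts;
    }

    /** Sums per-file histograms into one table-level histogram. */
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> fileCounts : perFile) {
            fileCounts.forEach((label, n) -> total.merge(label, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two pretend files; each list element is the visibility of one entry.
        Map<String, Long> fileA = histogramForFile(Arrays.asList("PRIVATE", "PRIVATE", "PUBLIC"));
        Map<String, Long> fileB = histogramForFile(Arrays.asList("ANONYMIZED", "PRIVATE"));
        // prints {ANONYMIZED=1, PRIVATE=3, PUBLIC=1}
        System.out.println(mergeHistograms(Arrays.asList(fileA, fileB)));
    }
}
```

One caveat with a plain sum like this: entries that are deleted or superseded but not yet compacted away would still be counted, so the table-level numbers would be approximate between compactions.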

However, would this feature be harmful to one of the core tenets of
Accumulo? Or, is acknowledging the existence of data in Accumulo with
a certain visibility acceptable? Would a new permission to use such an
API to access this information be sufficient to protect the data?

- Josh
