RE: [DISCUSS] Would a visibility histogram on a table be harmful?

Josh Elser Tue, 11 Oct 2016 19:07:03 -0700

Trivially. We could do something more intelligent like also cache it in
metadata (updating with compactions). Don't read too much into the
implementation at this point; it was just the first idea I had about how we
could do it :). I'm more concerned with the idea and its security
implications right now.


In general, it seems like people are ok with it protected by a new
permission role. Do you have more to add, Mike? Was your comment based on
your interpretation of how Accumulo works or more a concern about
implementing such a feature?

On Oct 11, 2016 21:29, <[email protected]> wrote:

> So, to get the set of visibilities used in a table, we would have to open
> all of the rfiles?
>
> > -----Original Message-----
> > From: Dylan Hutchison [mailto:[email protected]]
> > Sent: Tuesday, October 11, 2016 3:43 PM
> > To: Accumulo Dev List
> > Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
> >
> > Interesting idea.  It raises the question: should we allow any custom
> > index at the RFile level?  If RFile indexes were user-extensible, then a
> > visibility index would be something any developer could write.  That said,
> > we can still include such an index as an example, and if we did it could
> > be used by the Accumulo monitor.
> >
> > The RFile-level sampling followed this path.  I would support further
> > work similar to it, though I admit I don't know how difficult a job it
> > entails.  Bonus points if the index information could be accessed from
> > iterators the same way that sampled data can.
> >
> > I can't speak to the appropriateness of visibility histograms on the
> > monitor *by default*, but it would be a strictly useful feature if it
> > could be enabled via a conf option.
> >
> >
> > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <[email protected]> wrote:
> >
> > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > > mentioned was the lack of insight into the distribution of data marked
> > > with certain visibilities in a table. He presented an example similar
> > > to this:
> > >
> > > Imagine a hypothetical system backed by Accumulo which stores medical
> > > information. There are three labels in the system: PRIVATE,
> > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
> > > considered to identify the individual. ANONYMIZED data is some altered
> > > version of the attribute that retains some portion of the original
> > > value, but is missing enough context to not identify the individual
> > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
> > > attributes which cannot identify the individual.
> > >
> > > Doctors would be able to read the PRIVATE data, while researchers
> > > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > > question: how much of each kind of data is in the system? Without
> > > knowing how much data is in the system, how can some application
> > > developer (who does not have the ability to read all of the PRIVATE
> > > data) know that their application is returning a reasonably correct
> > > amount of data? (There are many examples of questions which could be
> > > answered from this data alone.)
> > >
> > > Concretely, this histogram would look like (50 records with PRIVATE,
> > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> > >
> > > ```
> > > PRIVATE: 50
> > > ANONYMIZED: 50
> > > PUBLIC: 20
> > > ```
> > >
> > > Technically, I think this would actually be relatively simple to
> > > implement. Inside of each RFile, we could maintain some histogram of
> > > the visibilities observed in that file. This would allow us to very
> > > easily report how much data in each table has each visibility label.
> > >
> > > However, would this feature be harmful to one of the core tenets of
> > > Accumulo? Or, is acknowledging the existence of data in Accumulo with
> > > a certain visibility acceptable? Would a new permission to use such an
> > > API to access this information be sufficient to protect the data?
> > >
> > > - Josh
> > >
>
>
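
For readers following the thread, the per-RFile histogram idea from the original message can be sketched roughly as below. This is only an illustration in plain Python and uses no Accumulo API: the function names `file_histogram` and `table_histogram` and the split of records across two files are hypothetical, chosen so that the totals match the 50/50/20 example above. It assumes each file keeps a count per visibility label and that a table-wide view is simply the sum of the per-file counts.

```python
from collections import Counter


def file_histogram(visibilities):
    """Count how many entries in one (hypothetical) file carry each label."""
    return Counter(visibilities)


def table_histogram(per_file_histograms):
    """Sum per-file histograms into a single table-wide histogram."""
    total = Counter()
    for hist in per_file_histograms:
        total.update(hist)  # adds counts label by label
    return total


# Two made-up files whose entries add up to the example in the thread.
file_a = file_histogram(["PRIVATE"] * 30 + ["ANONYMIZED"] * 50)
file_b = file_histogram(["PRIVATE"] * 20 + ["PUBLIC"] * 20)
table = table_histogram([file_a, file_b])

for label, count in table.items():
    print(f"{label}: {count}")
```

Because the merge is a plain per-label sum, the same operation could in principle also combine the histograms of files being merged by a compaction, which is one way the "updating with compactions" suggestion might work. That is an assumption about a possible design, not a description of anything implemented.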
