Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Sean Busbey Tue, 11 Oct 2016 13:48:21 -0700

I think a new permission would cover the concern about leaking
meta-information. Even if only the administrative user could see the
histogram (since they can see all data), that'd be a gain.


-- 
Sean Busbey

On Oct 11, 2016 16:33, "Mike Drob" <[email protected]> wrote:

> I've always been under the impression that Accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.
>
> On Tue, Oct 11, 2016, 2:20 PM Josh Elser <[email protected]> wrote:
>
> > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > mentioned was the lack of insight into the distribution of data marked
> > with certain visibilities in a table. He presented an example similar to
> > this:
> >
> > Imagine a hypothetical system backed by Accumulo which stores medical
> > information. There are three labels in the system: PRIVATE, ANONYMIZED,
> > and PUBLIC. PRIVATE data is that which could reasonably be considered to
> > identify the individual. ANONYMIZED data is some altered version of the
> > attribute that retains some portion of the original value, but is
> > missing enough context to not identify the individual (e.g. converting
> > the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> > cannot identify the individual.
> >
> > Doctors would be able to read the PRIVATE data, while researchers could
> > only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> > much of each kind of data is in the system? Without knowing how much
> > data is in the system, how can some application developer (who does not
> > have the ability to read all of the PRIVATE data) know that their
> > application is returning a reasonably correct amount of data? (There
> > are many examples of questions which could be answered from this data
> > alone.)
> >
> > Concretely, this histogram would look like (50 records with PRIVATE, 50
> > with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >
> > ```
> > PRIVATE: 50
> > ANONYMIZED: 50
> > PUBLIC: 20
> > ```
> >
> > Technically, I think this would actually be relatively simple to
> > implement. Inside of each RFile, we could maintain some histogram of the
> > visibilities observed in that file. This would allow us to very easily
> > report how much data in each table has each visibility label.
> >
> > However, would this feature be harmful to one of the core tenets of
> > Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> > certain visibility acceptable? Would a new permission to use such an API
> > to access this information be sufficient to protect the data?
> >
> > - Josh
> >
>
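
For illustration, the per-file histogram idea quoted above could be sketched
roughly as follows. This is only a sketch under stated assumptions, not
Accumulo code: the names file_histogram and table_histogram are hypothetical,
each "file" is a plain Python list standing in for an RFile, and each
visibility is treated as an opaque string rather than a parsed expression.

```python
from collections import Counter


def file_histogram(entries):
    """Count entries per visibility label for one file (a stand-in for an RFile)."""
    return Counter(visibility for _key, visibility in entries)


def table_histogram(file_histograms):
    """Sum the per-file histograms into one table-wide histogram."""
    total = Counter()
    for hist in file_histograms:
        total.update(hist)
    return total


# Two hypothetical files whose contents add up to the 120-record example.
file_a = [(f"row{i}", "PRIVATE") for i in range(50)]
file_a += [(f"row{i}", "PUBLIC") for i in range(50, 70)]
file_b = [(f"row{i}", "ANONYMIZED") for i in range(70, 120)]

histogram = table_histogram([file_histogram(file_a), file_histogram(file_b)])
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {histogram[label]}")
```

Because the per-file counts are simply added together, a table-level report
would only need to read the small histogram stored with each file rather than
scan the data itself. The sketch does not address the permission question
raised in the thread.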
