[DISCUSS] Would a visibility histogram on a table be harmful?

Josh Elser Tue, 11 Oct 2016 12:21:03 -0700

Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he mentioned was the lack of insight into the distribution of data marked with certain visibilities in a table. He presented an example similar to this:

Imagine a hypothetical system backed by Accumulo which stores medical information. There are three labels in the system: PRIVATE, ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be considered to identify the individual. ANONYMIZED data is some altered version of the attribute that retains some portion of the original value, but is missing enough context to not identify the individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for attributes which cannot identify the individual.

Doctors would be able to read the PRIVATE data, while researchers could only read the ANONYMIZED and PUBLIC data. This leads to a question: how much of each kind of data is in the system? Without knowing how much data is in the system, how can some application developer (who does not have the ability to read all of the PRIVATE data) know that their application is returning a reasonably correct amount of data? (There are many examples of questions which could be answered with this data alone.)

Concretely, this histogram would look like (50 records with PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):


```
PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20
```

Technically, I think this would actually be relatively simple to implement. Inside of each RFile, we could maintain some histogram of the visibilities observed in that file. This would allow us to very easily report how much data in each table has each visibility label.
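
A minimal sketch of the per-file idea, in plain Java with no Accumulo dependency. The class `VisibilityHistogram` and its methods are hypothetical names chosen for illustration, not an existing Accumulo API, and the sketch assumes each file's histogram is simply a map from the visibility string to a count that is summed across files to get the table-level view:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-file visibility histogram; NOT an existing Accumulo class. */
final class VisibilityHistogram {
    // Visibility expression (as written on the entry) -> number of entries seen.
    private final Map<String, Long> counts = new TreeMap<>();

    /** Record one entry written to the file with the given visibility. */
    void observe(String visibility) {
        counts.merge(visibility, 1L, Long::sum);
    }

    /** Number of entries observed with exactly this visibility string. */
    long count(String visibility) {
        return counts.getOrDefault(visibility, 0L);
    }

    /** Total number of entries observed across all visibilities. */
    long total() {
        long sum = 0;
        for (long c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Read-only view of the counts, sorted by visibility string. */
    Map<String, Long> asMap() {
        return Collections.unmodifiableMap(counts);
    }

    /** Table-level view (assumption): sum the histograms of every file in the table. */
    static VisibilityHistogram merge(List<VisibilityHistogram> perFile) {
        VisibilityHistogram table = new VisibilityHistogram();
        for (VisibilityHistogram h : perFile) {
            h.counts.forEach((vis, n) -> table.counts.merge(vis, n, Long::sum));
        }
        return table;
    }
}
```

Under these assumptions, a reported number would be a count of entries present in the files, so it would likely be approximate until compaction removes deleted or superseded entries.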

However, would this feature be harmful to one of the core tenets of Accumulo? Or, is acknowledging the existence of data in Accumulo with a certain visibility acceptable? Would a new permission to use such an API to access this information be sufficient to protect the data?


- Josh
