I've always been under the impression that Accumulo was not supposed to confirm the existence of data that a user did not have permission to read.
On Tue, Oct 11, 2016, 2:20 PM Josh Elser <josh.el...@gmail.com> wrote:

> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> mentioned was the lack of insight into the distribution of data marked
> with certain visibilities in a table. He presented an example similar to
> this:
>
> Imagine a hypothetical system backed by Accumulo which stores medical
> information. There are three labels in the system: PRIVATE, ANONYMIZED,
> and PUBLIC. PRIVATE data is that which could reasonably be considered to
> identify the individual. ANONYMIZED data is some altered version of the
> attribute that retains some portion of the original value, but is
> missing enough context to not identify the individual (e.g. converting
> the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> cannot identify the individual.
>
> Doctors would be able to read the PRIVATE data, while researchers could
> only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> much of each kind of data is in the system? Without knowing how much
> data is in the system, how can some application developer (who does not
> have the ability to read all of the PRIVATE data) know that their
> application is returning a reasonably correct amount of data? (There
> are many examples of questions which could be answered on this data
> alone.)
>
> Concretely, this histogram would look like (50 records with PRIVATE, 50
> with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>
> ```
> PRIVATE: 50
> ANONYMIZED: 50
> PUBLIC: 20
> ```
>
> Technically, I think this would actually be relatively simple to
> implement. Inside of each RFile, we could maintain some histogram of the
> visibilities observed in that file. This would allow us to very easily
> report how much data in each table has each visibility label.
>
> However, would this feature be harmful to one of the core tenets of
> Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> certain visibility acceptable? Would a new permission to use such an API
> to access this information be sufficient to protect the data?
>
> - Josh
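
For reference, here is a rough sketch of the per-file histogram idea from the quoted message, as I understand it. This is plain Python and not Accumulo code: the in-memory lists stand in for RFiles, and the function names and the `READ_VISIBILITY_HISTOGRAM` permission are hypothetical names made up for illustration. It only shows the shape of the proposal: count labels per file, merge the counts per table, and gate the result behind a separate permission.

```python
from collections import Counter

# Hypothetical permission name, invented for this sketch; not an Accumulo permission.
HISTOGRAM_PERMISSION = "READ_VISIBILITY_HISTOGRAM"


def file_histogram(entries):
    """Count entries per visibility label in one file (a stand-in for an RFile)."""
    return Counter(visibility for _key, visibility in entries)


def table_histogram(file_histograms):
    """Merge per-file histograms into one table-wide histogram."""
    total = Counter()
    for histogram in file_histograms:
        total.update(histogram)
    return total


def read_table_histogram(file_histograms, user_permissions):
    """Return the table histogram only to users holding the hypothetical permission."""
    if HISTOGRAM_PERMISSION not in user_permissions:
        raise PermissionError("user may not read visibility histograms")
    return table_histogram(file_histograms)


# Two in-memory "files" that together hold the 120 records from the example.
file_a = [(f"row{i}", "PRIVATE") for i in range(50)]
file_a += [(f"row{i}", "PUBLIC") for i in range(50, 60)]
file_b = [(f"row{i}", "ANONYMIZED") for i in range(60, 110)]
file_b += [(f"row{i}", "PUBLIC") for i in range(110, 120)]

histograms = [file_histogram(file_a), file_histogram(file_b)]
totals = read_table_histogram(histograms, {HISTOGRAM_PERMISSION})
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {totals[label]}")
```

Note that even with the permission check, anyone who can call this learns that PRIVATE data exists and roughly how much of it there is, which is exactly the concern above.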