I think a new permission would cover the concern about leaking meta-information. Even if only the administrative user could see the histogram (since they can see all data), that'd be a gain.
-- Sean Busbey

On Oct 11, 2016 16:33, "Mike Drob" <[email protected]> wrote:

> I've always been under the impression that Accumulo was not supposed to
> confirm the existence of data that a user did not have permission to read.
>
> On Tue, Oct 11, 2016, 2:20 PM Josh Elser <[email protected]> wrote:
>
> > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > mentioned was the lack of insight into the distribution of data marked
> > with certain visibilities in a table. He presented an example similar to
> > this:
> >
> > Imagine a hypothetical system backed by Accumulo which stores medical
> > information. There are three labels in the system: PRIVATE, ANONYMIZED,
> > and PUBLIC. PRIVATE data is that which could reasonably be considered to
> > identify the individual. ANONYMIZED data is some altered version of the
> > attribute that retains some portion of the original value, but is
> > missing enough context to not identify the individual (e.g. converting
> > the name "Josh Elser" to "J E"). PUBLIC data is for attributes which
> > cannot identify the individual.
> >
> > Doctors would be able to read the PRIVATE data, while researchers could
> > only read the ANONYMIZED and PUBLIC data. This leads to a question: how
> > much of each kind of data is in the system? Without knowing how much
> > data is in the system, how can some application developer (who does not
> > have the ability to read all of the PRIVATE data) know that their
> > application is returning a reasonably correct amount of data? (There
> > are many examples of questions which could be answered from this data
> > alone.)
> >
> > Concretely, this histogram would look like (50 records with PRIVATE, 50
> > with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >
> > ```
> > PRIVATE: 50
> > ANONYMIZED: 50
> > PUBLIC: 20
> > ```
> >
> > Technically, I think this would actually be relatively simple to
> > implement. Inside of each RFile, we could maintain some histogram of the
> > visibilities observed in that file. This would allow us to very easily
> > report how much data in each table has each visibility label.
> >
> > However, would this feature be harmful to one of the core tenets of
> > Accumulo? Or, is acknowledging the existence of data in Accumulo with a
> > certain visibility acceptable? Would a new permission to use such an API
> > to access this information be sufficient to protect the data?
> >
> > - Josh
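
For illustration only, here is a rough sketch of the per-file histogram idea quoted above, with the table-level numbers obtained by summing the per-file counts. It is plain Python, not Accumulo code: the lists of (key, visibility) pairs stand in for RFiles, each visibility is treated as an opaque string rather than a parsed expression, and the permission name `READ_VISIBILITY_HISTOGRAM` and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical permission name; nothing like this exists in Accumulo today.
HISTOGRAM_PERMISSION = "READ_VISIBILITY_HISTOGRAM"


def file_histogram(entries):
    """Count entries per visibility label for one file.

    `entries` is an iterable of (key, visibility) pairs standing in for
    the contents of a single RFile.
    """
    return Counter(visibility for _key, visibility in entries)


def table_histogram(file_histograms):
    """Merge per-file histograms into one table-level histogram."""
    total = Counter()
    for histogram in file_histograms:
        total.update(histogram)  # adds counts label by label
    return total


def read_histogram(user_permissions, histogram):
    """Return the histogram only if the caller holds the new permission."""
    if HISTOGRAM_PERMISSION not in user_permissions:
        raise PermissionError("missing " + HISTOGRAM_PERMISSION)
    return dict(histogram)


# Two stand-in files whose counts add up to the example in the thread.
file_a = [("row1", "PRIVATE")] * 30 + [("row2", "ANONYMIZED")] * 50
file_b = [("row3", "PRIVATE")] * 20 + [("row4", "PUBLIC")] * 20

table = table_histogram([file_histogram(file_a), file_histogram(file_b)])
admin_view = read_histogram({HISTOGRAM_PERMISSION}, table)
```

Because the counts are kept per file and only summed on request, the table-level view needs no scan of the data itself. A caller without the permission gets an error rather than a partial histogram, so the labels themselves are not confirmed to them.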
