So, to get the set of visibilities used in a table, we would have to open all of the rfiles?
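
To make the question concrete, here is a minimal sketch of what a table-level answer might involve if each RFile kept its own visibility histogram, as proposed in the quoted message below. Nothing in it is an Accumulo API: the file names, the per-file counts, and the `table_histogram` and `table_visibilities` helpers are all hypothetical, and the counts are chosen only so that they add up to the 50/50/20 example from the thread.

```python
from collections import Counter

# Hypothetical per-file histograms: one "visibility -> entry count" map per RFile.
# File names and counts are invented for illustration; this is not an Accumulo API.
per_file_histograms = {
    "file-a.rf": Counter({"PRIVATE": 30, "ANONYMIZED": 20}),
    "file-b.rf": Counter({"PRIVATE": 20, "ANONYMIZED": 30, "PUBLIC": 20}),
}


def table_histogram(histograms):
    """Sum the per-file counts into a single table-level histogram."""
    total = Counter()
    for counts in histograms.values():
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total


def table_visibilities(histograms):
    """Return the set of visibility labels seen in any file of the table."""
    return set(table_histogram(histograms))


if __name__ == "__main__":
    for label, count in table_histogram(per_file_histograms).most_common():
        print(f"{label}: {count}")
```

Under these assumptions every file's histogram still has to be consulted, but only the small per-file summary would be read, not the data itself. Whether an aggregate could be kept somewhere cheaper to query is a separate design question.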

> -----Original Message-----
> From: Dylan Hutchison [mailto:[email protected]]
> Sent: Tuesday, October 11, 2016 3:43 PM
> To: Accumulo Dev List
> Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>
> Interesting idea. It begs the question: should we allow any custom index at
> the RFile level? If RFile indexes were user-extensible, then a visibility
> index would be something any developer could write. That said, we can still
> include such an index as an example, and if we did it could be used by the
> Accumulo monitor.
>
> The RFile-level sampling followed this path. I would support further work
> similar to it, though I admit I don't know how difficult a job it entails.
> Bonus points if the index information could be accessed from iterators the
> same way that sampled data can.
>
> I can't speak to the appropriateness of visibility histograms on the monitor
> *by default*, but it would be a strictly useful feature if it could be
> enabled via a conf option.
>
>
> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <[email protected]> wrote:
>
> > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> > mentioned was the lack of insight into the distribution of data marked
> > with certain visibilities in a table. He presented an example similar to
> > this:
> >
> > Imagine a hypothetical system backed by Accumulo which stores medical
> > information. There are three labels in the system: PRIVATE,
> > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
> > considered to identify the individual. ANONYMIZED data is some altered
> > version of the attribute that retains some portion of the original
> > value, but is missing enough context to not identify the individual
> > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
> > attributes which cannot identify the individual.
> >
> > Doctors would be able to read the PRIVATE data, while researchers
> > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > question: how much of each kind of data is in the system? Without
> > knowing how much data is in the system, how can some application
> > developer (who does not have the ability to read all of the PRIVATE
> > data) know that their application is returning a reasonably correct
> > amount of data? (there are many examples of questions which could be
> > answered on this data alone)
> >
> > Concretely, this histogram would look like (50 records with PRIVATE,
> > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >
> > ```
> > PRIVATE: 50
> > ANONYMIZED: 50
> > PUBLIC: 20
> > ```
> >
> > Technically, I think this would actually be relatively simple to
> > implement. Inside of each RFile, we could maintain some histogram of
> > the visibilities observed in that file. This would allow us to very
> > easily report how much data in each table has each visibility label.
> >
> > However, would this feature be harmful to one of the core tenets of
> > Accumulo? Or, is acknowledging the existence of data in Accumulo with
> > a certain visibility acceptable? Would a new permission to use such an
> > API to access this information be sufficient to protect the data?
> >
> > - Josh
> >
