My point for discussing implementation outside of Accumulo is that I think it does invalidate a core tenet.
On Wed, Oct 12, 2016, 12:26 PM Josh Elser <[email protected]> wrote:

> Again, can we please bring this discussion back from discussions of
> implementations to security?
>
> Does the fact that you three were discussing implementations imply that
> you do not think this invalidates one of the core tenets (security
> first) of Accumulo?
>
> Christopher wrote:
> > Keith, Russ, myself (and possibly others) were discussing this at the
> > hackathon after the Accumulo Summit, and I think our consensus was
> > basically this:
> >
> > We need a generic pluggable mechanism for injecting arbitrary user
> > counters into the RFiles. We can then use these counters in custom
> > compaction strategies, or other analysis. We can aggregate these
> > counters at the tablet, and table levels, and expose them in the API.
> >
> > These counters could store information about visibility frequencies,
> > number of delete entries, etc.
> >
> > The interface might just be a
> > Function<Entry<Key,Value>,Map<String,Long>>.
> >
> > In the discussion, there were lots of variations on the theme, though.
> > So, the actual implementation could vary. But, having something like
> > this could support a large number of use cases beyond just the
> > histogram case.
> >
> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <[email protected]> wrote:
> >
> >> Trivially. We could do something more intelligent like also cache it
> >> in metadata (updating with compactions). Don't read too much into the
> >> implementation at this point; it was just the first idea I had about
> >> how we could do it :). I'm more concerned with the idea and its
> >> security implications right now.
> >>
> >> In general, it seems like people are ok with it protected by a new
> >> permission role. Do you have more to add, Mike? Was your comment
> >> based on your interpretation of how Accumulo works or more a concern
> >> about implementing such a feature?
> >>
> >> On Oct 11, 2016 21:29, <[email protected]> wrote:
> >>
> >>> So, to get the set of visibilities used in a table, we would have to
> >>> open all of the rfiles?
> >>>
> >>>> -----Original Message-----
> >>>> From: Dylan Hutchison [mailto:[email protected]]
> >>>> Sent: Tuesday, October 11, 2016 3:43 PM
> >>>> To: Accumulo Dev List
> >>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> >>>> harmful?
> >>>>
> >>>> Interesting idea. It begs the question: should we allow any custom
> >>>> index at the RFile level? If RFile indexes were user-extensible,
> >>>> then a visibility index would be something any developer could
> >>>> write. That said, we can still include such an index as an example,
> >>>> and if we did it could be used by the Accumulo monitor.
> >>>>
> >>>> The RFile-level sampling followed this path. I would support
> >>>> further work similar to it, though I admit I don't know how
> >>>> difficult a job it entails. Bonus points if the index information
> >>>> could be accessed from iterators the same way that sampled data can.
> >>>>
> >>>> I can't speak to the appropriateness of visibility histograms on
> >>>> the monitor *by default*, but it would be a strictly useful feature
> >>>> if it could be enabled via a conf option.
> >>>>
> >>>>
> >>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <[email protected]>
> >>>> wrote:
> >>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One
> >>>>> topic he mentioned was the lack of insight into the distribution
> >>>>> of data marked with certain visibilities in a table. He presented
> >>>>> an example similar to this:
> >>>>>
> >>>>> Imagine a hypothetical system backed by Accumulo which stores
> >>>>> medical information. There are three labels in the system:
> >>>>> PRIVATE, ANONYMIZED, and PUBLIC. PRIVATE data is that which could
> >>>>> reasonably be considered to identify the individual. ANONYMIZED
> >>>>> data is some altered version of the attribute that retains some
> >>>>> portion of the original value, but is missing enough context to
> >>>>> not identify the individual (e.g. converting the name "Josh Elser"
> >>>>> to "J E"). PUBLIC data is for attributes which cannot identify the
> >>>>> individual.
> >>>>>
> >>>>> Doctors would be able to read the PRIVATE data, while researchers
> >>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
> >>>>> question: how much of each kind of data is in the system? Without
> >>>>> knowing how much data is in the system, how can some application
> >>>>> developer (who does not have the ability to read all of the
> >>>>> PRIVATE data) know that their application is returning a
> >>>>> reasonably correct amount of data? (there are many examples of
> >>>>> questions which could be answered on this data alone)
> >>>>>
> >>>>> Concretely, this histogram would look like (50 records with
> >>>>> PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records
> >>>>> total):
> >>>>>
> >>>>> ```
> >>>>> PRIVATE: 50
> >>>>> ANONYMIZED: 50
> >>>>> PUBLIC: 20
> >>>>> ```
> >>>>>
> >>>>> Technically, I think this would actually be relatively simple to
> >>>>> implement. Inside of each RFile, we could maintain some histogram
> >>>>> of the visibilities observed in that file. This would allow us to
> >>>>> very easily report how much data in each table has each visibility
> >>>>> label.
> >>>>>
> >>>>> However, would this feature be harmful to one of the core tenets
> >>>>> of Accumulo? Or, is acknowledging the existence of data in
> >>>>> Accumulo with a certain visibility acceptable? Would a new
> >>>>> permission to use such an API to access this information be
> >>>>> sufficient to protect the data?
> >>>>>
> >>>>> - Josh
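
For reference, the counter idea quoted above (a Function<Entry<Key,Value>,Map<String,Long>> applied to each entry, with the results aggregated per RFile and then per tablet/table) could be sketched roughly like this. It is only an illustration under stated assumptions, not anything that exists in Accumulo: the class name, VISIBILITY_COUNTER, countFile and merge are all made up, and plain Strings stand in for Key and Value so that it runs without Accumulo on the classpath.

```java
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

public class VisibilityCounterSketch {

    // Hypothetical counter function: maps one entry to named counter increments.
    // Assumption: the entry key is just the visibility label as a String,
    // standing in for a real Accumulo Key; the value is ignored.
    static final Function<Entry<String, String>, Map<String, Long>> VISIBILITY_COUNTER =
        entry -> Map.of(entry.getKey(), 1L);

    // Applies a counter function to every entry of one "file" and sums the
    // increments per counter name. This models the per-RFile histogram.
    static Map<String, Long> countFile(
            List<Entry<String, String>> entries,
            Function<Entry<String, String>, Map<String, Long>> counter) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry<String, String> e : entries) {
            counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    // Sums per-file counters into one map. This models aggregating at the
    // tablet or table level without re-reading the entries themselves.
    static Map<String, Long> merge(List<Map<String, Long>> perFile) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> m : perFile) {
            m.forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> file1 = List.of(
            Map.entry("PRIVATE", "Josh Elser"),
            Map.entry("PUBLIC", "some value"),
            Map.entry("PRIVATE", "another name"));
        List<Entry<String, String>> file2 = List.of(
            Map.entry("ANONYMIZED", "J E"),
            Map.entry("PRIVATE", "a third name"));

        Map<String, Long> table = merge(List.of(
            countFile(file1, VISIBILITY_COUNTER),
            countFile(file2, VISIBILITY_COUNTER)));

        // Print one "LABEL: count" line per visibility, like the histogram above.
        table.forEach((label, n) -> System.out.println(label + ": " + n));
    }
}
```

Note that the sketch says nothing about who may read the merged map, which is the security question this thread is actually about.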
