Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Marc P. Wed, 12 Oct 2016 09:35:29 -0700

My reason for discussing an implementation outside of Accumulo is that I
think it does invalidate a core tenet.


On Wed, Oct 12, 2016, 12:26 PM Josh Elser <[email protected]> wrote:

> Again, can we please bring this discussion back from discussions of
> implementations to security?
>
> Does the fact that you three were discussing implementations imply that
> you do not think this invalidates one of the core tenets (security
> first) of Accumulo?
>
> Christopher wrote:
> > Keith, Russ, and I (and possibly others) were discussing this at the
> > hackathon after the Accumulo Summit, and I think our consensus was
> > basically this:
> >
> > We need a generic pluggable mechanism for injecting arbitrary user
> > counters into the RFiles. We can then use these counters in custom
> > compaction strategies, or other analysis. We can aggregate these
> > counters at the tablet and table levels, and expose them in the API.
> >
> > These counters could store information about visibility frequencies,
> > number of delete entries, etc.
> >
> > The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
> >
> > In the discussion, there were lots of variations on the theme, though.
> > So, the actual implementation could vary. But, having something like
> > this could support a large number of use cases beyond just the
> > histogram case.
> >
> > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<[email protected]> wrote:
> >
> >> Trivially. We could do something more intelligent like also cache it in
> >> metadata (updating with compactions). Don't read too much into the
> >> implementation at this point; it was just the first idea I had about
> >> how we could do it :). I'm more concerned with the idea and its security
> >> implications right now.
> >>
> >> In general, it seems like people are ok with it protected by a new
> >> permission role. Do you have more to add, Mike? Was your comment based
> >> on your interpretation of how Accumulo works or more a concern about
> >> implementing such a feature?
> >>
> >> On Oct 11, 2016 21:29,<[email protected]>  wrote:
> >>
> >>> So, to get the set of visibilities used in a table, we would have to
> >>> open all of the rfiles?
> >>>
> >>>> -----Original Message-----
> >>>> From: Dylan Hutchison [mailto:[email protected]]
> >>>> Sent: Tuesday, October 11, 2016 3:43 PM
> >>>> To: Accumulo Dev List
> >>>> Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> >>>> harmful?
> >>>>
> >>>> Interesting idea.  It begs the question: should we allow any custom
> >>>> index at the RFile level?  If RFile indexes were user-extensible, then
> >>>> a visibility index would be something any developer could write.  That
> >>>> said, we can still include such an index as an example, and if we did
> >>>> it could be used by the Accumulo monitor.
> >>>>
> >>>> The RFile-level sampling followed this path.  I would support further
> >>>> work similar to it, though I admit I don't know how difficult a job it
> >>>> entails. Bonus points if the index information could be accessed from
> >>>> iterators the same way that sampled data can.
> >>>>
> >>>> I can't speak to the appropriateness of visibility histograms on the
> >>>> monitor *by default*, but it would be a strictly useful feature if it
> >>>> could be enabled via a conf option.
> >>>>
> >>>>
> >>>> On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<[email protected]> wrote:
> >>>>> Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
> >>>>> mentioned was the lack of insight into the distribution of data marked
> >>>>> with certain visibilities in a table. He presented an example similar
> >>>>> to this:
> >>>>>
> >>>>> Imagine a hypothetical system backed by Accumulo which stores medical
> >>>>> information. There are three labels in the system: PRIVATE,
> >>>>> ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably be
> >>>>> considered to identify the individual. ANONYMIZED data is some altered
> >>>>> version of the attribute that retains some portion of the original
> >>>>> value, but is missing enough context to not identify the individual
> >>>>> (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data is for
> >>>>> attributes which cannot identify the individual.
> >>>>>
> >>>>> Doctors would be able to read the PRIVATE data, while researchers
> >>>>> could only read the ANONYMIZED and PUBLIC data. This leads to a
> >>>>> question: how much of each kind of data is in the system? Without
> >>>>> knowing how much data is in the system, how can some application
> >>>>> developer (who does not have the ability to read all of the PRIVATE
> >>>>> data) know that their application is returning a reasonably correct
> >>>>> amount of data? (there are many examples of questions which could be
> >>>>> answered on this data alone)
> >>>>>
> >>>>> Concretely, this histogram would look like (50 records with PRIVATE,
> >>>>> 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> >>>>>
> >>>>> ```
> >>>>> PRIVATE: 50
> >>>>> ANONYMIZED: 50
> >>>>> PUBLIC: 20
> >>>>> ```
> >>>>>
> >>>>> Technically, I think this would actually be relatively simple to
> >>>>> implement. Inside of each RFile, we could maintain some histogram of
> >>>>> the visibilities observed in that file. This would allow us to very
> >>>>> easily report how much data in each table has each visibility label.
> >>>>>
> >>>>> However, would this feature be harmful to one of the core tenets of
> >>>>> Accumulo? Or, is acknowledging the existence of data in Accumulo with
> >>>>> a certain visibility acceptable? Would a new permission to use such an
> >>>>> API to access this information be sufficient to protect the data?
> >>>>>
> >>>>> - Josh
> >>>>>
> >>>
> >
>
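
For readers following the counter proposal quoted above, here is a minimal
Java sketch of the idea. It is not an API anyone in the thread committed to,
and it does not touch the permission question. The names VisibilityCounters,
Cell, BY_VISIBILITY, countFile and mergeFiles are all hypothetical, and Cell
is a stand-in for Accumulo's Entry<Key,Value> so the example compiles without
Accumulo on the classpath. It shows a per-entry counter function, a per-file
fold of its output, and a merge of per-file counts into a table-level total.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class VisibilityCounters {

    /** Hypothetical stand-in for an Accumulo Entry<Key,Value>. */
    static final class Cell {
        final String visibility;
        final String value;

        Cell(String visibility, String value) {
            this.visibility = visibility;
            this.value = value;
        }
    }

    /** One possible counter: emit a count of 1 under the cell's visibility label. */
    static final Function<Cell, Map<String, Long>> BY_VISIBILITY =
            cell -> Collections.singletonMap(cell.visibility, 1L);

    /** Fold the counter's output over every cell written to one file. */
    static Map<String, Long> countFile(List<Cell> cells,
                                       Function<Cell, Map<String, Long>> counter) {
        Map<String, Long> totals = new HashMap<>();
        for (Cell cell : cells) {
            counter.apply(cell).forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }

    /** Sum per-file counts into tablet-level or table-level counts. */
    static Map<String, Long> mergeFiles(List<Map<String, Long>> perFile) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> counts : perFile) {
            counts.forEach((name, n) -> totals.merge(name, n, Long::sum));
        }
        return totals;
    }
}
```

Because the per-file counts are plain sums, they can be computed once when a
file is written and merged later without rereading the data, which is one way
to avoid scanning every entry to answer the "open all of the rfiles" question.
A different counter function (for example, one counting delete entries) would
plug into countFile unchanged.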
