Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Keith Turner Wed, 12 Oct 2016 10:12:03 -0700

On Wed, Oct 12, 2016 at 12:56 AM, Christopher <[email protected]> wrote:
> Keith, Russ, myself (and possibly others) were discussing this at the
> hackathon after the Accumulo Summit, and I think our consensus was
> basically this:
>
> We need a generic pluggable mechanism for injecting arbitrary user counters
> into the RFiles. We can then use these counters in custom compaction
> strategies, or other analysis. We can aggregate these counters at the
> tablet, and table levels, and expose them in the API.
>
> These counters could store information about visibility frequencies, number
> of delete entries, etc.
>
> The interface might just be a Function<Entry<Key,Value>,Map<String, Long>>.


One thing I discussed with Russ was following MapReduce's design for
counters in order to avoid object allocation.  Something like the
following would avoid allocating a map to return.

interface Counters {
  void increment(ByteSequence counter, long amount);
}

interface Summarizer {
  void summarize(Key k, Value v, Counters counters);
}
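
As a rough illustration of how the visibility histogram from this thread
might sit on top of interfaces shaped like these, here is a minimal sketch.
Everything in it is an assumption made for illustration: the class names are
hypothetical, a String counter name stands in for ByteSequence, and a plain
visibility string plus a byte[] stand in for Accumulo's Key and Value so that
the sketch compiles without Accumulo on the classpath.

```java
import java.util.Map;
import java.util.TreeMap;

public class VisibilityHistogramSketch {

  // Simplified stand-in for the proposed Counters interface; a String name
  // replaces ByteSequence to keep the sketch free of Accumulo dependencies.
  interface Counters {
    void increment(String counter, long amount);
  }

  // Hypothetical stand-in for the proposed Summarizer; the visibility string
  // and value bytes replace Accumulo's Key and Value.
  interface Summarizer {
    void summarize(String visibility, byte[] value, Counters counters);
  }

  // Counters backed by a sorted map. Merging two of these is one way
  // per-file counts could roll up to tablet and table totals.
  static final class MapCounters implements Counters {
    final Map<String, Long> counts = new TreeMap<>();

    @Override
    public void increment(String counter, long amount) {
      counts.merge(counter, amount, Long::sum);
    }

    void mergeFrom(MapCounters other) {
      other.counts.forEach(this::increment);
    }
  }

  // Adds one to the counter named after the entry's visibility label.
  // Nothing is returned, so no map is allocated per entry.
  static final class VisibilitySummarizer implements Summarizer {
    @Override
    public void summarize(String visibility, byte[] value, Counters counters) {
      counters.increment(visibility, 1L);
    }
  }

  // Summarizes the visibilities of one pretend file.
  static MapCounters summarizeFile(Summarizer s, String... visibilities) {
    MapCounters c = new MapCounters();
    for (String vis : visibilities) {
      s.summarize(vis, new byte[0], c);
    }
    return c;
  }

  public static void main(String[] args) {
    Summarizer s = new VisibilitySummarizer();
    MapCounters fileA = summarizeFile(s, "PRIVATE", "PUBLIC", "PRIVATE");
    MapCounters fileB = summarizeFile(s, "ANONYMIZED", "PUBLIC");

    // Table-level totals are the sum of the per-file counters.
    MapCounters table = new MapCounters();
    table.mergeFrom(fileA);
    table.mergeFrom(fileB);
    System.out.println(table.counts); // {ANONYMIZED=1, PRIVATE=2, PUBLIC=2}
  }
}
```

Because each counter is a simple sum, per-file results can be combined in
any order, which is what would make aggregation at the tablet and table
levels straightforward.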


>
> In the discussion, there were lots of variations on the theme, though. So,
> the actual implementation could vary. But, having something like this could
> support a large number of use cases beyond just the histogram case.
>
> On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <[email protected]> wrote:
>
>> Trivially. We could do something more intelligent like also cache it in
>> metadata (updating with compactions). Don't read too much into the
>> implementation at this point; it was just the first idea I had about how we
>> could do it :). I'm more concerned with the idea and its security
>> implications right now.
>>
>> In general, it seems like people are ok with it protected by a new
>> permission role. Do you have more to add, Mike? Was your comment based on
>> your interpretation of how Accumulo works or more a concern about
>> implementing such a feature?
>>
>> On Oct 11, 2016 21:29, <[email protected]> wrote:
>>
>> > So, to get the set of visibilities used in a table, we would have to open
>> > all of the rfiles?
>> >
>> > > -----Original Message-----
>> > > From: Dylan Hutchison [mailto:[email protected]]
>> > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > To: Accumulo Dev List
>> > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
>> > >
>> > > Interesting idea.  It begs the question: should we allow any custom
>> > > index at the RFile level?  If RFile indexes were user-extensible, then
>> > > a visibility index would be something any developer could write.  That
>> > > said, we can still include such an index as an example, and if we did
>> > > it could be used by the Accumulo monitor.
>> > >
>> > > The RFile-level sampling followed this path.  I would support further
>> > > work similar to it, though I admit I don't know how difficult a job it
>> > > entails.  Bonus points if the index information could be accessed from
>> > > iterators the same way that sampled data can.
>> > >
>> > > I can't speak to the appropriateness of visibility histograms on the
>> > > monitor *by default*, but it would be a strictly useful feature if it
>> > > could be enabled via a conf option.
>> > >
>> > >
>> > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <[email protected]> wrote:
>> > >
>> > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic
>> > > > he mentioned was the lack of insight into the distribution of data
>> > > > marked with certain visibilities in a table. He presented an example
>> > > > similar to this:
>> > > >
>> > > > Imagine a hypothetical system backed by Accumulo which stores medical
>> > > > information. There are three labels in the system: PRIVATE,
>> > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could reasonably
>> > > > be considered to identify the individual. ANONYMIZED data is some
>> > > > altered version of the attribute that retains some portion of the
>> > > > original value, but is missing enough context to not identify the
>> > > > individual (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>> > > > data is for attributes which cannot identify the individual.
>> > > >
>> > > > Doctors would be able to read the PRIVATE data, while researchers
>> > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
>> > > > question: how much of each kind of data is in the system? Without
>> > > > knowing how much data is in the system, how can some application
>> > > > developer (who does not have the ability to read all of the PRIVATE
>> > > > data) know that their application is returning a reasonably correct
>> > > > amount of data? (There are many examples of questions which could be
>> > > > answered from this data alone.)
>> > > >
>> > > > Concretely, this histogram would look like (50 records with PRIVATE,
>> > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
>> > > >
>> > > > ```
>> > > > PRIVATE: 50
>> > > > ANONYMIZED: 50
>> > > > PUBLIC: 20
>> > > > ```
>> > > >
>> > > > Technically, I think this would actually be relatively simple to
>> > > > implement. Inside of each RFile, we could maintain some histogram of
>> > > > the visibilities observed in that file. This would allow us to very
>> > > > easily report how much data in each table has each visibility label.
>> > > >
>> > > > However, would this feature be harmful to one of the core tenets of
>> > > > Accumulo? Or, is acknowledging the existence of data in Accumulo with
>> > > > a certain visibility acceptable? Would a new permission to use such
>> > > > an API to access this information be sufficient to protect the data?
>> > > >
>> > > > - Josh
>> > > >
>> >
>> >
>>
