Re: [DISCUSS] Would a visibility histogram on a table be harmful?

Marc P. Wed, 12 Oct 2016 08:05:09 -0700

Beyond adding a tool on the side, that is. It doesn't fit in metadata, as
that requires aggregated reads, whereas a table aggregates the data itself.


On Wed, Oct 12, 2016, 11:02 AM Marc P. <[email protected]> wrote:

> How does it increase ease of use?
>
> On Wed, Oct 12, 2016, 10:34 AM ivan bella <[email protected]> wrote:
>
> Yes the "owners" could create a visibility counting mechanism separately,
> however if we make this RFile metadata a part of the system then we
> increase the "ease of use". Unfortunately, system designers rarely think
> about the metadata they need from their system up front. That being said,
> if the performance impact of this is significant then it needs to be made
> optional or we leave it as is.
>
>
> > On October 12, 2016 at 7:12 AM "Marc P." <[email protected]> wrote:
> >
> >
> > What prevents the owners of the system from doing this in their own
> table?
> > Keeping track of that information is a use case of Accumulo. I think this
> > may be an example of external code that the user must install. Placing
> the
> > onus on the consumer mitigates concern that Mike "Mike" Drob and others
> may
> > have.
> >
> > A new role wouldn't be needed if permissions were placed on the
> > user/table/namespace that stored this information, correct?
> >
> > On Wed, Oct 12, 2016 at 12:56 AM, Christopher <[email protected]>
> wrote:
> >
> > > > Keith, Russ, myself (and possibly others) were discussing this at the
> > > > hackathon after the Accumulo Summit, and I think our consensus was
> > > basically this:
> > >
> > > We need a generic pluggable mechanism for injecting arbitrary user
> counters
> > > into the RFiles. We can then use these counters in custom compaction
> > > strategies, or other analysis. We can aggregate these counters at the
> > > tablet, and table levels, and expose them in the API.
> > >
> > > These counters could store information about visibility frequencies,
> number
> > > of delete entries, etc.
> > >
> > > The interface might just be a Function<Entry<Key,Value>,Map<String,
> > > Long>>.
> > >
> > > In the discussion, there were lots of variations on the theme, though.
> So,
> > > the actual implementation could vary. But, having something like this
> could
> > > support a large number of use cases beyond just the histogram case.
> > >
> > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <[email protected]>
> wrote:
> > >
> > > > Trivially. We could do something more intelligent like also cache it
> in
> > > > metadata (updating with compactions). Don't read too much into the
> > > > implementation at this point; it was just the first idea I had about
> how
> > > we
> > > > could do it :). I'm more concerned with the idea and its security
> > > > implications right now.
> > > >
> > > > In general, it seems like people are ok with it protected by a new
> > > > permission role. Do you have more to add, Mike? Was your comment
> based on
> > > > your interpretation of how Accumulo works or more a concern about
> > > > implementing such a feature?
> > > >
> > > > On Oct 11, 2016 21:29, <[email protected]> wrote:
> > > >
> > > > > So, to get the set of visibilities used in a table, we would have
> to
> > > open
> > > > > all of the rfiles?
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Dylan Hutchison [mailto:[email protected]]
> > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
> > > > > > To: Accumulo Dev List
> > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be
> > > > > harmful?
> > > > > >
> > > > > > Interesting idea. It begs the question: should we allow any
> custom
> > > > > index at
> > > > > > the RFile level? If RFile indexes were user-extensible, then a
> > > > > visibility index
> > > > > > would be something any developer could write. That said, we can
> > > still
> > > > > > include such an index as an example, and if we did it could be
> used
> > > by
> > > > > the
> > > > > > Accumulo monitor.
> > > > > >
> > > > > > The RFile-level sampling followed this path. I would support
> further
> > > > > work
> > > > > > similar to it, though I admit I don't know how difficult a job it
> > > > > entails.
> > > > > > Bonus points if the index information could be accessed from
> > > iterators
> > > > > the
> > > > > > same way that sampled data can.
> > > > > >
> > > > > > I can't speak to the appropriateness of visibility histograms on
> the
> > > > > monitor
> > > > > > *by default*, but it would be a strictly useful feature if it
> could
> > > be
> > > > > enabled via
> > > > > > a conf option.
> > > > > >
> > > > > >
> > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <
> [email protected]>
> > > > > wrote:
> > > > > >
> > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One
> topic
> > > > he
> > > > > > > mentioned was the lack of insight into the distribution of data
> > > > marked
> > > > > > > with certain visibilities in a table. He presented an example
> > > similar
> > > > > to this:
> > > > > > >
> > > > > > > > Imagine a hypothetical system backed by Accumulo which stores
> medical
> > > > > > > information. There are three labels in the system: PRIVATE,
> > > > > > > ANONYMIZED, and PUBLIC. PRIVATE data is that which could
> reasonably
> > > > be
> > > > > > > considered to identify the individual. ANONYMIZED data is some
> > > > altered
> > > > > > > version of the attribute that retains some portion of the
> original
> > > > > > > value, but is missing enough context to not identify the
> individual
> > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC data
> is
> > > for
> > > > > > > > attributes which cannot identify the individual.
> > > > > > >
> > > > > > > Doctors would be able to read the PRIVATE data, while
> researchers
> > > > > > > could only read the ANONYMIZED and PUBLIC data. This leads to a
> > > > > > > question: how much of each kind of data is in the system?
> Without
> > > > > > > knowing how much data is in the system, how can some
> application
> > > > > > > developer (who does not have the ability to read all of the
> PRIVATE
> > > > > > > > data) know that their application is returning a reasonably
> > > correct
> > > > > > > amount of data? (there are many examples of questions which
> could
> > > be
> > > > > > > > answered on this data alone)
> > > > > > >
> > > > > > > Concretely, this histogram would look like (50 records with
> > > PRIVATE,
> > > > > > > 50 with ANONYMIZED, and 20 with PUBLIC; 120 records total):
> > > > > > >
> > > > > > > ```
> > > > > > > PRIVATE: 50
> > > > > > > ANONYMIZED: 50
> > > > > > > PUBLIC: 20
> > > > > > > ```
> > > > > > >
> > > > > > > Technically, I think this would actually be relatively simple
> to
> > > > > > > implement. Inside of each RFile, we could maintain some
> histogram
> > > of
> > > > > > > the visibilities observed in that file. This would allow us to
> very
> > > > > > > easily report how much data in each table has each visibility
> > > label.
> > > > > > >
> > > > > > > However, would this feature be harmful to one of the core
> tenets
> > > of
> > > > > > > Accumulo? Or, is acknowledging the existence of data in
> Accumulo
> > > with
> > > > > > > a certain visibility acceptable? Would a new permission to use
> such
> > > > an
> > > > > > > API to access this information be sufficient to protect the
> data?
> > > > > > >
> > > > > > > - Josh
> > > > > > >
> > > > >
> > > > >
> > > >
> > >
>
>
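The per-file bookkeeping described in the original proposal (a histogram kept
inside each RFile, then reported per table) can be illustrated with a small
Java sketch. This is not Accumulo code: the class name `HistogramMerge`, the
`merge` method, and the two example files are hypothetical, and the histograms
are plain maps from a visibility label to a count.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HistogramMerge {

  // Sums per-file visibility histograms into one table-level histogram.
  // Each input map stands in for the histogram stored with one file.
  static Map<String, Long> merge(List<Map<String, Long>> perFile) {
    Map<String, Long> table = new TreeMap<>();
    for (Map<String, Long> histogram : perFile) {
      histogram.forEach((label, count) -> table.merge(label, count, Long::sum));
    }
    return table;
  }

  public static void main(String[] args) {
    // Two hypothetical files whose counts add up to the example in the thread.
    Map<String, Long> fileA = Map.of("PRIVATE", 30L, "ANONYMIZED", 50L);
    Map<String, Long> fileB = Map.of("PRIVATE", 20L, "PUBLIC", 20L);

    System.out.println(merge(List.of(fileA, fileB)));
    // prints {ANONYMIZED=50, PRIVATE=50, PUBLIC=20}
  }
}
```

A plain sum like this would probably overcount whenever several files hold
older versions or deleted copies of the same key, so the totals are likely
best read as estimates until a compaction rewrites the files.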

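The pluggable counter idea from the hackathon discussion, a
Function<Entry<Key,Value>,Map<String,Long>>, can be sketched the same way.
Everything below is an assumption made for illustration: `SketchKey` is a
stand-in for Accumulo's Key with only the fields the example needs, values are
plain strings, and `countFile` is a hypothetical helper that plays the role of
whatever would call the function while a file is being written.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import java.util.function.Function;

public class CounterSketch {

  // Hypothetical stand-in for Accumulo's Key: only what this sketch needs.
  static final class SketchKey {
    final String row;
    final String visibility;
    final boolean deleted;

    SketchKey(String row, String visibility, boolean deleted) {
      this.row = row;
      this.visibility = visibility;
      this.deleted = deleted;
    }
  }

  // A counter plugin in the shape suggested in the thread:
  // one entry in, a map of counter names to increments out.
  static final Function<Entry<SketchKey, String>, Map<String, Long>> VISIBILITY_AND_DELETES =
      entry -> {
        Map<String, Long> increments = new TreeMap<>();
        increments.put("vis:" + entry.getKey().visibility, 1L);
        if (entry.getKey().deleted) {
          increments.put("deletes", 1L);
        }
        return increments;
      };

  // Applies the counter function to every entry of one file and sums the increments.
  static <K, V> Map<String, Long> countFile(
      Iterable<Entry<K, V>> entries, Function<Entry<K, V>, Map<String, Long>> counter) {
    Map<String, Long> totals = new TreeMap<>();
    for (Entry<K, V> e : entries) {
      counter.apply(e).forEach((name, n) -> totals.merge(name, n, Long::sum));
    }
    return totals;
  }

  // Convenience constructor for an example entry.
  static Entry<SketchKey, String> entry(String row, String vis, boolean deleted, String value) {
    return new SimpleImmutableEntry<>(new SketchKey(row, vis, deleted), value);
  }

  public static void main(String[] args) {
    List<Entry<SketchKey, String>> file = List.of(
        entry("r1", "PRIVATE", false, "Josh Elser"),
        entry("r1", "ANONYMIZED", false, "J E"),
        entry("r2", "PRIVATE", true, ""));

    System.out.println(countFile(file, VISIBILITY_AND_DELETES));
    // prints {deletes=1, vis:ANONYMIZED=1, vis:PRIVATE=2}
  }
}
```

Because each counter is just a named long, the per-file totals could be summed
again at the tablet and table levels, which is what would let one mechanism
cover visibility frequencies, delete counts, and similar statistics.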