Beyond adding a tool on the side. It doesn't fit in metadata, as that requires aggregated reads, whereas a table aggregates the data.
On Wed, Oct 12, 2016, 11:02 AM Marc P. <marc.par...@gmail.com> wrote:
> How does it increase ease of use?
>
> On Wed, Oct 12, 2016, 10:34 AM ivan bella <i...@ivan.bella.name> wrote:
>
> Yes, the "owners" could create a visibility counting mechanism separately;
> however, if we make this RFile metadata a part of the system then we
> increase the "ease of use". Unfortunately, system designers rarely think
> about the metadata they need from their system up front. That being said,
> if the performance impact of this is significant then it needs to be made
> optional or we leave it as is.
>
> > On October 12, 2016 at 7:12 AM "Marc P." <marc.par...@gmail.com> wrote:
> >
> > What prevents the owners of the system from doing this in their own
> > table? Keeping track of that information is a use case of Accumulo. I
> > think this may be an example of external code that the user must
> > install. Placing the onus on the consumer mitigates concern that Mike
> > "Mike" Drob and others may have.
> >
> > A new role wouldn't be needed if permissions were placed on the
> > user/table/namespace that stored this information, correct?
> >
> > On Wed, Oct 12, 2016 at 12:56 AM, Christopher <ctubb...@apache.org> wrote:
> >
> > > Keith, Russ, myself (and possibly others) were discussing this at the
> > > hackathon after the Accumulo Summit, and I think our consensus was
> > > basically this:
> > >
> > > We need a generic pluggable mechanism for injecting arbitrary user
> > > counters into the RFiles. We can then use these counters in custom
> > > compaction strategies, or other analysis. We can aggregate these
> > > counters at the tablet and table levels, and expose them in the API.
> > >
> > > These counters could store information about visibility frequencies,
> > > number of delete entries, etc.
> > >
> > > The interface might just be a Function<Entry<Key,Value>,Map<String,Long>>.
> > >
> > > In the discussion, there were lots of variations on the theme, though.
> > > So, the actual implementation could vary. But, having something like
> > > this could support a large number of use cases beyond just the
> > > histogram case.
> > >
> > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser <josh.el...@gmail.com> wrote:
> > >
> > > > Trivially. We could do something more intelligent like also cache it
> > > > in metadata (updating with compactions). Don't read too much into the
> > > > implementation at this point; it was just the first idea I had about
> > > > how we could do it :). I'm more concerned with the idea and its
> > > > security implications right now.
> > > >
> > > > In general, it seems like people are ok with it protected by a new
> > > > permission role. Do you have more to add, Mike? Was your comment
> > > > based on your interpretation of how Accumulo works or more a concern
> > > > about implementing such a feature?
> > > >
> > > > On Oct 11, 2016 21:29, <dlmar...@comcast.net> wrote:
> > > >
> > > > > So, to get the set of visibilities used in a table, we would have
> > > > > to open all of the rfiles?
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Dylan Hutchison [mailto:dhutc...@cs.washington.edu]
> > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
> > > > > > To: Accumulo Dev List
> > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table be harmful?
> > > > > >
> > > > > > Interesting idea. It begs the question: should we allow any custom
> > > > > > index at the RFile level? If RFile indexes were user-extensible,
> > > > > > then a visibility index would be something any developer could
> > > > > > write. That said, we can still include such an index as an
> > > > > > example, and if we did it could be used by the Accumulo monitor.
> > > > > >
> > > > > > The RFile-level sampling followed this path.
> > > > > > I would support further work similar to it, though I admit I
> > > > > > don't know how difficult a job it entails. Bonus points if the
> > > > > > index information could be accessed from iterators the same way
> > > > > > that sampled data can.
> > > > > >
> > > > > > I can't speak to the appropriateness of visibility histograms on
> > > > > > the monitor *by default*, but it would be a strictly useful
> > > > > > feature if it could be enabled via a conf option.
> > > > > >
> > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser <josh.el...@gmail.com> wrote:
> > > > > >
> > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk. One
> > > > > > > topic he mentioned was the lack of insight into the distribution
> > > > > > > of data marked with certain visibilities in a table. He
> > > > > > > presented an example similar to this:
> > > > > > >
> > > > > > > Imagine a hypothetical system backed by Accumulo which stores
> > > > > > > medical information. There are three labels in the system:
> > > > > > > PRIVATE, ANONYMIZED, and PUBLIC. PRIVATE data is that which could
> > > > > > > reasonably be considered to identify the individual. ANONYMIZED
> > > > > > > data is some altered version of the attribute that retains some
> > > > > > > portion of the original value, but is missing enough context to
> > > > > > > not identify the individual (e.g. converting the name "Josh
> > > > > > > Elser" to "J E"). PUBLIC data is for attributes which cannot
> > > > > > > identify the individual.
> > > > > > >
> > > > > > > Doctors would be able to read the PRIVATE data, while
> > > > > > > researchers could only read the ANONYMIZED and PUBLIC data. This
> > > > > > > leads to a question: how much of each kind of data is in the
> > > > > > > system?
> > > > > > > Without knowing how much data is in the system, how can some
> > > > > > > application developer (who does not have the ability to read all
> > > > > > > of the PRIVATE data) know that their application is returning a
> > > > > > > reasonably correct amount of data? (There are many examples of
> > > > > > > questions which could be answered on this data alone.)
> > > > > > >
> > > > > > > Concretely, this histogram would look like (50 records with
> > > > > > > PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records
> > > > > > > total):
> > > > > > >
> > > > > > > ```
> > > > > > > PRIVATE: 50
> > > > > > > ANONYMIZED: 50
> > > > > > > PUBLIC: 20
> > > > > > > ```
> > > > > > >
> > > > > > > Technically, I think this would actually be relatively simple to
> > > > > > > implement. Inside of each RFile, we could maintain some histogram
> > > > > > > of the visibilities observed in that file. This would allow us to
> > > > > > > very easily report how much data in each table has each
> > > > > > > visibility label.
> > > > > > >
> > > > > > > However, would this feature be harmful to one of the core tenets
> > > > > > > of Accumulo? Or, is acknowledging the existence of data in
> > > > > > > Accumulo with a certain visibility acceptable? Would a new
> > > > > > > permission to use such an API to access this information be
> > > > > > > sufficient to protect the data?
> > > > > > >
> > > > > > > - Josh
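
For illustration, the per-RFile histogram from the original message, driven by a per-entry counter function of the shape floated upthread, might look roughly like the sketch below. It is not Accumulo code and makes several assumptions: `VisibilityHistogramSketch`, `Cell`, `countFile`, `merge`, and `cells` are hypothetical names, a plain string stands in for a key's column visibility, a `List<Cell>` stands in for an RFile, and the counts are of entries as written (deletes and versioning are ignored).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class VisibilityHistogramSketch {

  // Hypothetical stand-in for an Accumulo entry; only the visibility label is used.
  static final class Cell {
    final String row;
    final String visibility;

    Cell(String row, String visibility) {
      this.row = row;
      this.visibility = visibility;
    }
  }

  // Per-entry counter: one entry in, named counts out (here, one count for its label).
  static final Function<Cell, Map<String, Long>> VISIBILITY_COUNTER =
      cell -> Collections.singletonMap(cell.visibility, 1L);

  // Builds the histogram that would be kept alongside a single file.
  static Map<String, Long> countFile(List<Cell> file,
                                     Function<Cell, Map<String, Long>> counter) {
    Map<String, Long> histogram = new TreeMap<>();
    for (Cell cell : file) {
      counter.apply(cell).forEach((name, n) -> histogram.merge(name, n, Long::sum));
    }
    return histogram;
  }

  // Sums per-file histograms into a table-level view without re-reading entries.
  static Map<String, Long> merge(List<Map<String, Long>> perFile) {
    Map<String, Long> total = new TreeMap<>();
    for (Map<String, Long> histogram : perFile) {
      histogram.forEach((name, n) -> total.merge(name, n, Long::sum));
    }
    return total;
  }

  // Demo helper: n cells that all carry the same visibility label.
  static List<Cell> cells(String visibility, int n) {
    List<Cell> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      out.add(new Cell(visibility + "-" + i, visibility));
    }
    return out;
  }

  public static void main(String[] args) {
    // Two pretend files that together hold the 120 records from the example.
    List<Cell> fileA = new ArrayList<>(cells("PRIVATE", 30));
    fileA.addAll(cells("ANONYMIZED", 50));
    List<Cell> fileB = new ArrayList<>(cells("PRIVATE", 20));
    fileB.addAll(cells("PUBLIC", 20));

    Map<String, Long> table = merge(Arrays.asList(
        countFile(fileA, VISIBILITY_COUNTER),
        countFile(fileB, VISIBILITY_COUNTER)));

    System.out.println(table); // prints {ANONYMIZED=50, PRIVATE=50, PUBLIC=20}
  }
}
```

Because merging is plain addition, the per-file maps could in principle be cached or combined at the tablet and table levels, as suggested in the thread, rather than recomputed from entries. Who is permitted to read the merged result is the open security question and is not addressed by this sketch.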