We did discuss making this information available through the public API (and adding Thrift calls to gather it). We also discussed the possibility of adding a new permission.
On Wed, Oct 12, 2016 at 2:35 PM, ivan bella <[email protected]> wrote:
> I do not see how this invalidates any security of the system unless you are
> summarizing these counters and making them available through a thrift or
> other call; don't do that unless other security is put in place. To get a
> summary I would think you would have to use a separate utility to scrape the
> rfiles. This metadata should only be accessible to a system administrator.
> The BIG presumption here is that it is significantly faster to grab this
> metadata out than it is to scan all of the keys in the rfile.
>
>> On October 12, 2016 at 1:41 PM Josh Elser <[email protected]> wrote:
>>
>> Thanks, Marc. Follow-on question(s) for you:
>>
>> Do you think _any_ such approach should never be pursued by Accumulo
>> (reading into your other replies about doing it outside of Accumulo)?
>> Are the permissions that we have in place not sufficient to protect such
>> "metadata"?
>>
>> Or, would such a feature be "OK" to you if it required some degree of
>> additional manual steps by the administrator? (if so, what steps do you
>> think make this acceptable)
>>
>> In a similar vein, how do you see this broadening the scope of the
>> Accumulo security model in an invalid manner? e.g. Administrators should
>> never be able to see such information. Someone with sufficient access to
>> a system would already be able to bypass Accumulo's security mechanisms.
>> There are a number of vectors already where a sufficiently-credentialed
>> individual could figure out this information (and more).
>>
>> Ultimately, I see Accumulo's main security tenet as "users should never
>> be allowed to see more data than they are authorized to see". Maybe it's
>> my interpretation of that or the scope of how you think the proposed
>> feature would function, but I'd be very interested in hearing more about
>> what you think.
>>
>> Marc P. wrote:
>>
>> > My point for discussing implementation outside of accumulo is because I
>> > think it does invalidate a core tenet
>> >
>> > On Wed, Oct 12, 2016, 12:26 PM Josh Elser<[email protected]> wrote:
>> >
>> > > Again, can we please bring this discussion back from discussions of
>> > > implementations to security?
>> > >
>> > > Does the fact that you three were discussing implementations imply that
>> > > you do not think this invalidates one of the core tenets (security
>> > > first) of Accumulo?
>> > >
>> > > Christopher wrote:
>> > >
>> > > > Keith, Russ, myself (and possibly others) were discussing this at the
>> > > > hackathon after the Accumulo Summit, and I think our consensus was
>> > > > basically this:
>> > > >
>> > > > We need a generic pluggable mechanism for injecting arbitrary user
>> > > > counters into the RFiles. We can then use these counters in custom
>> > > > compaction strategies, or other analysis. We can aggregate these
>> > > > counters at the tablet and table levels, and expose them in the API.
>> > > >
>> > > > These counters could store information about visibility frequencies,
>> > > > number of delete entries, etc.
>> > > >
>> > > > The interface might just be a
>> > > > Function<Entry<Key,Value>,Map<String,Long>>.
>> > > > In the discussion, there were lots of variations on the theme,
>> > > > though. So, the actual implementation could vary. But, having
>> > > > something like this could support a large number of use cases beyond
>> > > > just the histogram case.
>> > > >
>> > > > On Tue, Oct 11, 2016 at 10:06 PM Josh Elser<[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Trivially. We could do something more intelligent like also cache
>> > > > > it in metadata (updating with compactions). Don't read too much
>> > > > > into the implementation at this point; it was just the first idea
>> > > > > I had about how we could do it :). I'm more concerned with the
>> > > > > idea and its security implications right now.
>> > > > >
>> > > > > In general, it seems like people are ok with it protected by a new
>> > > > > permission role. Do you have more to add, Mike? Was your comment
>> > > > > based on your interpretation of how Accumulo works or more a
>> > > > > concern about implementing such a feature?
>> > > > >
>> > > > > On Oct 11, 2016 21:29,<[email protected]> wrote:
>> > > > >
>> > > > > > So, to get the set of visibilities used in a table, we would
>> > > > > > have to open all of the rfiles?
>> > > > > >
>> > > > > > > -----Original Message-----
>> > > > > > > From: Dylan Hutchison [mailto:[email protected]]
>> > > > > > > Sent: Tuesday, October 11, 2016 3:43 PM
>> > > > > > > To: Accumulo Dev List
>> > > > > > > Subject: Re: [DISCUSS] Would a visibility histogram on a table
>> > > > > > > be harmful?
>> > > > > > >
>> > > > > > > Interesting idea. It begs the question: should we allow any
>> > > > > > > custom index at the RFile level? If RFile indexes were
>> > > > > > > user-extensible, then a visibility index would be something
>> > > > > > > any developer could write. That said, we can still include
>> > > > > > > such an index as an example, and if we did it could be used by
>> > > > > > > the Accumulo monitor.
>> > > > > > >
>> > > > > > > The RFile-level sampling followed this path. I would support
>> > > > > > > further work similar to it, though I admit I don't know how
>> > > > > > > difficult a job it entails. Bonus points if the index
>> > > > > > > information could be accessed from iterators the same way that
>> > > > > > > sampled data can.
>> > > > > > >
>> > > > > > > I can't speak to the appropriateness of visibility histograms
>> > > > > > > on the monitor *by default*, but it would be a strictly useful
>> > > > > > > feature if it could be enabled via a conf option.
>> > > > > > >
>> > > > > > > On Tue, Oct 11, 2016 at 12:20 PM, Josh
>> > > > > > > Elser<[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Today at Accumulo Summit, our own Russ Weeks gave a talk.
>> > > > > > > > One topic he mentioned was the lack of insight into the
>> > > > > > > > distribution of data marked with certain visibilities in a
>> > > > > > > > table. He presented an example similar to this:
>> > > > > > > >
>> > > > > > > > Imagine a hypothetical system backed by Accumulo which
>> > > > > > > > stores medical information. There are three labels in the
>> > > > > > > > system: PRIVATE, ANONYMIZED, and PUBLIC. PRIVATE data is
>> > > > > > > > that which could reasonably be considered to identify the
>> > > > > > > > individual. ANONYMIZED data is some altered version of the
>> > > > > > > > attribute that retains some portion of the original value,
>> > > > > > > > but is missing enough context to not identify the individual
>> > > > > > > > (e.g. converting the name "Josh Elser" to "J E"). PUBLIC
>> > > > > > > > data is for attributes which cannot identify the individual.
>> > > > > > > >
>> > > > > > > > Doctors would be able to read the PRIVATE data, while
>> > > > > > > > researchers could only read the ANONYMIZED and PUBLIC data.
>> > > > > > > > This leads to a question: how much of each kind of data is
>> > > > > > > > in the system? Without knowing how much data is in the
>> > > > > > > > system, how can some application developer (who does not
>> > > > > > > > have the ability to read all of the PRIVATE data) know that
>> > > > > > > > their application is returning a reasonably correct amount
>> > > > > > > > of data? (there are many examples of questions which could
>> > > > > > > > be answered on this data alone)
>> > > > > > > >
>> > > > > > > > Concretely, this histogram would look like (50 records with
>> > > > > > > > PRIVATE, 50 with ANONYMIZED, and 20 with PUBLIC; 120 records
>> > > > > > > > total):
>> > > > > > > >
>> > > > > > > > PRIVATE: 50
>> > > > > > > > ANONYMIZED: 50
>> > > > > > > > PUBLIC: 20
>> > > > > > > >
>> > > > > > > > Technically, I think this would actually be relatively
>> > > > > > > > simple to implement. Inside of each RFile, we could maintain
>> > > > > > > > some histogram of the visibilities observed in that file.
>> > > > > > > > This would allow us to very easily report how much data in
>> > > > > > > > each table has each visibility label.
>> > > > > > > >
>> > > > > > > > However, would this feature be harmful to one of the core
>> > > > > > > > tenets of Accumulo? Or, is acknowledging the existence of
>> > > > > > > > data in Accumulo with a certain visibility acceptable? Would
>> > > > > > > > a new permission to use such an API to access this
>> > > > > > > > information be sufficient to protect the data?
>> > > > > > > >
>> > > > > > > > - Josh

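For readers following the thread, here is a minimal sketch of the idea in Josh's original message (a histogram of visibilities kept per RFile, then summed per table) combined with the counter function Christopher mentions. It is only an illustration under stated assumptions: plain strings stand in for Accumulo's Key and Value types so the example stays self-contained, and every name in it (VisibilityHistogramSketch, COUNT_BY_VISIBILITY, histogramForFile, mergeHistograms) is hypothetical and not part of Accumulo. It also says nothing about the permission question, which is the actual subject of the thread.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in Accumulo.
final class VisibilityHistogramSketch {

    // Stand-in for Function<Entry<Key,Value>,Map<String,Long>>. Assumption:
    // an entry here is just (visibility label, value) rather than a real Key/Value.
    static final Function<Map.Entry<String, String>, Map<String, Long>> COUNT_BY_VISIBILITY =
            entry -> Collections.singletonMap(entry.getKey(), 1L);

    // Convenience constructor for a (visibility label, value) pair.
    static Map.Entry<String, String> entry(String visibility, String value) {
        return new AbstractMap.SimpleImmutableEntry<>(visibility, value);
    }

    // Builds the histogram for one file by applying the counter function
    // to every entry and summing the counters it emits.
    static Map<String, Long> histogramForFile(List<Map.Entry<String, String>> entries) {
        Map<String, Long> histogram = new TreeMap<>();
        for (Map.Entry<String, String> e : entries) {
            COUNT_BY_VISIBILITY.apply(e)
                    .forEach((label, n) -> histogram.merge(label, n, Long::sum));
        }
        return histogram;
    }

    // Sums per-file histograms into one table-level histogram.
    static Map<String, Long> mergeHistograms(List<Map<String, Long>> perFile) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> h : perFile) {
            h.forEach((label, n) -> total.merge(label, n, Long::sum));
        }
        return total;
    }
}
```

Because the per-file maps are merged by simple addition, a table-level answer in this sketch would only need each file's stored counters rather than a scan of every key, which is the presumption ivan raises above.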