Hah, funny you mention a custom RFile index. I think Adam Fuchs proposed a similar idea before (probably years ago now) :)

re: the monitor, I was more thinking that it would just be an API call to access it. I had not thought about automatically displaying it on the monitor (but it is an interesting idea...)

I remember making a ticket a while back to move the RFile header from a custom serialized object to a Thrift or Protobuf object, which would make handling such a drift in "schema" dirt-simple. Eventually there's a concern about putting too much data in there (probably reachable with a large number of visibilities -- implementation detail), but that's a related thought :)

Dylan Hutchison wrote:
Interesting idea.  It begs the question: should we allow any custom index
at the RFile level?  If RFile indexes were user-extensible, then a
visibility index would be something any developer could write.  That said,
we can still include such an index as an example, and if we did it could be
used by the Accumulo monitor.

The RFile-level sampling followed this path.  I would support further work
similar to it, though I admit I don't know how difficult a job it entails.
Bonus points if the index information could be accessed from iterators the
same way that sampled data can.

I can't speak to the appropriateness of visibility histograms on the
monitor *by default*, but it would be a strictly useful feature if it could
be enabled via a conf option.


On Tue, Oct 11, 2016 at 12:20 PM, Josh Elser<[email protected]>  wrote:

Today at Accumulo Summit, our own Russ Weeks gave a talk. One topic he
mentioned was the lack of insight into the distribution of data marked with
certain visibilities in a table. He presented an example similar to this:

Imagine a hypothetical system backed by Accumulo which stores medical
information. There are three labels in the system: PRIVATE, ANONYMIZED, and
PUBLIC. PRIVATE data is that which could reasonably be considered to
identify the individual. ANONYMIZED data is some altered version of the
attribute that retains some portion of the original value, but is missing
enough context to not identify the individual (e.g. converting the name
"Josh Elser" to "J E"). PUBLIC data is for attributes which cannot
identify the individual.

Doctors would be able to read the PRIVATE data, while researchers could
only read the ANONYMIZED and PUBLIC data. This leads to a question: how
much of each kind of data is in the system? Without knowing how much data
is in the system, how can some application developer (who does not have the
ability to read all of the PRIVATE data) know that their application is
returning a reasonably correct amount of data? (there are many examples of
questions which could be answered on this data alone)

Concretely, this histogram would look like (50 records with PRIVATE, 50
with ANONYMIZED, and 20 with PUBLIC; 120 records total):

```
PRIVATE: 50
ANONYMIZED: 50
PUBLIC: 20
```

Technically, I think this would actually be relatively simple to
implement. Inside of each RFile, we could maintain some histogram of the
visibilities observed in that file. This would allow us to very easily
report how much data in each table has each visibility label.
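
To make the idea above concrete, here is a minimal sketch, not Accumulo code: it assumes each file keeps a simple count per visibility string, and that a table-level report is just the sum of the per-file counts. The function names (`file_histogram`, `table_histogram`) and the two example files are hypothetical; a real implementation would live in the RFile writer and would have to decide how the counts are stored in the file.

```python
from collections import Counter


def file_histogram(visibilities):
    """Count entries per visibility string seen while writing one file."""
    return Counter(visibilities)


def table_histogram(per_file_histograms):
    """Sum per-file histograms into a table-wide histogram."""
    total = Counter()
    for hist in per_file_histograms:
        total.update(hist)  # Counter.update adds counts, it does not replace them
    return total


# Two hypothetical files whose counts add up to the example above.
file_a = file_histogram(["PRIVATE"] * 30 + ["ANONYMIZED"] * 50)
file_b = file_histogram(["PRIVATE"] * 20 + ["PUBLIC"] * 20)

table = table_histogram([file_a, file_b])
for label in ("PRIVATE", "ANONYMIZED", "PUBLIC"):
    print(f"{label}: {table[label]}")
```

One caveat with this kind of summing: entries that are deleted or superseded in a newer file would likely still be counted until the files are compacted together, so the table-level numbers should probably be treated as approximate.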

However, would this feature be harmful to one of the core tenets of
Accumulo? Or, is acknowledging the existence of data in Accumulo with a
certain visibility acceptable? Would a new permission to use such an API to
access this information be sufficient to protect the data?

- Josh

