Hello. Thanks for the suggestions! So I noted the last time that I touched the Prov Repo that we definitely need to provide some docs about the overall design of repo, and why we designed it the way that we did. There are definitely some things that are not obvious about the design.
Let me use this e-mail as a starting point for addressing the why's of the design, based on the feedback provided. Comments are in-line, below. If, after you read the explanation for why we chose the current implementation/design, you feel there are still some things that should be changed, I (and I'm sure several others) would be happy to discuss those in more detail. Thanks! -Mark ---------------------------------------- > Date: Thu, 4 Jun 2015 20:25:21 -0400 > Subject: Provenance Repository Interface > From: [email protected] > To: [email protected] > > Hello! > > Thanks for all the support and direction thus far in my efforts to > improve the system provenance record keeping. > > I'd like to propose a few changes to the ProvenanceEventRepository > API. There are some inconsistent and error-prone method calls > available that I think can be enhanced. > > I've based some of my ideas off of the Hibernate Session object that > allows for this kind of plug-able behavior of persisting and searching > objects. In general, I believe there should be a bigger overhaul down > the road, but some items to consider for the shorter term: > > 1. Do not assume ProvenanceEventRecord objects have an ID of type > long. Instead, I'd like to see the ID be "Serializable" so one can > use numerical values, byte arrays (i.e., hashes), Strings (i.e., > UUID). There are a couple of reasons that we use Longs as identifiers, but the main reason is that it provides a guaranteed ordering of the events. This can be very important because NiFi can sometimes perform several Provenance Events for the same FlowFile within the same millisecond. As a result, the timestamp cannot guarantee ordering. Additionally, the Provenance Repo provides two very important mechanisms for retrieving the data: querying the data, as you would see in the UI; but also iterating over the events in the repository. The latter case means that we need a way to paginate, essentially, over the entire repository. We do this by using the getEvents method, indicating the first Event ID that we are interested in, as well as the maximum number of events. > > 2. Remove the method "getEvents". From what I can tell, the only > place it is used is to generate the "Oldest event available" display > on the Provenance UI. The way in which it is used is a bit wonky and > not actually correct. The ControllerFacade assumes that the first > item submitted to the repository is going to be the oldest. This > isn't true. Since the client is responsible for setting the event > date-time, the first item could have any arbitrary value and not > actually the "oldest" event in a way that most users would expect. I > would recommend removing this label from the UI altogether. It's > value seems limited and is generated in this incorrect manner. The getEvents method is used to iterate over all events (or some subset of Events) in the repository. This is typically done in a Reporting Task such that we can process all events in the repository (I spoke about this a bit in the first point above). With respect to obtaining the oldest event: you are correct in that it does not absolutely guarantee the first event based on timestamp. It provides the first event based on Event ID. If the repository is updated simultaneously with many events, it's possible that those two could be different from one another. However, in practice, since the only way for these two to diverge is if multiple threads are updating repository simultaneously, then if Event A has ID 1 and Event B has ID 2, we know that the timestamps of Event A and Event B are going to be EXTREMELY close - i.e., it's very unlikely that they will vary by more than a millisecond or two, though they could due to garbage collection, etc. Also note, that you are absolutely correct in that the "client is responsible for setting the event date-time" but remember that the client in this sense is actually the NiFi Framework. The repo is not expected to be used by arbitrary clients. The way this is used, though, is simply to indicate to the user the timestamp of the earliest event so that they have an understanding of how far back the repository goes. So, for instance, if they are interested in searching the repo for events that happened to some piece of data that they think was received 2 weeks ago, they should know immediately if the events will be there or not. > > 3. Add a "clear" method to allow the users to manually empty the repository. > We could certainly do this, though I don't honestly understand the use case. Can you explain why it would be beneficial to allow someone to clear the entire repository? > 4. Remove method "getMaxEventId". In an implementation that uses > UUID/Hash ids, there is no such thing as a "max" ID. > See explanation above for #1 as to why I think we need to keep Event ID's as Longs. > 5. Remove methods "registerEvent"/"registerEvents" in favor of a > single "addEvent" that accepts a single ProvenanceEventRecord and > returns its Serializable ID. > We use registerEvents so that if we have multiple events we can add them to the repository in a single "transaction" and because it is far more efficient to store a batch of events than to store a single event. The registerEvent method really is just a convenience method so that the code updating the repo doesn't always have to wrap a single event in a Collection themselves if they want to update the repo with just one event. > 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or > something like that) instead of using explicit IO exceptions. As is, > the IOException seems inconsistent across methods. > I definitely agree that it is inconsistent across methods. This needs to be addressed. I have created a ticket to address this: https://issues.apache.org/jira/browse/NIFI-660. I do, however, content that IOException is appropriate here. If the repo fails to update due to a problem writing to disk or to a network or whatever, it should absolutely throw an IOException. Any time that we are interacting with some sort of external device (be it network, disk, or whatever), there is a very real possibility that there will be environmental problems (such as disk full) that will cause us to fail, and any code that calls into such a method needs to be responsible for handling those conditions properly, so I feel a checked exception is appropriate. > I have attached a sample of these changes though only at the > Interface, not all changes required. Changing to a Serialize object > would affect several classes. > > Thanks.
