I totally understand where you are coming from. And I very much appreciate your consideration of how to improve such an important part of the system. I am sure that the Provenance Repo is far from perfect and will continue to evolve over time. And creating a flexible, pluggable framework is indeed a good thing.
It often does, however, come at a cost. In this case, an important potential cost would be performance. Consider this use case (a very real use case that drove much of the implementation and design): I have a ReportingTask that needs to process every Provenance Event that occurs. I need to process the events with little overhead, and each node is expected to keep several billion events.

In this case, I will ask for 1000 events at a time, starting at event ID 1. (I do this using the getEvents method, which is not used by the framework except in the case that you noticed; it is intended mostly for ReportingTasks.) Once I've processed them, I will persist the fact that I finished record 1000, then ask for 1000 events starting at event ID 1001, and repeat. If there are no more events, I wait 1 second and then continue.

Requesting events starting at a particular event ID is quite inexpensive due to the specific implementation. If we were instead required to say "start at page X using this sort order," each request for 1000 events would have to do quite a bit of work to sort those events and could be pretty expensive. We request only 1000 events at a time because it allows us to be conservative with heap utilization. However, we do want to pull back several events at a time for efficiency reasons (disk I/O, etc.).

It's also important to note that the existing implementation has been used in massive clustered, distributed environments for years with great success. Nodes do not send events to the NCM in an async manner, as I believe you are describing below. When a user issues a query, the NCM replicates the request to each node and then merges their responses.

I don't by any means want to imply that using a Serializable key is not a reasonable (or perhaps even better) solution for some use cases.
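To make the performance point concrete, the polling pattern above can be sketched roughly as follows. This is a simplified, self-contained sketch: the in-memory EventStore is a toy stand-in for the real repository (only the getEvents(firstEventId, maxRecords) shape matters here), and PollingTask plays the role of the ReportingTask.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the Provenance Repository; only the
// getEvents(firstEventId, maxRecords) contract matters here.
class EventStore {
    private final List<String> events = new ArrayList<>();

    void add(String event) {
        events.add(event); // in this toy store, an event's ID is its index
    }

    // Return up to maxRecords events whose ID is >= firstEventId.
    List<String> getEvents(long firstEventId, int maxRecords) {
        List<String> page = new ArrayList<>();
        for (long id = firstEventId; id < events.size() && page.size() < maxRecords; id++) {
            page.add(events.get((int) id));
        }
        return page;
    }
}

// Plays the role of the ReportingTask: pull a page, checkpoint, repeat.
class PollingTask {
    private long lastProcessedId = -1; // the real task would persist this

    // One polling pass: pull pages of up to pageSize events until drained.
    // Returns the number of events processed in this pass.
    int drain(EventStore store, int pageSize) {
        int processed = 0;
        while (true) {
            List<String> page = store.getEvents(lastProcessedId + 1, pageSize);
            if (page.isEmpty()) {
                break; // caught up; the real task would sleep and retry
            }
            processed += page.size();
            lastProcessedId += page.size(); // checkpoint only after the page succeeds
        }
        return processed;
    }

    long getLastProcessedId() {
        return lastProcessedId;
    }
}
```

The key property is that each page request is a cheap "give me the next N events at or after this ID" scan, with the last processed ID as the only state the task needs to persist.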
However, it would come at some cost, and I think we would need some very convincing, concrete use cases in order to make such a significant change worthwhile.

Thanks
-Mark

----------------------------------------
> Date: Sat, 6 Jun 2015 00:08:57 -0400
> Subject: Re: Provenance Repository Interface
> From: [email protected]
> To: [email protected]
>
> Hello,
>
> Thanks again for your time and sincere consideration.
>
> Here is the sample I was trying to email. Jira was giving me fits, so
> I tried to email it, but since it's just a discussion at this point,
> and not really an action item, I've dropped it on PasteBin.
>
> http://pastebin.com/ani98rS5
>
> In general, I'm trying to think of ways to allow for a more flexible,
> pluggable framework that is not tied down to how things exist at the
> moment.
>
> 1) Longs vs. UUIDs. I think you'll find that when you move to a
> clustered, distributed environment, UUIDs will be more useful. So at
> this point, I'm not asking to move the current code base away from
> numerical IDs. I'm simply asking that the interface be changed to
> allow for it. Defining Serializable IDs allows for the current
> numerical implementation and also for someone else to drop in UUIDs
> later (or whatever other IDs they can think of).
>
> Pagination is traditionally performed by sorting on a particular
> domain field (most often a date) and then taking a subset of the
> sorted set. I've already described to you a way in which event dates
> can be completely arbitrary in terms of insertion order in a
> multi-threaded application. Another practical situation would be in
> the case of a cluster. A node may queue events during a network
> outage and later submit them to the cluster manager, where the IDs
> no longer give any indication of the event time. With such an
> outage, it's not reasonable to assume that event IDs are inserted in
> the cluster manager in any kind of ordering.
> Perhaps the answer is to sort on event date and have each processor
> stamp some sort of one-up marker on each event, so that even if two
> events happen within the same millisecond, order can be preserved.
> I'd have to think about that more, but the answer is not to rely on
> the IDs having any kind of implicit ordering.
>
> 2) getEvents - I must apologize. Perhaps I am missing something, but
> the only place I can locate its use is in ControllerFacade, and it's
> used to generate the one value that displays the "oldest" event.
> What is the use case for viewing all the events in the repository?
> If it's important to iterate over all the events in the repository,
> allow the implementation to decide how pagination works. A simple
> method like:
>
> public Collection<ProvenanceEventRecord>
> getProvenanceEventRecords(final int pageNo, final int pageSize)
>
> would suffice.
>
> 3) If you feel that the "oldest" event is a required feature, then
> allow the repository to implement the solution. Include a
> "getOldestEvent" method or update the query mechanism to allow for
> sorted and "top n" type queries.
>
> 4) The "clear" method would be nice to plan for now, but tying it
> into the UI could come at some later point. The feature would be
> mostly useful for testing purposes. One may want to build a flow,
> run some data through it, reset everything, make some tweaks, and
> try it again. It's helpful to clear out the data between test runs
> so that there is no confusion about which events were generated on
> THIS test run and which were generated during the LAST test run.
> Secondly, it's a quick way to clear out events in the case where
> resources have become scarce for some reason and an operator needs a
> quick fix. Dumping months of events may be a quick option.
>
> 5) Pedantic, but I was suggesting addEvent and getEvent to map more
> closely to what a Java developer would expect to see when working
> with some sort of data "collection" or "repository."
> This would be in place of the method "registerEvent." Second, when I
> was writing the unit tests, it felt clunky because I had no easy way
> to get the ID that the repository generated for the event after I
> submitted it. I wanted to write a simple:
>
> long id = addEvent(actualEvent);
> Event resultEvent = getEvent(id);
> assertEquals(actualEvent, resultEvent);
>
> unit test, but it's not currently possible. The addEvent method
> should return the created ID. In such a scenario, to be consistent,
> a given addEvents method should return a list of the IDs that were
> created, one for each event. That's one way to go about it, but it's
> not really clean. Depending on the storage mechanism, some of the
> events could be persisted, but if an error occurred part way through
> the list and there was no rollback plan, then there would be no way
> for the repository to communicate both the exception and the list of
> IDs that did succeed. I think that should be handled at a higher
> level.
>
> 6) Well, there's always the debate these days about checked vs.
> unchecked exceptions. I would just say that if the repository is
> supposed to do any validation of the events, then it could be rolled
> up, with IOException, into one custom exception that all methods
> throw. In the current implementation of the standard
> ProvenanceEventBuilder, I see that it does validation in the build
> method, though that is not documented as a requirement of a
> ProvenanceEventBuilder. So, I guess if that is documented as part of
> the function of the build method, and not on the repository, then
> the repository might reasonably be believed to only choke on an
> IOException.
>
> Thanks!
>
>
> On Fri, Jun 5, 2015 at 8:16 AM, Mark Payne <[email protected]> wrote:
>> Hello.
>>
>> Thanks for the suggestions! So I noted the last time that I touched
>> the Prov Repo that we definitely need to provide some docs about
>> the overall design of the repo, and why we designed it the way that
>> we did.
>> There are definitely some things that are not obvious about the
>> design.
>>
>> Let me use this e-mail as a starting point for addressing the whys
>> of the design, based on the feedback provided. Comments are
>> in-line, below.
>>
>> If, after you read the explanation for why we chose the current
>> implementation/design, you feel there are still some things that
>> should be changed, I (and I'm sure several others) would be happy
>> to discuss those in more detail.
>>
>> Thanks!
>> -Mark
>>
>> ----------------------------------------
>>> Date: Thu, 4 Jun 2015 20:25:21 -0400
>>> Subject: Provenance Repository Interface
>>> From: [email protected]
>>> To: [email protected]
>>>
>>> Hello!
>>>
>>> Thanks for all the support and direction thus far in my efforts to
>>> improve the system's provenance record keeping.
>>>
>>> I'd like to propose a few changes to the ProvenanceEventRepository
>>> API. There are some inconsistent and error-prone method calls
>>> available that I think can be improved.
>>>
>>> I've based some of my ideas on the Hibernate Session object, which
>>> allows for this kind of pluggable behavior when persisting and
>>> searching objects. In general, I believe there should be a bigger
>>> overhaul down the road, but here are some items to consider for
>>> the shorter term:
>>>
>>> 1. Do not assume ProvenanceEventRecord objects have an ID of type
>>> long. Instead, I'd like to see the ID be "Serializable" so one can
>>> use numerical values, byte arrays (i.e., hashes), or Strings
>>> (e.g., UUIDs).
>>
>> There are a couple of reasons that we use Longs as identifiers, but
>> the main reason is that it provides a guaranteed ordering of the
>> events. This can be very important because NiFi can sometimes
>> perform several Provenance Events for the same FlowFile within the
>> same millisecond. As a result, the timestamp cannot guarantee
>> ordering.
>>
>> Additionally, the Provenance Repo provides two very important
>> mechanisms for retrieving the data: querying the data, as you would
>> see in the UI, and iterating over the events in the repository. The
>> latter case means that we need a way to paginate, essentially, over
>> the entire repository. We do this by using the getEvents method,
>> indicating the first Event ID that we are interested in, as well as
>> the maximum number of events.
>>
>>> 2. Remove the method "getEvents". From what I can tell, the only
>>> place it is used is to generate the "Oldest event available"
>>> display on the Provenance UI. The way in which it is used is a bit
>>> wonky and not actually correct. The ControllerFacade assumes that
>>> the first item submitted to the repository is going to be the
>>> oldest. This isn't true. Since the client is responsible for
>>> setting the event date-time, the first item could have any
>>> arbitrary value and may not actually be the "oldest" event in the
>>> way that most users would expect. I would recommend removing this
>>> label from the UI altogether. Its value seems limited, and it is
>>> generated in this incorrect manner.
>>
>> The getEvents method is used to iterate over all events (or some
>> subset of events) in the repository. This is typically done in a
>> Reporting Task so that we can process all events in the repository
>> (I spoke about this a bit in the first point above).
>>
>> With respect to obtaining the oldest event: you are correct in that
>> it does not absolutely guarantee the first event based on
>> timestamp. It provides the first event based on Event ID. If the
>> repository is updated simultaneously with many events, it's
>> possible that those two could be different from one another.
>> However, in practice, the only way for these two to diverge is if
>> multiple threads are updating the repository simultaneously, so if
>> Event A has ID 1 and Event B has ID 2, we know that the timestamps
>> of Event A and Event B are going to be EXTREMELY close - i.e., it's
>> very unlikely that they will vary by more than a millisecond or
>> two, though they could due to garbage collection, etc. Also note
>> that you are absolutely correct that the "client is responsible for
>> setting the event date-time," but remember that the client in this
>> sense is actually the NiFi framework. The repo is not expected to
>> be used by arbitrary clients.
>>
>> The way this is used, though, is simply to indicate to the user the
>> timestamp of the earliest event so that they have an understanding
>> of how far back the repository goes. So, for instance, if they are
>> interested in searching the repo for events that happened to some
>> piece of data that they think was received 2 weeks ago, they should
>> know immediately whether the events will be there or not.
>>
>>
>>> 3. Add a "clear" method to allow users to manually empty the
>>> repository.
>>>
>>
>> We could certainly do this, though I don't honestly understand the
>> use case. Can you explain why it would be beneficial to allow
>> someone to clear the entire repository?
>>
>>> 4. Remove the method "getMaxEventId". In an implementation that
>>> uses UUID/hash IDs, there is no such thing as a "max" ID.
>>>
>>
>> See the explanation above for #1 as to why I think we need to keep
>> Event IDs as Longs.
>>
>>
>>> 5. Remove the methods "registerEvent"/"registerEvents" in favor of
>>> a single "addEvent" that accepts a single ProvenanceEventRecord
>>> and returns its Serializable ID.
>>>
>>
>> We use registerEvents so that, if we have multiple events, we can
>> add them to the repository in a single "transaction," and because
>> it is far more efficient to store a batch of events than to store a
>> single event. The registerEvent method really is just a convenience
>> method so that the code updating the repo doesn't always have to
>> wrap a single event in a Collection itself when it wants to update
>> the repo with just one event.
>>
>>
>>> 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or
>>> something like that) instead of using explicit IO exceptions. As
>>> is, the IOException seems inconsistent across methods.
>>>
>>
>> I definitely agree that it is inconsistent across methods. This
>> needs to be addressed. I have created a ticket to address this:
>> https://issues.apache.org/jira/browse/NIFI-660
>>
>> I do, however, contend that IOException is appropriate here. If the
>> repo fails to update due to a problem writing to disk or to a
>> network or whatever, it should absolutely throw an IOException. Any
>> time that we are interacting with some sort of external device (be
>> it network, disk, or whatever), there is a very real possibility
>> that there will be environmental problems (such as a full disk)
>> that will cause us to fail, and any code that calls into such a
>> method needs to be responsible for handling those conditions
>> properly, so I feel a checked exception is appropriate.
>>
>>> I have attached a sample of these changes, though only at the
>>> interface level; not all of the required changes are included.
>>> Changing to a Serializable ID would affect several classes.
>>>
>>> Thanks.
>>
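The convenience relationship between registerEvent and registerEvents described in the reply to point 5 can be sketched as follows. This is a minimal illustration only; the interface and in-memory implementation below are toys, not NiFi's actual repository code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Toy sketch of the registerEvent/registerEvents relationship: the batch
// method is the primitive, and the single-event method is just a
// convenience wrapper around it.
interface ToyEventRepository {
    // Store a batch of events in a single "transaction".
    void registerEvents(Collection<String> events);

    // Convenience: wrap a single event so callers don't have to.
    default void registerEvent(String event) {
        registerEvents(Collections.singletonList(event));
    }
}

class InMemoryEventRepository implements ToyEventRepository {
    private final List<String> store = new ArrayList<>();
    private int batches = 0;

    @Override
    public void registerEvents(Collection<String> events) {
        store.addAll(events); // in this toy, one call == one "transaction"
        batches++;
    }

    int eventCount() { return store.size(); }
    int batchCount() { return batches; }
}
```

Because the single-event method delegates to the batch method, every write goes through the same "transaction" path, and callers never have to wrap one event in a Collection themselves.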
