Hello.

Thanks for the suggestions! So I noted the last time that I touched the Prov 
Repo that we definitely need to
provide some docs about the overall design of repo, and why we designed it the 
way that we did. There are
definitely some things that are not obvious about the design.

Let me use this e-mail as a starting point for addressing the why's of the 
design, based on the feedback provided.
Comments are in-line, below.

If, after you read the explanation for why we chose the current 
implementation/design, you feel there are still some things
that should be changed, I (and I'm sure several others) would be happy to 
discuss those in more detail.

Thanks!
-Mark

----------------------------------------
> Date: Thu, 4 Jun 2015 20:25:21 -0400
> Subject: Provenance Repository Interface
> From: [email protected]
> To: [email protected]
>
> Hello!
>
> Thanks for all the support and direction thus far in my efforts to
> improve the system provenance record keeping.
>
> I'd like to propose a few changes to the ProvenanceEventRepository
> API. There are some inconsistent and error-prone method calls
> available that I think can be enhanced.
>
> I've based some of my ideas off of the Hibernate Session object that
> allows for this kind of plug-able behavior of persisting and searching
> objects. In general, I believe there should be a bigger overhaul down
> the road, but some items to consider for the shorter term:
>
> 1. Do not assume ProvenanceEventRecord objects have an ID of type
> long. Instead, I'd like to see the ID be "Serializable" so one can
> use numerical values, byte arrays (i.e., hashes), Strings (i.e.,
> UUID).

There are a couple of reasons that we use Longs as identifiers, but the main
reason is that it provides a guaranteed ordering of the events. This can be very
important because NiFi can sometimes perform several Provenance Events for the
same FlowFile within the same millisecond. As a result, the timestamp cannot 
guarantee
ordering.

Additionally, the Provenance Repo provides two very important mechanisms for 
retrieving
the data: querying the data, as you would see in the UI; but also iterating 
over the events
in the repository. The latter case means that we need a way to paginate, 
essentially, over
the entire repository. We do this by using the getEvents method, indicating the 
first
Event ID that we are interested in, as well as the maximum number of events.

>
> 2. Remove the method "getEvents". From what I can tell, the only
> place it is used is to generate the "Oldest event available" display
> on the Provenance UI. The way in which it is used is a bit wonky and
> not actually correct. The ControllerFacade assumes that the first
> item submitted to the repository is going to be the oldest. This
> isn't true. Since the client is responsible for setting the event
> date-time, the first item could have any arbitrary value and not
> actually the "oldest" event in a way that most users would expect. I
> would recommend removing this label from the UI altogether. It's
> value seems limited and is generated in this incorrect manner.

The getEvents method is used to iterate over all events (or some subset of 
Events) in the
repository. This is typically done in a Reporting Task such that we can process 
all events
in the repository (I spoke about this a bit in the first point above).

With respect to obtaining the oldest event: you are correct in that it does not 
absolutely
guarantee the first event based on timestamp. It provides the first event based 
on Event ID.
If the repository is updated simultaneously with many events, it's possible 
that those two
could be different from one another. However, in practice, since the only way 
for these
two to diverge is if multiple threads are updating repository simultaneously, 
then if Event A
has ID 1 and Event B has ID 2, we know that the timestamps of Event A and Event 
B are going to
be EXTREMELY close - i.e., it's very unlikely that they will vary by more than 
a millisecond or two,
though they could due to garbage collection, etc. Also note, that you are 
absolutely correct in
that the "client is responsible for setting the event date-time" but remember 
that the client
in this sense is actually the NiFi Framework. The repo is not expected to be 
used by arbitrary
clients.

The way this is used, though, is simply to indicate to the user the timestamp 
of the earliest event
so that they have an understanding of how far back the repository goes. So, for 
instance, if they are
interested in searching the repo for events that happened to some piece of data 
that they think was
received 2 weeks ago, they should know immediately if the events will be there 
or not.


>
> 3. Add a "clear" method to allow the users to manually empty the repository.
>

We could certainly do this, though I don't honestly understand the use case. 
Can you explain why
it would be beneficial to allow someone to clear the entire repository?

> 4. Remove method "getMaxEventId". In an implementation that uses
> UUID/Hash ids, there is no such thing as a "max" ID.
>

See explanation above for #1 as to why I think we need to keep Event ID's as 
Longs.


> 5. Remove methods "registerEvent"/"registerEvents" in favor of a
> single "addEvent" that accepts a single ProvenanceEventRecord and
> returns its Serializable ID.
>

We use registerEvents so that if we have multiple events we can add them to the 
repository in a single
"transaction" and because it is far more efficient to store a batch of events 
than to store a single event.
The registerEvent method really is just a convenience method so that the code 
updating the repo doesn't
always have to wrap a single event in a Collection themselves if they want to 
update the repo with just
one event.


> 6. Introduce a ProvenanceEventRepositoryRuntimeException class (or
> something like that) instead of using explicit IO exceptions. As is,
> the IOException seems inconsistent across methods.
>

I definitely agree that it is inconsistent across methods. This needs to be 
addressed. I have created
a ticket to address this: https://issues.apache.org/jira/browse/NIFI-660.

I do, however, content that IOException is appropriate here. If the repo fails 
to update due to a problem writing
to disk or to a network or whatever, it should absolutely throw an IOException. 
Any time that we are interacting
with some sort of external device (be it network, disk, or whatever), there is 
a very real possibility that there will
be environmental problems (such as disk full) that will cause us to fail, and 
any code that calls into such a method
needs to be responsible for handling those conditions properly, so I feel a 
checked exception is appropriate.

> I have attached a sample of these changes though only at the
> Interface, not all changes required. Changing to a Serialize object
> would affect several classes.
>
> Thanks.
                                          

Reply via email to