I commented on the PR, but I’ll add this to the thread here.

Wouldn’t something like this lend itself to a ReportingTask? If not the
current structure, then a similar structure for records?

That would allow the destination to do time-series analysis, etc. That's
not to say there isn't a case for having it in the Flow as well.
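
As a rough illustration, a ReportingTask could pull provenance events on a
schedule and push the derived stats to an external store. This is only an
untested sketch; the class name and the publish() helper are hypothetical,
while EventAccess#getProvenanceEvents is the actual framework hook:

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

// Hypothetical stats-forwarding task, NOT code from the PR.
public class RecordStatsReportingTask extends AbstractReportingTask {

    private long lastEventId = 0L;

    @Override
    public void onTrigger(final ReportingContext context) {
        try {
            // Pull the provenance events emitted since the last run.
            final List<ProvenanceEventRecord> events =
                context.getEventAccess().getProvenanceEvents(lastEventId, 1000);
            for (final ProvenanceEventRecord event : events) {
                lastEventId = Math.max(lastEventId, event.getEventId() + 1);
                // Ship the event's attributes (e.g. the RecordStats counts)
                // to whatever system does the time-series analysis.
                publish(event.getComponentId(), event.getAttributes());
            }
        } catch (final IOException e) {
            getLogger().error("Failed to retrieve provenance events", e);
        }
    }

    private void publish(final String componentId,
                         final Map<String, String> attrs) {
        // Destination-specific; deliberately left abstract in this sketch.
    }
}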

On May 24, 2018 at 08:05:29, Mike Thomsen (mikerthom...@gmail.com) wrote:

I wrote a processor that's inspired by one of the Groovy scripts we use at
that client. PR is here if anyone wants to take a look:

https://github.com/apache/nifi/pull/2737

It's called "RecordStats"; it provides a general record count attribute
and also lets you specify record path operations to get stats on
individual field values. For example, if you have a field called
"department" you can add this as a dynamic property:

department_count (prop name) => /department

which will produce the following:

{
  "record_count": "100",
  "department": "75",
  "department.Engineering": "25",
  "department.Marketing": "10",
  "department.Operations": "25",
  "department.Finance": "15"
}
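
For anyone curious, the core of it is basically just record path
evaluation plus counting. A simplified sketch of the idea (not the exact
code from the PR), assuming "reader" is an open RecordReader and we're
inside the usual try/catch for IOException/MalformedRecordException:

import java.util.HashMap;
import java.util.Map;

import org.apache.nifi.record.path.RecordPath;
import org.apache.nifi.serialization.record.Record;

// Simplified idea only; schema handling and error handling omitted.
final RecordPath path = RecordPath.compile("/department");
final Map<String, Integer> counts = new HashMap<>();

Record record;
while ((record = reader.nextRecord()) != null) {
    counts.merge("record_count", 1, Integer::sum);
    path.evaluate(record).getSelectedFields()
        .filter(fv -> fv.getValue() != null)
        .forEach(fv -> {
            counts.merge("department", 1, Integer::sum);
            counts.merge("department." + fv.getValue(), 1, Integer::sum);
        });
}
// The counts map then becomes the flowfile attributes shown above.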

The scenario we have that led to this involves a lot of big queries and
full collection fetches from MongoDB, often as much as 80GB at a time, so
they'd rather see a little slowdown from examining those stats and being
able to get "accurate counts" than see things go lightning fast without
insight into exactly what came out of those fetches.

On Tue, May 15, 2018 at 8:40 PM Koji Kawamura <ijokaruma...@gmail.com>
wrote:

> Hi Mike,
>
> I agree with the approach of enriching provenance events. In order to
> do so, we can use several places to embed meta-data:
>
> - FlowFile attributes: automatically mapped to a provenance event, but
> as Andy mentioned, we need to be careful not to put sensitive data there.
> - Transit URI: when I developed the NiFi Atlas integration, I used this as
> the primary source of what data a processor interacts with, e.g. remote
> address, database, table, etc.
> - The 'details' string: it might not be the ideal solution, but
> ProvenanceReporter accepts an additional 'details' string. We can embed
> whatever we want here.
>
> I'd map the meta-data you mentioned as follows:
> 1. Source system => Transit URI
> 2. Database/table/index/collection/etc. => Transit URI or FlowFile
> attribute. I think it's fine to put these into attributes.
> 3. The lookup criteria that was used (similar to the "query attribute"
> some already have) => 'details' string
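>
> For example, a GetMongo-style processor could report all three in one
> event, something like this (sketch only; it assumes the fetch() overload
> that accepts a details string, and fetchMillis is illustrative):
>
> final String transitUri =
>     "mongodb://localhost:27017/testdb.test_collection";
> final String details = "query: { \"username\": \"john.smith\" }";
> session.getProvenanceReporter().fetch(flowFile, transitUri, details,
>     fetchMillis); // fetchMillis = how long the query took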
>
> What I learned from the Atlas integration is that it's really hard to
> design a complete standard set of attributes. I'd suggest using what the
> NiFi framework currently provides.
>
> Thanks,
>
> Koji
>
> On Tue, May 15, 2018 at 8:15 AM, Andy LoPresto <alopre...@apache.org>
> wrote:
> > Maybe an ADDINFO event or FORK event could be used and a new flowfile with
> > the relevant attributes/content could be created. The flowfiles would be
> > linked, but the “sensitive” information wouldn’t travel with the original.
> >
> > Andy LoPresto
> > alopre...@apache.org
> > alopresto.apa...@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
> >
> > On May 14, 2018, at 3:32 PM, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
> >
> > Does the provenance system have the ability to add user-defined key/value
> > pairs to a flowfile's provenance record at a particular processor?
> >
> > On Mon, May 14, 2018 at 6:11 PM Andy LoPresto <alopre...@apache.org>
> wrote:
> >
> > I would actually propose that this is added to the provenance but not
> > always put into the flowfile attributes. There are many scenarios in which
> > the data retrieval should be separated from the analysis/follow-on, for
> > visibility, responsibility, and security reasons. While I understand a
> > separate UpdateAttribute processor could be put in the downstream flow to
> > remove these attributes, I would push for not adding them by default as a
> > more secure approach. Perhaps this could be configurable on the Get*
> > processors via a boolean property, but I think doing it automatically by
> > default introduces some serious concerns.
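> >
> > Something like the following opt-in descriptor, off by default (the
> > property name and description here are just illustrative):
> >
> > public static final PropertyDescriptor ADD_SOURCE_ATTRIBUTES =
> >     new PropertyDescriptor.Builder()
> >         .name("add-source-attributes") // hypothetical name
> >         .displayName("Add Source Attributes")
> >         .description("If true, write source metadata (URL, database, "
> >             + "query, etc.) to flowfile attributes.")
> >         .allowableValues("true", "false")
> >         .defaultValue("false")
> >         .required(true)
> >         .build();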
> >
> >
> > Andy LoPresto
> > alopre...@apache.org
> > alopresto.apa...@gmail.com
> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
> >
> > On May 13, 2018, at 11:48 AM, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
> >
> > @Joe @Matt
> >
> > This is kinda related to the point that Joe made in the graph DB thread
> > about provenance. My thought here was that we need some standards for
> > enriching the metadata about what was fetched, so that no matter how you
> > store the provenance, you can find some way to query it for questions like
> > when a data set was loaded into NiFi, how many records went through a
> > terminating processor, etc. IMO this could help batch-oriented
> > organizations feel more at ease with something stream-oriented like NiFi.
> >
> > On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <mikerthom...@gmail.com>
> > wrote:
> >
> > I'd like to propose that all non-deprecated (or likely to be deprecated)
> > Get/Fetch/Query processors get a standard convention for attributes that
> > describe things like:
> >
> > 1. Source system.
> > 2. Database/table/index/collection/etc.
> > 3. The lookup criteria that was used (similar to the "query attribute"
> > some already have).
> >
> > Using GetMongo as an example, it would add something like this:
> >
> > source.url=mongodb://localhost:27017
> > source.database=testdb
> > source.collection=test_collection
> > source.query={ "username": "john.smith" }
> > source.criteria.username=john.smith // GetMongo would parse the query
> > and add this.
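> >
> > In processor terms that's just a putAllAttributes() call; a sketch with
> > illustrative variable names, not the actual GetMongo fields:
> >
> > final Map<String, String> srcAttrs = new HashMap<>();
> > srcAttrs.put("source.url", mongoUri);
> > srcAttrs.put("source.database", databaseName);
> > srcAttrs.put("source.collection", collectionName);
> > srcAttrs.put("source.query", queryJson);
> > srcAttrs.put("source.criteria.username", parsedUsername);
> > flowFile = session.putAllAttributes(flowFile, srcAttrs);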
> >
> > We have a use case where a team is coming from an extremely batch-oriented
> > view and really wants to know when "dataset X" was run. Our solution was to
> > extract that from the result set because the dataset name is one of the
> > fields in the JSON body.
> >
> > I think this would help expand what you can do out of the box with
> > provenance tracking because it would provide a lot of useful information
> > that could be stored in Solr or ES and then queried against terminating
> > processors' DROP events to get a solid window into when jobs were run
> > historically.
> >
> > Thoughts?