[jira] [Updated] (NIFI-7745) Add a SampleRecord processor

Matt Burgess (Jira) Sun, 16 Aug 2020 10:01:04 -0700


     [ 
https://issues.apache.org/jira/browse/NIFI-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matt Burgess updated NIFI-7745:
-------------------------------
    Description: 
Sampling records in a flowfile can be a helpful way to test with "real" data, 
especially for source systems that contain large datasets. It may not be 
possible to sample the data on the source system or to test NiFi flows on 
smaller datasets from the source system(s). Sampling in NiFi may currently be 
possible (for example, using QueryRecord with row numbers), but it is likely 
done in memory (in the QueryRecord case) or in a simplistic fashion.

This Jira proposes a SampleRecord processor that should offer (at least) 
the following sampling options:

Interval Sampling (every Nth record)
Probabilistic Sampling (each record has a probability P of being chosen)
Reservoir Sampling (a sample of size K with each record having equal 
probability of being chosen)
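
For illustration only, the three options above could be sketched roughly as 
follows. This is not the proposed processor's code or any NiFi API: it is 
written in Python for brevity, the function names (interval_sample, 
probabilistic_sample, reservoir_sample) and the optional rng parameter are 
hypothetical, and the reservoir variant shown is the classic "Algorithm R", 
which is an assumption about how such a strategy might be implemented. Each 
function makes a single pass over the records, and only the reservoir variant 
holds records in memory (at most K of them).

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], n: int) -> Iterator[T]:
    """Yield every n-th record (the n-th, 2n-th, 3n-th, ...)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for position, record in enumerate(records, start=1):
        if position % n == 0:
            yield record


def probabilistic_sample(
    records: Iterable[T], p: float, rng: Optional[random.Random] = None
) -> Iterator[T]:
    """Yield each record independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if rng is None:
        rng = random.Random()
    for record in records:
        # rng.random() is uniform on [0.0, 1.0), so p=0 keeps nothing
        # and p=1 keeps everything.
        if rng.random() < p:
            yield record


def reservoir_sample(
    records: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k records chosen uniformly at random (Algorithm R).

    If fewer than k records are available, all of them are returned.
    """
    if k < 0:
        raise ValueError("k must be >= 0")
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = record
    return reservoir
```

For example, list(interval_sample(range(1, 11), 3)) keeps the 3rd, 6th and 9th 
records. The output size of the probabilistic variant varies from run to run, 
while the reservoir variant always returns exactly K records once the input 
holds at least K.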


  was:
Sampling records in a flowfile can be a helpful way to test with "real" data, 
especially for source systems that contain large datasets. It may not be 
possible on the source system to sample the data or test NiFi flows on smaller 
datasets from the source system(s). Sampling in NiFi may be currently possible 
(such as QueryRecord with row numbers), but is likely done in-memory (in the 
QueryRecord case) or in a simplistic fashion.

This Jira proposes a SampleRecord processor that should offer (at the least) 
the following sampling options:

Interval Sampling (every Nth record)
Probabilistic Sampling (each record has a probability P of being chosen)
Reservoir Sampling (A sample of size K with each record having equal 
probability of being chosen)
Weighted Random Sampling (Records are chosen with probabilities weighted by the 
number of occurrences of a value of a specified field in the record)



> Add a SampleRecord processor
> ----------------------------
>
>                 Key: NIFI-7745
>                 URL: https://issues.apache.org/jira/browse/NIFI-7745
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
>            Priority: Major
>
> Sampling records in a flowfile can be a helpful way to test with "real" data, 
> especially for source systems that contain large datasets. It may not be 
> possible to sample the data on the source system or to test NiFi flows on 
> smaller datasets from the source system(s). Sampling in NiFi may currently be 
> possible (for example, using QueryRecord with row numbers), but it is likely 
> done in memory (in the QueryRecord case) or in a simplistic fashion.
> This Jira proposes a SampleRecord processor that should offer (at least) 
> the following sampling options:
> Interval Sampling (every Nth record)
> Probabilistic Sampling (each record has a probability P of being chosen)
> Reservoir Sampling (a sample of size K with each record having equal 
> probability of being chosen)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
