Correct, Griffin calculates data quality metrics, but sometimes people also
care about the "failed" data. There is a way to sink that data into HDFS, like
the mismatched data output in accuracy:
https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/AccuracyExpr2DQSteps.scala#L87
Functionally, Griffin supports the "spark-sql" rule type to output the filtered
data you want; just configure the rule like this:
{
  "dsl.type": "spark-sql",
  "out.dataframe.name": "failed",
  "rule": "select * from source where age > 100",
  "out": [
    {
      "type": "record"
    }
  ]
}
The failed data would be written into the output HDFS directory configured in
env.json.
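For reference, the sink section of env.json that controls that output directory
looks roughly like the sketch below. The path and values are illustrative only;
check the env.json shipped with your Griffin version for the exact field names
it expects:

```json
{
  "sinks": [
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs:///griffin/persist",
        "max.persist.lines": 10000
      }
    }
  ]
}
```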
Furthermore, if you want the failed data published to a Kafka topic, that is a
sink type Griffin does not currently support; you would need to implement a new
sink type to sink the records:
https://github.com/apache/griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/sink
https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/configuration/enums/SinkType.scala
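As a rough illustration of what such a sink boils down to, here is a
hypothetical, simplified trait and class. This is NOT Griffin's actual Sink
interface (see the sink package and SinkType.scala linked above for the real
methods to implement); the `send` function stands in for a Kafka producer so
the sketch stays runnable without a broker:

```scala
// Hypothetical, simplified sketch -- NOT Griffin's real Sink trait.
import scala.collection.mutable.ListBuffer

// Stand-in for the record-sinking part of a sink.
trait RecordSink {
  def sinkRecords(records: Iterable[String]): Unit
}

// A Kafka-style sink: each failed record becomes one message on `topic`.
// In a real implementation, `send` would wrap a KafkaProducer[String, String]
// publishing to the configured topic; here it is injected for testability.
class KafkaRecordSink(topic: String, send: (String, String) => Unit)
    extends RecordSink {
  override def sinkRecords(records: Iterable[String]): Unit =
    records.foreach(record => send(topic, record))
}

// Wiring with an in-memory buffer standing in for the producer:
val sent = ListBuffer.empty[(String, String)]
val sink = new KafkaRecordSink("failed-records", (t, m) => sent += ((t, m)))
sink.sinkRecords(Seq("""{"age": 120}""", """{"age": 150}"""))
```

A real implementation would replace the injected function with a configured
KafkaProducer and register the new type alongside the existing ones in
SinkType.scala.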
Thanks
Lionel, Liu
From: Jeff Zemerick
Sent: February 10, 2019 23:56
To: [email protected]
Subject: Re: publish data to new Kafka topic
Yes, a "data filter" describes it well. I think what would work would be if
there could be a Boolean property on a rule that says if the rule fails
then filter out that data (by redirecting it to a separate Kafka topic).
Since Griffin is focused on data quality measurement that type of
functionality might be out of scope for Griffin.
Thanks,
Jeff
On Sun, Feb 10, 2019 at 9:42 AM Lionel, Liu <[email protected]> wrote:
> Hi Jeff,
>
>
>
> Seems like you're looking for a data filter. Originally, Griffin
> calculates data quality measures like accuracy and profiling, and the
> output of Griffin is the data quality metrics.
>
> In your case, what kind of data quality do you want to check? How do you
> define the success or failure of your data?
>
>
>
> Thanks
> Lionel, Liu
>
>
>
> From: Jeff Zemerick <[email protected]>
> Sent: February 6, 2019 21:41
> To: [email protected]
> Subject: publish data to new Kafka topic
>
>
>
> Hi Griffin devs,
>
>
>
> Continuing my email thread from users@ and to better clarify it, I have a
> Kafka topic with JSON data on it. I would like to perform quality checks on
> this data, and I would like for data that meets the quality checks to be
> published to a separate Kafka topic, while data that fails one or more
> quality checks is left on the original Kafka topic. Is something like this
> possible with Griffin? Please let me know if my use-case is not clear.
>
> Thanks,
> Jeff