Correct, Griffin calculates data quality metrics, but sometimes people also
care about the "failed" data. There is a way to sink that data into HDFS, like
the mismatched data output in accuracy:
https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/AccuracyExpr2DQSteps.scala#L87
Functionally, Griffin supports the "spark-sql" rule type to output the filtered
data you want; just configure the rule like this:
{
  "dsl.type": "spark-sql",
  "out.dataframe.name": "failed",
  "rule": "select * from source where age > 100",
  "out": [
    {
      "type": "record"
    }
  ]
}
The failed data would be written into the output HDFS directory configured in
env.json.
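For reference, the sink section of env.json that controls that output directory
looks roughly like the sketch below. The path and values are illustrative only;
check the env.json shipped with your Griffin version for the exact field names
it expects:

```json
{
  "sinks": [
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs:///griffin/persist",
        "max.persist.lines": 10000
      }
    }
  ]
}
```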
Furthermore, if you want the failed data published to a Kafka topic, that is a
sink type Griffin does not currently support; you would need to implement a new
sink type to sink the records:
https://github.com/apache/griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/sink
https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/configuration/enums/SinkType.scala
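As a rough illustration of what such a sink boils down to, here is a
hypothetical, simplified trait and class. This is NOT Griffin's actual Sink
interface (see the sink package and SinkType.scala linked above for the real
methods to implement); the `send` function stands in for a Kafka producer so
the sketch stays runnable without a broker:

```scala
// Hypothetical, simplified sketch -- NOT Griffin's real Sink trait.
import scala.collection.mutable.ListBuffer

// Stand-in for the record-sinking part of a sink.
trait RecordSink {
  def sinkRecords(records: Iterable[String]): Unit
}

// A Kafka-style sink: each failed record becomes one message on `topic`.
// In a real implementation, `send` would wrap a KafkaProducer[String, String]
// publishing to the configured topic; here it is injected for testability.
class KafkaRecordSink(topic: String, send: (String, String) => Unit)
    extends RecordSink {
  override def sinkRecords(records: Iterable[String]): Unit =
    records.foreach(record => send(topic, record))
}

// Wiring with an in-memory buffer standing in for the producer:
val sent = ListBuffer.empty[(String, String)]
val sink = new KafkaRecordSink("failed-records", (t, m) => sent += ((t, m)))
sink.sinkRecords(Seq("""{"age": 120}""", """{"age": 150}"""))
```

A real implementation would replace the injected function with a configured
KafkaProducer and register the new type alongside the existing ones in
SinkType.scala.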
Thanks
Lionel, Liu
From: Jeff Zemerick
Sent: February 10, 2019 23:56
To: [email protected]
Subject: Re: publish data to new Kafka topic
Yes, a "data filter" describes it well. I think what would work would be if
there could be a Boolean property on a rule that says if the rule fails
then filter out that data (by redirecting it to a separate Kafka topic).
Since Griffin is focused on data quality measurement that type of
functionality might be out of scope for Griffin.
Thanks,
Jeff
On Sun, Feb 10, 2019 at 9:42 AM Lionel, Liu <[email protected]> wrote:
> Hi Jeff,
>
>
>
> Seems like you're looking for a data filter. Originally, Griffin
> calculates data quality measures like accuracy and profiling, and the
> output of Griffin is the data quality metrics.
>
> In your case, what kind of data quality do you want to check? How do you
> define the success or failure of your data?
>
>
>
> Thanks
> Lionel, Liu
>
>
>
> From: Jeff Zemerick <[email protected]>
> Sent: February 6, 2019 21:41
> To: [email protected]
> Subject: publish data to new Kafka topic
>
>
>
> Hi Griffin devs,
>
>
>
> Continuing my email thread from users@ and to better clarify it, I have a
> Kafka topic with JSON data on it. I would like to perform quality checks on
> this data, and I would like for data that meets the quality checks to be
> published to a separate Kafka topic, while data that fails one or more
> quality checks is left on the original Kafka topic. Is something like this
> possible with Griffin? Please let me know if my use-case is not clear.
>
> Thanks,
> Jeff