Thanks! I will take a look at that.

Jeff
On Sun, Feb 10, 2019 at 11:10 PM Lionel, Liu <[email protected]> wrote:

> Correct, Griffin calculates data quality metrics, but sometimes people
> also care about the "failed" data. There is a way to sink the data into
> HDFS, like the mismatched data output in accuracy:
>
> https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/AccuracyExpr2DQSteps.scala#L87
>
> Functionally, Griffin supports the "spark-sql" rule type to output the
> filtered data you want; just configure the rule like this:
>
>   {
>     "dsl.type": "spark-sql",
>     "out.dataframe.name": "failed",
>     "rule": "select * from source where age > 100",
>     "out": [
>       { "type": "record" }
>     ]
>   }
>
> The failed data would be written into the output HDFS directory
> configured in env.json.
>
> Furthermore, if you want the failed data written to a Kafka topic, that
> is a new sink type, which Griffin does not currently support; you would
> need to implement a new sink type to sink records:
>
> https://github.com/apache/griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/sink
> https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/configuration/enums/SinkType.scala
>
> Thanks
> Lionel, Liu
>
> From: Jeff Zemerick
> Sent: 2019年2月10日 23:56
> To: [email protected]
> Subject: Re: publish data to new Kafka topic
>
> Yes, a "data filter" describes it well. I think what would work would be
> if there could be a Boolean property on a rule saying that if the rule
> fails, the matching data is filtered out (by redirecting it to a separate
> Kafka topic). Since Griffin is focused on data quality measurement, that
> type of functionality might be out of scope for Griffin.
>
> Thanks,
> Jeff
>
> On Sun, Feb 10, 2019 at 9:42 AM Lionel, Liu <[email protected]> wrote:
>
> > Hi Jeff,
> >
> > Seems like you're looking for a data filter. Originally, Griffin
> > calculates data quality measures like accuracy and profiling, and the
> > output of Griffin would be the data quality metrics.
> >
> > In your case, what kind of data quality do you want to check? How do
> > you define the success or failure of your data?
> >
> > Thanks
> > Lionel, Liu
> >
> > From: Jeff Zemerick <[email protected]>
> > Sent: 2019年2月6日 21:41
> > To: [email protected]
> > Subject: publish data to new Kafka topic
> >
> > Hi Griffin devs,
> >
> > Continuing my email thread from users@ and to better clarify it, I have
> > a Kafka topic with JSON data on it. I would like to perform quality
> > checks on this data, and I would like for data that meets the quality
> > checks to be published to a separate Kafka topic, while data that fails
> > one or more quality checks is left on the original Kafka topic. Is
> > something like this possible with Griffin? Please let me know if my
> > use-case is not clear.
> >
> > Thanks,
> > Jeff
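[Editor's note: the custom Kafka sink Lionel describes could be sketched roughly as below. This is a hypothetical outline only: the `RecordSink` trait and its `sinkRecords` signature here are simplified stand-ins, not Griffin's actual `org.apache.griffin.measure.sink.Sink` API, and the injected `send` function stands in for a real `org.apache.kafka.clients.producer.KafkaProducer` so the sketch runs without a broker.]

```scala
import scala.collection.mutable.ListBuffer

// Simplified stand-in for Griffin's sink abstraction (assumption, not the real trait).
trait RecordSink {
  def sinkRecords(records: Iterable[String]): Unit
}

// A Kafka-style record sink. `send` is injected so the sketch is self-contained;
// with the real Kafka client, `send` would wrap
// producer.send(new ProducerRecord(topic, value)).
class KafkaRecordSink(topic: String, send: (String, String) => Unit) extends RecordSink {
  override def sinkRecords(records: Iterable[String]): Unit =
    records.foreach(r => send(topic, r))
}

object KafkaRecordSinkDemo {
  def main(args: Array[String]): Unit = {
    // Collect (topic, value) pairs in place of a real producer.
    val sent = ListBuffer.empty[(String, String)]
    val sink = new KafkaRecordSink("dq-failed-records", (t, v) => sent += ((t, v)))
    // These records stand in for rows matched by the "spark-sql" rule above.
    sink.sinkRecords(Seq("""{"age": 120}""", """{"age": 150}"""))
    println(s"sent ${sent.size} records to topic ${sent.head._1}")
  }
}
```

A real implementation would also need to be registered with Griffin's sink configuration (see the SinkType enum linked above) so it can be selected from env.json.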
