Thanks! I will take a look at that.

Jeff
On Sun, Feb 10, 2019 at 11:10 PM Lionel, Liu <[email protected]> wrote:

> Correct, Griffin calculates data quality metrics, but sometimes people
> also care about the "failed" data. There is a way to sink the data into
> HDFS, like the mismatched data output in accuracy:
>
> https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/step/builder/dsl/transform/AccuracyExpr2DQSteps.scala#L87
>
> Functionally, Griffin supports the "spark-sql" rule type to output the
> filtered data you want; just configure the rule like this:
>
>   {
>     "dsl.type": "spark-sql",
>     "out.dataframe.name": "failed",
>     "rule": "select * from source where age > 100",
>     "out": [
>       { "type": "record" }
>     ]
>   }
>
> The failed data would be written into the output HDFS directory
> configured in env.json.
>
> Furthermore, if you want the failed data written to a Kafka topic, that
> is a new sink type, which Griffin does not currently support; you would
> need to implement a new sink type to sink records:
>
> https://github.com/apache/griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/sink
> https://github.com/apache/griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/configuration/enums/SinkType.scala
>
> Thanks
> Lionel, Liu
>
> From: Jeff Zemerick
> Sent: 2019年2月10日 23:56
> To: [email protected]
> Subject: Re: publish data to new Kafka topic
>
> Yes, a "data filter" describes it well. I think what would work would be
> if there could be a Boolean property on a rule saying that if the rule
> fails, the matching data is filtered out (by redirecting it to a separate
> Kafka topic). Since Griffin is focused on data quality measurement, that
> type of functionality might be out of scope for Griffin.
>
> Thanks,
> Jeff
>
> On Sun, Feb 10, 2019 at 9:42 AM Lionel, Liu <[email protected]> wrote:
>
> > Hi Jeff,
> >
> > Seems like you're looking for a data filter. Originally, Griffin
> > calculates data quality measures like accuracy and profiling, and the
> > output of Griffin would be the data quality metrics.
> >
> > In your case, what kind of data quality do you want to check? How do
> > you define the success or failure of your data?
> >
> > Thanks
> > Lionel, Liu
> >
> > From: Jeff Zemerick <[email protected]>
> > Sent: 2019年2月6日 21:41
> > To: [email protected]
> > Subject: publish data to new Kafka topic
> >
> > Hi Griffin devs,
> >
> > Continuing my email thread from users@ and to better clarify it, I have
> > a Kafka topic with JSON data on it. I would like to perform quality
> > checks on this data, and I would like for data that meets the quality
> > checks to be published to a separate Kafka topic, while data that fails
> > one or more quality checks is left on the original Kafka topic. Is
> > something like this possible with Griffin? Please let me know if my
> > use-case is not clear.
> >
> > Thanks,
> > Jeff
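[Editor's note: the custom Kafka sink Lionel describes could be sketched roughly as below. This is a hypothetical outline only: the `RecordSink` trait and its `sinkRecords` signature here are simplified stand-ins, not Griffin's actual `org.apache.griffin.measure.sink.Sink` API, and the injected `send` function stands in for a real `org.apache.kafka.clients.producer.KafkaProducer` so the sketch runs without a broker.]

```scala
import scala.collection.mutable.ListBuffer

// Simplified stand-in for Griffin's sink abstraction (assumption, not the real trait).
trait RecordSink {
  def sinkRecords(records: Iterable[String]): Unit
}

// A Kafka-style record sink. `send` is injected so the sketch is self-contained;
// with the real Kafka client, `send` would wrap
// producer.send(new ProducerRecord(topic, value)).
class KafkaRecordSink(topic: String, send: (String, String) => Unit) extends RecordSink {
  override def sinkRecords(records: Iterable[String]): Unit =
    records.foreach(r => send(topic, r))
}

object KafkaRecordSinkDemo {
  def main(args: Array[String]): Unit = {
    // Collect (topic, value) pairs in place of a real producer.
    val sent = ListBuffer.empty[(String, String)]
    val sink = new KafkaRecordSink("dq-failed-records", (t, v) => sent += ((t, v)))
    // These records stand in for rows matched by the "spark-sql" rule above.
    sink.sinkRecords(Seq("""{"age": 120}""", """{"age": 150}"""))
    println(s"sent ${sent.size} records to topic ${sent.head._1}")
  }
}
```

A real implementation would also need to be registered with Griffin's sink configuration (see the SinkType enum linked above) so it can be selected from env.json.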
