Hi Nidhin,

First of all, thanks for your questions. My replies are inline below.

*https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance states right now Griffin DSL supports only hive and avro as data source and hive, json and avro as data formats. We have other data sources/formats as well. So from the documentation what I understood is if Griffin DSL is not supported I can use spark-sql. Is that correct? So using spark sql can I do the similar kind of configuration for a parquet file residing in s3 and get the metrics?*

Yes, if Griffin DSL cannot support your requirements, you can always use spark-sql for your use cases. For testing in an S3 environment, xuexu is working on this ticket:
https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-217

*Griffin persists the monitored metrics in elastic cache? If so can I configure it to use an elastic cache which is outside griffin docker? Can you point me to the documentation for that?*

You are right, Griffin uses Elasticsearch as its metrics storage. You can point it at your own Elasticsearch instance in application.properties:
https://github.com/apache/griffin/blob/master/service/src/main/resources/application.properties#L56-L58

*On the similar note, we are a complete aws shop, any of the active users use griffin in aws? Is there any documentation available? If griffin submits the spark job via livy, I think it should be okay even if we use emr right?*

We don't have an AWS environment in production; we run on our own private cloud. As mentioned above, xuexu is investigating this. Perhaps you could help us deploy on AWS and contribute to this part.

*How can I do Completeness, Consistency and Validity measures? Is it a future road map item? If so when do you have an GA dates?*

For completeness, consistency, and validity, the logic is the same as for accuracy: you just specify the rule and Griffin should take care of it. Could you tell us your completeness, consistency, and validity requirements in detail?
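
To make the "just specify the rule" idea concrete, here is a minimal sketch of what a completeness rule written as plain SQL might compute. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a spark-sql engine, and the table name `source`, the columns `user_id` and `email`, the sample rows, and the helper `completeness_metrics` are all hypothetical. It is not Griffin's own configuration format or implementation.

```python
import sqlite3

# Hypothetical sample data; the table and column names are illustrative only.
ROWS = [
    (1, "a@example.com"),
    (2, None),
    (3, "c@example.com"),
    (4, None),
]

# A completeness rule expressed as plain SQL: count all rows and the rows
# where the checked column is missing (NULL).
COMPLETENESS_SQL = """
SELECT COUNT(*) AS total,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS incomplete
FROM source
"""


def completeness_metrics(rows):
    """Load rows into an in-memory table and evaluate the completeness rule."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE source (user_id INTEGER, email TEXT)")
        conn.executemany("INSERT INTO source VALUES (?, ?)", rows)
        total, incomplete = conn.execute(COMPLETENESS_SQL).fetchone()
    finally:
        conn.close()
    # SUM over an empty table yields NULL, so fall back to 0.
    incomplete = incomplete or 0
    return {"total": total, "incomplete": incomplete, "complete": total - incomplete}


if __name__ == "__main__":
    print(completeness_metrics(ROWS))
```

A validity or consistency check could follow the same pattern, with the `CASE WHEN` condition swapped for whatever rule matches your requirement (for example, a range or format check on a column).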
If the Apache Griffin community thinks we need to add external code to support completeness, consistency, and validity, we will discuss it on the mailing list.

Thanks,
William

On Fri, Dec 7, 2018 at 1:31 PM Karunakaran Ponon, Nidhin (HBO) <[email protected]> wrote:

> Hi,
>
> I work in HBO’s Data Engineering team. We are evaluating multiple tools as
> part of implementing our Data Quality framework. I came across Griffin and
> it looks very promising. I have couple of doubts. It would be great if you
> can clarify them. And our use cases are mostly batch for now.
>
>
> 1. https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance
> states right now Griffin DSL supports only hive and avro as data source and
> hive, json and avro as data formats. We have other data sources/formats as
> well. So from the documentation what I understood is if Griffin DSL is not
> supported I can use spark-sql. Is that correct? So using spark sql can I do
> the similar kind of configuration for a parquet file residing in s3 and get
> the metrics?
> 2. Griffin persists the monitored metrics in elastic cache? If so can I
> configure it to use an elastic cache which is outside griffin docker? Can
> you point me to the documentation for that?
> 3. On the similar note, we are a complete aws shop, any of the active
> users use griffin in aws? Is there any documentation available? If griffin
> submits the spark job via livy, I think it should be okay even if we use
> emr right?
> 4. How can I do Completeness, Consistency and Validity measures? Is it
> a future road map item? If so when do you have an GA dates?
>
> Also we are happy to contribute if it adds value to our requirement.
>
> Thanks,
> Nidhin
>
