Re: Griffin Clarifications

William Guo Thu, 06 Dec 2018 21:35:11 -0800

Hi Nidhin,

First of all, thanks for your questions. I have replied to them inline
below.

*https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance
states
right now Griffin DSL supports only hive and avro as data source and hive,
json and avro as data formats. We have other data sources/formats as well.
So from the documentation what I understood is if Griffin DSL is not
supported I can use spark-sql. Is that correct? So using spark sql can I do
the similar kind of configuration for a parquet file residing in s3 and get
the metrics?*

Yes, if Griffin DSL cannot support your requirements, you can always use
spark-sql for your use cases.
For testing in an S3 environment, xuexu is working on this ticket:
https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-217

*Griffin persists the monitored metrics in elastic cache? If so can I
configure it to use an elastic cache which is outside griffin docker? Can
you point me to the documentation for that?*

You are right, Griffin uses Elasticsearch as its metrics storage. You can
specify your own ES instance in application.properties:
https://github.com/apache/griffin/blob/master/service/src/main/resources/application.properties#L56-L58
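
As a rough illustration of what pointing Griffin at an external ES could
look like, here is a minimal sketch. The property names
(elasticsearch.host, elasticsearch.port, elasticsearch.scheme) are given
from memory and should be checked against the lines linked above; the host
es.example.internal and both helper functions are hypothetical. The script
only parses the properties text and prints the base URL they would resolve
to, so you can sanity-check the values before deploying.

```python
# Sample properties text. Key names are an assumption (verify them against
# the linked application.properties); the host is a made-up placeholder.
SAMPLE_PROPERTIES = """
# elasticsearch
elasticsearch.host = es.example.internal
elasticsearch.port = 9200
elasticsearch.scheme = http
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def es_base_url(props):
    """Build the base URL implied by the three Elasticsearch settings."""
    return "{}://{}:{}".format(
        props["elasticsearch.scheme"],
        props["elasticsearch.host"],
        props["elasticsearch.port"],
    )


if __name__ == "__main__":
    print(es_base_url(parse_properties(SAMPLE_PROPERTIES)))
```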

*On the similar note, we are a complete aws shop, any of the active users
use griffin in aws? Is there any documentation available? If griffin
submits the spark job via livy, I think it should be okay even if we use
emr right?*

For AWS, we don't have an AWS environment in production; we are using
our own private cloud.
As mentioned above, xuexu is investigating this issue. Maybe you can help
us deploy on AWS and contribute to this part.

*How can I do Completeness, Consistency and Validity measures? Is it a
future road map item? If so when do you have an GA dates?*

For completeness, consistency, and validity, the logic is the same as for
accuracy: you just specify the rule and Griffin should take care of it.
Could you tell us your completeness, consistency, and validity
requirements in detail?
If the Apache Griffin community thinks we need to add additional code to
support completeness, consistency, and validity, we will discuss it on the
mailing list.
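
To make "just specify the rule" a bit more concrete, here is a minimal
sketch of completeness and validity checks written as plain SQL. It is not
Griffin configuration: the queries run against an in-memory SQLite table
purely so the example is self-contained, and the table, column names, and
the 0 to 150 age range are all hypothetical. Aggregate queries of this
general shape are the kind of thing one could try as a spark-sql rule,
though that would need testing against your Griffin version.

```python
import sqlite3

# Hypothetical sample data: (id, email, age). None stands for a NULL value.
ROWS = [
    (1, "alice@example.com", 34),
    (2, None, 27),
    (3, "carol@example.com", -5),
    (4, "dave@example.com", None),
]

# Completeness: how many rows have the required column populated.
# COUNT(column) ignores NULLs, while COUNT(*) counts every row.
COMPLETENESS_SQL = """
SELECT COUNT(*) AS total, COUNT(email) AS email_present
FROM users
"""

# Validity: how many rows fall inside an allowed range.
# A NULL age does not satisfy BETWEEN, so it is counted as invalid.
VALIDITY_SQL = """
SELECT SUM(CASE WHEN age BETWEEN 0 AND 150 THEN 1 ELSE 0 END) AS age_valid
FROM users
"""


def run_metrics(rows):
    """Load rows into an in-memory table and compute both metrics."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    total, email_present = conn.execute(COMPLETENESS_SQL).fetchone()
    (age_valid,) = conn.execute(VALIDITY_SQL).fetchone()
    conn.close()
    return {"total": total, "email_present": email_present, "age_valid": age_valid}


if __name__ == "__main__":
    print(run_metrics(ROWS))
```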

Thanks,
William

On Fri, Dec 7, 2018 at 1:31 PM Karunakaran Ponon, Nidhin (HBO)
<[email protected]> wrote:

> Hi,
>
> I work in HBO’s Data Engineering team. We are evaluating multiple tools as
> part of implementing our Data Quality framework. I came across Griffin and
> it looks very promising. I have couple of doubts. It would be great if you
> can clarify them. And our use cases are mostly batch for now.
>
>
>   1.
> https://cwiki.apache.org/confluence/display/GRIFFIN/7.+Apache+Griffin+DSL+Guidance
> states right now Griffin DSL supports only hive and avro as data source and
> hive, json and avro as data formats. We have other data sources/formats as
> well. So from the documentation what I understood is if Griffin DSL is not
> supported I can use spark-sql. Is that correct? So using spark sql can I do
> the similar kind of configuration for a parquet file residing in s3 and get
> the metrics?
>   2.  Griffin persists the monitored metrics in elastic cache? If so can I
> configure it to use an elastic cache which is outside griffin docker? Can
> you point me to the documentation for that?
>   3.  On the similar note, we are a complete aws shop, any of the active
> users use griffin in aws? Is there any documentation available? If griffin
> submits the spark job via livy, I think it should be okay even if we use
> emr right?
>   4.  How can I do Completeness, Consistency and Validity measures? Is it
> a future road map item? If so when do you have an GA dates?
>
> Also we are happy to contribute if it adds value to our requirement.
>
> Thanks,
> Nidhin
>
