Hi Lionel,

The entry point for my data flow is CSV files, on which I want to run profiling 
jobs instead of on Hive tables. These CSV files will be subjected to profiling 
and health checks before moving into the data flow. The files will reside on 
HDFS. Hence, I have a couple of questions:

Does profiling support files instead of Hive tables? If yes, can I point my 
“data.source” to an HDFS directory instead of specifying a file each time, so 
that Griffin will run the profiling job on each newly added file in that 
HDFS location?
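
For context, a rough sketch of what a file-based "data.sources" block in a 
Griffin measure config might look like. The connector type and config keys 
here are assumptions that vary by Griffin version (the avro connector is one 
that shipped with griffin-measure at the time; the directory-plus-wildcard 
path is the behavior being asked about, not a confirmed feature):

```json
{
  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "type": "avro",
          "version": "1.7",
          "config": {
            "file.path": "hdfs:///data/incoming/",
            "file.name": "part-*.avro"
          }
        }
      ]
    }
  ]
}
```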

Thank you,
Karan Gupta


From: Lionel Liu <[email protected]>
Sent: Monday, May 28, 2018 1:58 PM
To: Karan Gupta <[email protected]>; [email protected]
Subject: Re: Apache Griffin Profiling

Hi Karan,

Do you mean that you want to put your profiling config files in a HDFS 
directory, and let griffin scan the directory to get the config files at run 
time?

The Griffin measure module doesn't support this at present. You can refer to 
the code entry point and implement your own param file reader if you want to 
do that:
https://github.com/apache/incubator-griffin/blob/master/measure/src/main/scala/org/apache/griffin/measure/Application.scala#L170
https://github.com/apache/incubator-griffin/tree/master/measure/src/main/scala/org/apache/griffin/measure/config/reader
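
A rough sketch of what such a custom reader could look like, modeled on the 
readers in the config.reader package linked above. This is pseudocode against 
an assumed API: the `ParamReader` trait, `Param` type, and `JsonUtil` helper 
names are taken from the Griffin codebase of that era and may differ by 
version.

```scala
import scala.util.Try
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical reader that scans an HDFS directory and loads the most
// recently added config file, instead of taking a fixed file path.
case class ParamHdfsDirReader(dirPath: String) extends ParamReader {
  def readConfig[T <: Param](implicit m: Manifest[T]): Try[T] = Try {
    val fs = FileSystem.get(new Configuration())
    // pick the newest file in the directory by modification time
    val latest = fs.listStatus(new Path(dirPath))
      .filter(_.isFile)
      .maxBy(_.getModificationTime)
    val in = fs.open(latest.getPath)
    try JsonUtil.fromJson[T](in)   // Griffin's JSON helper, assumed available
    finally in.close()
  }
}
```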

But in my opinion, it may not be appropriate to do such work in the measure 
module. This seems like scheduling work that should happen before submitting 
Griffin jobs.

Thanks,
Lionel


On Mon, May 28, 2018 at 3:21 PM, Karan Gupta 
<[email protected]<mailto:[email protected]>> wrote:
Hi Lionel,

Thank you for your response. I created a single custom rule for multiple 
sources. Now I am trying to run profiling jobs where my source is not tightly 
coupled inside a rule: I want to run profiling jobs by just pointing to an 
HDFS directory instead of a specific file (Griffin should pick up the file 
names from the directory at run time).
Is it possible to do that through Griffin?


Thank you,
Karan Gupta
________________________________
Any comments or statements made in this email are not necessarily those of 
Tavant Technologies. The information transmitted is intended only for the 
person or entity to which it is addressed and may contain confidential and/or 
privileged material. If you have received this in error, please contact the 
sender and delete the material from any computer. All emails sent from or to 
Tavant Technologies may be subject to our monitoring procedures.
