[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=468374=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468374 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 10/Aug/20 02:53 Start Date: 10/Aug/20 02:53 Worklog Time Spent: 10m Work Description: guoyuepeng commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-671141520 @chitralverma The merge process as following: use python 2.7 - run ./merge_pr.py - Which pull request would you like to merge? (e.g. 34): 575 - select 575 - Proceed with merging pull request #575? (y/n): y - Merge complete (local ref PR_TOOL_MERGE_PR_575_MASTER). Push to apache-git? (y/n): y - Would you like to pick 1aa8995a into another branch? (y/n): n - Would you like to update an associated JIRA? (y/n): y - Enter comma-separated fix version(s) [0.6.0]: You should have permission for this. tell me if you encounter any problem. Thanks, William This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 468374) Time Spent: 2h 10m (was: 2h) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Fix For: 0.6.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=468373=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-468373 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 10/Aug/20 02:50 Start Date: 10/Aug/20 02:50 Worklog Time Spent: 10m Work Description: asfgit closed pull request #575: URL: https://github.com/apache/griffin/pull/575 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 468373) Time Spent: 2h (was: 1h 50m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=467097=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467097 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 06/Aug/20 06:41 Start Date: 06/Aug/20 06:41 Worklog Time Spent: 10m Work Description: chitralverma commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-669736481 absolutely, I'm all in for Griffin. :) @wankunde can you please merge this. Also, can you tell me how the requests are merged for this project so that I can help close some of the open PRs. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 467097) Time Spent: 1h 50m (was: 1h 40m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=467088=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-467088 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 06/Aug/20 06:20 Start Date: 06/Aug/20 06:20 Worklog Time Spent: 10m Work Description: wankunde commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-669727067 LGTM, @chitralverma Many thanks for your work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 467088) Time Spent: 1h 40m (was: 1.5h) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=452026=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-452026 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 28/Jun/20 12:25 Start Date: 28/Jun/20 12:25 Worklog Time Spent: 10m Work Description: wankunde commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-650745454 Hi, @chitralverma , could you provide an implementation example of the `open` and `close` methods? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 452026) Time Spent: 1h 10m (was: 1h) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=450431=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-450431 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 24/Jun/20 13:17 Start Date: 24/Jun/20 13:17 Worklog Time Spent: 10m Work Description: chitralverma commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-648813731 @wankunde the `open` and `close` methods are for future custom sinks implementations, for example, Redis, JDBC etc that do not rely on spark datasource v1/ v2. Such data sources require one-time initialization of connection/ connection pool which can then be serialized to all executor each time the write operation is called. This PR also acts as a basic cleanup for structured streaming sinks which I'm working on. I'm also planning to rewrite HDFSSink as FileBasedSink much like FileBasedDataConnector and include many other sinks. Griffin is going to get really exciting. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 450431) Time Spent: 1h (was: 50m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=450428=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-450428 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 24/Jun/20 13:11 Start Date: 24/Jun/20 13:11 Worklog Time Spent: 10m Work Description: chitralverma commented on a change in pull request #575: URL: https://github.com/apache/griffin/pull/575#discussion_r444881499 ## File path: measure/src/main/scala/org/apache/griffin/measure/sink/Sink.scala ## @@ -18,30 +18,57 @@ package org.apache.griffin.measure.sink import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame import org.apache.griffin.measure.Loggable /** - * sink metric and record + * Base trait for batch and Streaming Sinks. + * To implement custom sinks, extend your classes with this trait. */ trait Sink extends Loggable with Serializable { - val metricName: String + + val jobName: String Review comment: absolutely, I had the same in mind but I was planning to do it as part of a separate config refactoring in the near future. Do you suggest I do this right now or later? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 450428) Time Spent: 50m (was: 40m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=450426=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-450426 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 24/Jun/20 13:10 Start Date: 24/Jun/20 13:10 Worklog Time Spent: 10m Work Description: chitralverma commented on a change in pull request #575: URL: https://github.com/apache/griffin/pull/575#discussion_r444880634 ## File path: measure/src/main/scala/org/apache/griffin/measure/sink/Sink.scala ## @@ -18,30 +18,57 @@ package org.apache.griffin.measure.sink import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame import org.apache.griffin.measure.Loggable /** - * sink metric and record + * Base trait for batch and Streaming Sinks. + * To implement custom sinks, extend your classes with this trait. */ trait Sink extends Loggable with Serializable { - val metricName: String + + val jobName: String val timeStamp: Long val config: Map[String, Any] val block: Boolean - def available(): Boolean + /** + * Ensures that the pre-requisites (if any) of the Sink are met before opening it. + */ + def validate(): Boolean - def start(msg: String): Unit - def finish(): Unit + /** + * Allows initialization of the connection to the sink (if required). + * + * @param applicationId Spark Application ID + */ + def open(applicationId: String): Unit Review comment: @wankunde this is as per the existing implementation. I just changed the variable names to remove ambiguity and made no functional change. This has been done in many other places also. I'll refactor the applicationId in favor of more description soon. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 450426) Time Spent: 40m (was: 0.5h) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=450388=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-450388 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 24/Jun/20 11:05 Start Date: 24/Jun/20 11:05 Worklog Time Spent: 10m Work Description: wankunde commented on a change in pull request #575: URL: https://github.com/apache/griffin/pull/575#discussion_r444802163 ## File path: measure/src/main/scala/org/apache/griffin/measure/sink/Sink.scala ## @@ -18,30 +18,57 @@ package org.apache.griffin.measure.sink import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame import org.apache.griffin.measure.Loggable /** - * sink metric and record + * Base trait for batch and Streaming Sinks. + * To implement custom sinks, extend your classes with this trait. */ trait Sink extends Loggable with Serializable { - val metricName: String + + val jobName: String val timeStamp: Long val config: Map[String, Any] val block: Boolean - def available(): Boolean + /** + * Ensures that the pre-requisites (if any) of the Sink are met before opening it. + */ + def validate(): Boolean - def start(msg: String): Unit - def finish(): Unit + /** + * Allows initialization of the connection to the sink (if required). + * + * @param applicationId Spark Application ID + */ + def open(applicationId: String): Unit Review comment: What's the use of `applicationId `? Can we use `jobName` instead? ## File path: measure/src/main/scala/org/apache/griffin/measure/sink/Sink.scala ## @@ -18,30 +18,57 @@ package org.apache.griffin.measure.sink import org.apache.spark.rdd.RDD +import org.apache.spark.sql.DataFrame import org.apache.griffin.measure.Loggable /** - * sink metric and record + * Base trait for batch and Streaming Sinks. + * To implement custom sinks, extend your classes with this trait. */ trait Sink extends Loggable with Serializable { - val metricName: String + + val jobName: String Review comment: It's better to unify the names of variable, and easier to understand. In `DQConfig` is `name`, in `BatchDQApp` is `metricName`, in `DQContext` is `name`, in `SinkFactory` is `jobName`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 450388) Time Spent: 0.5h (was: 20m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=447980=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447980 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 18/Jun/20 18:48 Start Date: 18/Jun/20 18:48 Worklog Time Spent: 10m Work Description: chitralverma commented on pull request #575: URL: https://github.com/apache/griffin/pull/575#issuecomment-646243010 @guoyuepeng @wankunde Can you please review this. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 447980) Time Spent: 20m (was: 10m) > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (GRIFFIN-305) Standardize Sink Hierarchy
[ https://issues.apache.org/jira/browse/GRIFFIN-305?focusedWorklogId=447342=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-447342 ] ASF GitHub Bot logged work on GRIFFIN-305: -- Author: ASF GitHub Bot Created on: 17/Jun/20 15:06 Start Date: 17/Jun/20 15:06 Worklog Time Spent: 10m Work Description: chitralverma opened a new pull request #575: URL: https://github.com/apache/griffin/pull/575 **What changes were proposed in this pull request?** Currently, the implementation of `Sinks` in Griffin poses the below issues. This PR aims at fixing these issues. - `Sinks` are based on the recursive MultiSink class which is a sink itself but the underlying implementation is that of a `Seq` which causes ambiguity and isn't much useful. This has been removed. - Some unused code like `SinkContext` has been removed. - Data is converted from the performant DataFrame to RDD while persisting in both streaming and batch pipelines. A new method `sinkBatchRecords` has been added to allow operations directly on DataFrame for batch pipelines. Streaming will still use the old implementation which will be replaced with structured streaming. - Refactored the methods of `Sink` like changed `start`/ `finish` to `open`/ `close` and `jobName` was incorrectly passed as `metricName`. - Presently, only one instance of a sink with a given type can be defined in the env config. This will not allow the cases where you want to configure multiple sinks of same type like HDFS or JDBC. Added sink `name` to env config which is used to define the sink that should be used in the job config also. - Updated all sinks as per the changes above. With some additional changes to ConsoleSink **Does this PR introduce any user-facing change?** Yes. As mentioned above, the sink config has changed in env and job configs. How was this patch tested? Griffin test suite and additional unit test cases This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 447342) Remaining Estimate: 0h Time Spent: 10m > Standardize Sink Hierarchy > -- > > Key: GRIFFIN-305 > URL: https://issues.apache.org/jira/browse/GRIFFIN-305 > Project: Griffin > Issue Type: Sub-task >Reporter: Chitral Verma >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)