This is an automated email from the ASF dual-hosted git repository.

hez pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-devlake.git


The following commit(s) were added to refs/heads/main by this push:
     new a7fd6efb1 docs: Update README.md (#4952)
a7fd6efb1 is described below

commit a7fd6efb1bbca928f1c0dd04940d8019fed115f0
Author: Camille Teruel <[email protected]>
AuthorDate: Wed Apr 19 03:06:41 2023 +0200

    docs: Update README.md (#4952)
    
    Co-authored-by: Camille Teruel <[email protected]>
---
 backend/python/README.md | 228 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 178 insertions(+), 50 deletions(-)

diff --git a/backend/python/README.md b/backend/python/README.md
index 85093847e..d0658c94a 100644
--- a/backend/python/README.md
+++ b/backend/python/README.md
@@ -1,12 +1,12 @@
 # Pydevlake
 
-A framework to write data collection plugins for 
[DevLake](https://devlake.apache.org/). The framework source code
-can be found in [here](./pydevlake) and the plugin source code 
[here](./pydevlake).
+Pydevlake is a framework for writing data collection plugins for 
[DevLake](https://devlake.apache.org/). The framework source code
+can be found [here](./pydevlake).
 
 
 # How to create a new plugin
 
-## Create plugin project
+## Create the plugin project
 
 
 Make sure you have [Poetry](https://python-poetry.org/docs/#installation) 
installed.
@@ -15,7 +15,7 @@ This will generate a new directory for your plugin.
 
 Open the `pyproject.toml` file and add the following line at the end of the 
`[tool.poetry.dependencies]` section:
 ```
-pydevlake = { path = "../../pydevlake", develop = false }
+pydevlake = { path = "../../pydevlake", develop = true }
 ```
 
 Now run `poetry install`.
@@ -25,24 +25,40 @@ Now run `poetry install`.
 Create a `main.py` file with the following content:
 
 ```python
-from pydevlake import Plugin, Connection
+from typing import Iterable
 
+import pydevlake as dl
 
-class MyPluginConnection(Connection):
+
+class MyPluginConnection(dl.Connection):
     pass
 
 
-class MyPlugin(Plugin):
-    @property
-    def connection_type(self):
-        return MyPluginConnection
+class MyPluginTransformationRule(dl.TransformationRule):
+    pass
 
-    def test_connection(self, connection: MyPluginConnection):
-        pass
 
-    @property
-    def streams(self):
-        return []
+class MyPluginToolScope(dl.ToolScope):
+    pass
+
+
+class MyPlugin(dl.Plugin):
+    connection_type = MyPluginConnection
+    transformation_rule_type = MyPluginTransformationRule
+    tool_scope_type = MyPluginToolScope
+    streams = []
+
+    def domain_scopes(self, tool_scope: MyPluginToolScope) -> 
Iterable[dl.DomainScope]:
+        ...
+
+    def remote_scope_groups(self, connection: MyPluginConnection) -> 
Iterable[dl.RemoteScopeGroup]:
+        ...
+
+    def remote_scopes(self, connection, group_id: str) -> 
Iterable[MyPluginToolScope]:
+        ...
+
+    def test_connection(self, connection: MyPluginConnection):
+        ...
 
 
 if __name__ == '__main__':
@@ -50,20 +66,122 @@ if __name__ == '__main__':
 ```
 
 This file is the entry point to your plugin.
-It specifies three things:
-- the parameters that your plugin needs to collect data, e.g. the url and 
credentials to connect to the datasource or custom options
-- how to validate that some given parameters allows to connect to the 
datasource, e.g. test whether the credentials are correct
-- the list of data streams that this plugin can collect
+It specifies three datatypes:
+- A connection that groups the parameters that your plugin needs to collect 
data, e.g. the url and credentials to connect to the datasource
+- A transformation rule that groups the parameters that your plugin uses to 
convert some data, e.g. regexes to match issue types from names
+- A tool layer scope type that represents the top-level entity of this plugin, 
e.g. a board, a repository, a project, etc.
+
+The plugin class declares its connection, transformation rule, and tool scope 
types.
+It also declares its list of streams, and is responsible for defining four 
methods that we'll cover below.
 
 
 ### Connection parameters
 
-The parameters of your plugin are defined as class attributes of the 
connection class.
-For example to add a `url` parameter of type `str` edit `MyPLuginConnection` 
as follow:
+The parameters of your plugin are split between those required to connect to 
the datasource, which are grouped in your connection class,
+and those used to customize the conversion to domain models, which are grouped 
in your transformation rule class.
+For example, to add `url` and `token` parameters, edit `MyPluginConnection` as 
follows:
 
 ```python
 class MyPluginConnection(Connection):
     url: str
+    token: str
+```
+
+All plugin methods that have a connection parameter will be called with an 
instance of this class.
+Note that you should not define `__init__`.
+
+### Transformation rule parameters
+
+
+Transformation rules are used to customize the conversion of data from the 
tool layer to the domain layer. For example, you can define a regex to match 
issue type from issue name.
+
+```python
+class MyPluginTransformationRule(dl.TransformationRule):
+    issue_type_regex: str
+```
+
+Not all plugins need transformation rules, so you can omit this class.
+
+
+### Tool scope type
+
+The tool scope type is the top-level entity type of your plugin.
+For example, a board, a repository, a project, etc.
+A scope belongs to a connection, and all other collected entities are 
related to a scope.
+For example, a plugin for Jira will have a tool scope type of `Board`, and all 
other entities, such as issues, will belong to a single board.
+
+
+### Implement domain_scopes method
+
+
+The `domain_scopes` method should return the list of domain scopes that are 
related to a given tool scope. Usually, this consists of a single domain scope, 
but it can be more than one for plugins that collect data from multiple domains.
+
+
+```python
+from pydevlake.domain_layer.devops import CicdScope
+...
+
+class MyPlugin(dl.Plugin):
+    ...
+
+    def domain_scopes(self, tool_scope: MyPluginToolScope) -> 
Iterable[dl.DomainScope]:
+        yield CicdScope(
+            name=tool_scope.name,
+            description=tool_scope.description,
+            url=tool_scope.url,
+        )
+
+```
+
+
+### Implement the `remote_scopes` and `remote_scope_groups` methods
+
+These two methods are used by DevLake to list the available scopes in the 
datasource.
+The `remote_scope_groups` method should return a list of scope "groups", and 
the `remote_scopes` method should return the list of tool scopes in a given 
group.
+
+
+```python
+class MyPlugin(dl.Plugin):
+    ...
+
+    def remote_scope_groups(self, connection: MyPluginConnection) -> 
Iterable[dl.RemoteScopeGroup]:
+        api = ...
+        response = ...
+        for raw_group in response:
+            yield dl.RemoteScopeGroup(
+                id=raw_group.id,
+                name=raw_group.name,
+            )
+
+    def remote_scopes(self, connection, group_id: str) -> 
Iterable[MyPluginToolScope]:
+        api = ...
+        response = ...
+        for raw_scope in response:
+            yield MyPluginToolScope(
+                id=raw_scope['id'],
+                name=raw_scope['name'],
+                description=raw_scope['description'],
+                url=raw_scope['url'],
+            )
+```
+
+### Implement `test_connection` method
+
+The `test_connection` method checks whether a given connection is valid, e.g. 
that its credentials are correct.
+If the connection is not valid, it should raise an exception.
+
+```python
+class MyPlugin(dl.Plugin):
+    ...
+
+    def test_connection(self, connection: MyPluginConnection):
+        api = ...
+        response = ...
+        if response.status_code == 401:
+            raise Exception("Invalid credentials")
+        if response.status_code != 200:
+            raise Exception(f"Connection error {response}")
 ```
 
 
@@ -92,6 +210,9 @@ class User(ToolModel, table=True):
 Your tool model must declare at least one attribute as a primary key, like 
`id` in the example above.
 It must inherit from `ToolModel`, which in turn inherits from `SQLModel`, the 
base class of an [ORM of the same name](https://sqlmodel.tiangolo.com/).
 You can use `SQLModel` features like [declaring relationships with other 
models](https://sqlmodel.tiangolo.com/tutorial/relationship-attributes/).
+Do not forget `table=True`, otherwise no table will be created in the 
database. You can omit it for abstract model classes.
+
+To facilitate or even eliminate extraction, your tool models should stay close 
to the raw data you collect. Note that if you collect data from a JSON REST API 
that uses camelCased properties, you can still define snake_cased attributes in 
your model. The camelCased attribute aliases will be generated automatically, 
so no special care is needed during extraction.
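As an illustration of that aliasing, here is a hedged, stdlib-only sketch (the `to_camel_case` helper below is a stand-in, not pydevlake's actual implementation) of how snake_cased attributes can pick up camelCased raw keys:

```python
# Stand-in for the aliasing pydevlake performs; not the actual implementation.
def to_camel_case(snake: str) -> str:
    """Convert a snake_case name to its camelCase alias."""
    head, *tail = snake.split("_")
    return head + "".join(word.capitalize() for word in tail)

# A model attribute `issue_type` would accept the raw API key `issueType`.
raw = {"issueType": "bug", "createdAt": "2023-04-19"}
fields = ["issue_type", "created_at"]
parsed = {f: raw[to_camel_case(f)] for f in fields}
print(parsed)  # {'issue_type': 'bug', 'created_at': '2023-04-19'}
```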
 
 
 ### Create the stream class
@@ -100,18 +221,21 @@ Create a new file for your first stream in a `streams` 
directory.
 
 ```python
-from pydevlake import Stream, DomainType
-from pydevlake.domain_layer.crossdomain import User as DomainUser
+from pydevlake import Stream
+import pydevlake.domain_layer.crossdomain as cross
 
-from myplugin.models import User as ToolUser
+from myplugin.models import User
 
 
 class Users(Stream):
-    tool_model = ToolUser
-    domain_types = [DomainType.CROSS]
+    tool_model = User
+    domain_models = [cross.User]
 
     def collect(self, state, context) -> Iterable[Tuple[object, dict]]:
         pass
 
+    def extract(self, raw_data, context) -> User:
+        pass
+
-    def convert(self, user: ToolUser, context) -> Iterable[DomainUser]:
+    def convert(self, user: User, context) -> Iterable[cross.User]:
         pass
 ```
@@ -119,13 +243,17 @@ class Users(Stream):
 This stream will collect raw user data, e.g. as parsed JSON objects, extract 
this raw data as your
 tool-specific user model, then convert them into domain-layer user models.
 
-The `tool_model` class attribute declares the tool model class that is 
extracted by this strem.
-The `domain_types` class attribute is a list of domain types this stream is 
about.
+The `tool_model` class attribute declares the tool model class that is 
extracted by this stream.
+The `domain_models` class attribute is a list of domain models that are 
converted from the tool model.
+Most of the time, you will convert a tool model into a single domain model, 
but you may need to convert it into multiple domain models.
 
 The `collect` method takes a `state` dictionary and a context object and 
yields tuples of raw data and new state.
 The last state that the plugin yielded for a given connection will be reused 
during the next collection.
 The plugin can use this `state` to store information necessary to perform 
incremental collection of data.
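A minimal stdlib-only sketch of how such a state can drive incremental collection (the records, the `updated` field, and the `last_updated` key are all hypothetical; a real `collect` would page through the datasource API):

```python
from typing import Iterable, Tuple

# Hypothetical records a datasource API could return, oldest first.
RECORDS = [
    {"id": 1, "updated": "2023-01-01"},
    {"id": 2, "updated": "2023-02-01"},
    {"id": 3, "updated": "2023-03-01"},
]

def collect(state: dict) -> Iterable[Tuple[dict, dict]]:
    """Yield (raw_data, new_state) tuples, skipping already-collected records."""
    since = state.get("last_updated", "")
    for record in RECORDS:
        if record["updated"] > since:
            # The yielded state is persisted and passed back on the next run.
            yield record, {"last_updated": record["updated"]}

# The first run collects everything; the second resumes from the saved state.
first_run = list(collect({}))
last_state = first_run[-1][1]
second_run = list(collect(last_state))
print(len(first_run), len(second_run))  # 3 0
```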
 
+The `extract` method takes a raw data object and a context object and returns 
a tool model. This method has a default implementation that uses the 
`tool_model` class attribute to create a new instance of the tool model and set 
its attributes from the raw data (`self.tool_model(**raw_data)`).
+If the raw data collected from the datasource is simple enough and well 
aligned with your tool model, you can omit this method.
+Otherwise, you can override it to deal with e.g. nested data structures.
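For instance, here is a hedged sketch of the flattening an overridden `extract` might perform (the payload shape and attribute names are made up for illustration; a real stream would finish with `self.tool_model(**flat)`):

```python
# Hypothetical nested payload; the default extract cannot map "author"
# onto flat tool model attributes.
raw_data = {
    "id": 42,
    "body": "Looks good to me",
    "author": {"name": "alice", "email": "[email protected]"},
}

def extract(raw_data: dict) -> dict:
    """Flatten the nested author object into the flat attributes a tool model expects."""
    flat = {k: v for k, v in raw_data.items() if k != "author"}
    flat["author_name"] = raw_data["author"]["name"]
    flat["author_email"] = raw_data["author"]["email"]
    return flat  # a real stream would return self.tool_model(**flat)

comment = extract(raw_data)
print(comment["author_name"])  # alice
```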
 
 The `convert` method takes a tool-specific user model and converts it into 
domain-level user models.
 Here the two models align quite well, so the conversion is trivial:
@@ -139,6 +267,30 @@ def convert(self, user: ToolUser, context: Context) -> 
Iterable[DomainUser]:
     )
 ```
 
+
+#### Substreams
+
+Sometimes, a datasource is organized hierarchically. For example, in Jira an 
issue has many comments.
+In this case, you can create a substream to collect the comments of an issue.
+A substream is a stream that is executed for each element of a parent stream.
+The parent tool model, in our example an issue, is passed to the substream's 
`collect` method as the `parent` argument.
+
+```python
+import pydevlake as dl
+import pydevlake.domain_layer.ticket as ticket
+
+from myplugin.models import Issue, IssueComment
+from myplugin.streams.issues import Issues
+
+class Comments(dl.Substream):
+    tool_model = IssueComment
+    domain_models = [ticket.IssueComment]
+    parent_stream = Issues
+
+    def collect(self, state, context, parent: Issue) -> Iterable[Tuple[object, 
dict]]:
+        ...
+```
+
+
 ### Create an API wrapper
 
 Let's assume that your datasource is a REST API.
@@ -313,30 +465,6 @@ poetry run myplugin/main.py $CTX users 3>&1
 ```
 
 
-# Test the plugin with DevLake
-
-To test your plugin together with DevLake, you first need to create a 
connection for your plugin and get its id.
-One easy way to do that is to run DevLake with `make dev` and then to create 
the connection with a POST
-request to your plugin connection API:
-
-```console
-curl -X 'POST' \
-  'http://localhost:8080/plugins/myplugin/connections' \
-  -d '{...connection JSON object...}'
-```
-
-You should get the created connection with his id (which is 1 for the first 
created connection) in the response.
-
-Now that a connection for your plugin exists in DevLake database, we can try 
to run your plugin using `backend/server/services/remote/run/run.go` script:
-
-```console
-cd backend
-go run server/services/remote/run/run.go  -c 1 -p 
python/plugins/myplugin/myplugin/main.py
-```
-
-This script takes a connection id (`-c` flag) and the path to your plugin 
`main.py` file (`-p` flag).
-You can also send options as a JSON object (`-o` flag).
-
 # Automated tests
Make sure you have unit tests written for your plugin code. The test files 
should end with `_test.py`; they are discovered and
executed by the `run_tests.sh` script as part of the CICD automation. The test 
files should be placed inside the plugin project directory.
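For example, a hypothetical `streams_test.py` inside the plugin project directory could look like this (the helper under test is made up; real tests would target your plugin's streams and API wrapper):

```python
# Contents of a hypothetical myplugin/streams_test.py.
# Discovered by run_tests.sh because the filename ends with _test.py.

def issue_type_from_name(name: str) -> str:
    """Hypothetical helper that classifies an issue from its name."""
    return "BUG" if "bug" in name.lower() else "REQUIREMENT"

def test_issue_type_from_name():
    assert issue_type_from_name("Fix login bug") == "BUG"
    assert issue_type_from_name("Add dark mode") == "REQUIREMENT"

# Called directly here for illustration; the test runner discovers it by name.
test_issue_type_from_name()
```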
