aplex commented on a change in pull request #3178:
URL: https://github.com/apache/incubator-gobblin/pull/3178#discussion_r551674731
##########
File path:
gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
package org.apache.gobblin.dataset;
-import java.util.Map;
-
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.Maps;
-
import lombok.Getter;
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
/**
* A {@link Descriptor} identifies and provides metadata to describe a dataset
*/
public class DatasetDescriptor extends Descriptor {
private static final String PLATFORM_KEY = "platform";
private static final String NAME_KEY = "name";
+ private static final String CLUSTER_NAME_KEY = "clusterName";
/**
* which platform the dataset is stored, for example: local, hdfs, oracle,
mysql, kafka
*/
@Getter
private final String platform;
+ /**
+ * Human-readeable cluster name.
+ *
+ * @see org.apache.gobblin.util.ClustersNames
+ */
+ @Getter
+ @Nullable
+ private final String clusterName;
Review comment:
Initially, I was thinking about putting a well-known cluster name like
"snap" into this field, and putting a DNS name like "datalake.corp.com" for
unknown clusters. Lineage events would then include this name as-is, and it
could be displayed to users directly. We already use the "cluster name" concept
in org.apache.gobblin.util.ClustersNames to get nice names from yarn/hdfs
urls.
However, this well-known name is not easily convertible back to a domain
name: a single well-known name can map to different urls for hdfs/hive. So
there could be a benefit in keeping a hostname instead of a cluster name, and
converting it to a human-readable name before reporting lineage.
DatasetDescriptor already looks like a dataset URL with non-standard field
names:
- platform is like url.protocol - "hdfs", "kafka", "mysql"
- name is like url.path - "table1" or "/data/tracking/test123"
- clusterName is like a url.host - "my-cluster" or "my-cluster.company.com"
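The URL analogy could be sketched with java.net.URI (a hypothetical illustration only; DatasetDescriptor itself does not parse URLs, and the class and method names here are made up):

```java
import java.net.URI;

// Hypothetical sketch: how a dataset URL's parts line up with
// DatasetDescriptor-style fields (platform, clusterName, name).
public class DatasetUrlSketch {
    // like DatasetDescriptor.platform: "hdfs", "kafka", "mysql"
    public static String platformOf(String url) { return URI.create(url).getScheme(); }
    // like the proposed clusterName: "my-cluster.company.com"
    public static String clusterOf(String url)  { return URI.create(url).getHost(); }
    // like DatasetDescriptor.name: "/data/tracking/test123"
    public static String nameOf(String url)     { return URI.create(url).getPath(); }

    public static void main(String[] args) {
        String url = "hdfs://my-cluster.company.com/data/tracking/test123";
        System.out.println(platformOf(url)); // hdfs
        System.out.println(clusterOf(url));  // my-cluster.company.com
        System.out.println(nameOf(url));     // /data/tracking/test123
    }
}
```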
Regarding the fields you suggested:
- I don't think we have availability zone information anywhere.
- We can get the well-known nice name from the hostname using
org.apache.gobblin.util.ClustersNames, so we don't have to store it separately.
- Keeping port information is debatable. I would assume that any
production deployment will have separate DNS names for different clusters.
However, for small-scale, temporary deployments it's possible to have multiple
Hives or namenodes on a single host, running on different ports.
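The "keep the hostname, convert before reporting" idea could look roughly like this. This is a hypothetical stand-in for org.apache.gobblin.util.ClustersNames (the real class resolves names from configuration; the class name, the map, and the "snap" mapping here are all assumptions for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ClustersNames: resolve a stored cluster hostname
// to a well-known nice name at lineage-reporting time, falling back to the
// hostname itself when no mapping is known.
public class ClusterNameResolver {
    private final Map<String, String> wellKnownNames = new HashMap<>();

    public ClusterNameResolver() {
        // Assumed mapping, for illustration only.
        wellKnownNames.put("datalake.corp.com", "snap");
    }

    public String resolve(String hostname) {
        return wellKnownNames.getOrDefault(hostname, hostname);
    }

    public static void main(String[] args) {
        ClusterNameResolver r = new ClusterNameResolver();
        System.out.println(r.resolve("datalake.corp.com"));   // well-known name
        System.out.println(r.resolve("other.example.com"));   // hostname as-is
    }
}
```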
##########
File path:
gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
package org.apache.gobblin.dataset;
-import java.util.Map;
-
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.Maps;
-
import lombok.Getter;
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
/**
* A {@link Descriptor} identifies and provides metadata to describe a dataset
*/
public class DatasetDescriptor extends Descriptor {
Review comment:
I think inside the Gobblin codebase, we work only with dataset instances.
When we copy or ingest data, we always do it on a specific hdfs/kafka/other
cluster, rather than on an abstract multi-cluster dataset.
In practice, the DatasetDescriptor is used for lineage reporting.
The cluster name is not mandatory, for backward compatibility. In the OSS
code base, there are a bunch of Salesforce/MySql connectors that don't report
cluster names at the moment, although they could do so as well. There could
also be usages in other libraries.
I think in the final state, the cluster name should be mandatory in this
class. For the transitional period, we can get there gradually by marking the
old constructor deprecated and clusterName optional.
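The transitional scheme could be sketched like this (a minimal sketch only; the real DatasetDescriptor extends Descriptor, uses Lombok @Getter, and carries more state; the class name here is changed to keep the example self-contained):

```java
// Minimal sketch of the transitional constructor scheme: the old two-argument
// constructor stays for binary/source compatibility but is deprecated, while
// clusterName is nullable until it can be made mandatory.
public class DatasetDescriptorSketch {
    private final String platform;
    private final String clusterName; // nullable during the transitional period
    private final String name;

    /** @deprecated use {@link #DatasetDescriptorSketch(String, String, String)} */
    @Deprecated
    public DatasetDescriptorSketch(String platform, String name) {
        this(platform, null, name);
    }

    public DatasetDescriptorSketch(String platform, String clusterName, String name) {
        this.platform = platform;
        this.clusterName = clusterName;
        this.name = name;
    }

    public String getClusterName() { return clusterName; }

    public static void main(String[] args) {
        // Old callers compile unchanged (with a deprecation warning) and
        // simply report no cluster name.
        System.out.println(new DatasetDescriptorSketch("hdfs", "/data/test").getClusterName());
        System.out.println(new DatasetDescriptorSketch("hdfs", "my-cluster", "/data/test").getClusterName());
    }
}
```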
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]