Re: Data Catalog Search API

2023-01-19 Thread Christie, Marcus Aaron


> On Jan 18, 2023, at 12:26 PM, Thejaka Amila J Kanewala wrote:
> 
> Will be great if you could convert the design document into a google doc -- 
> it is easy to provide feedback in the google doc format.
> 

And here's the google doc link: 
https://docs.google.com/document/d/1itK9yGKlr1RSvCLyeujGM9Mts90h0TkUPPeIB5HsMT0/edit?usp=sharing



Re: Data Catalog Search API

2023-01-19 Thread Christie, Marcus Aaron

Hi Amila,

Thanks for taking a look. No apologies necessary; I think I didn't cover enough 
background.

The term "data product" is borrowed from Airavata's replica catalog. [1]  In 
Airavata, a data product is typically a user-provided input file or an 
application-generated output file. There are two types of data products, "FILE" 
and "COLLECTION", and data products can contain other data products, roughly 
mapping to a POSIX directory structure, but not limited to such. For 
Cybershuttle, we also need input and output file registration, but there are 
other sources of data that need to be captured, such as instrument-generated 
data.

A "data product" is distinct from its "replica locations", again borrowed from 
Airavata's replica catalog.  The data catalog only knows about the metadata of 
a dataset, not where and how it could be accessed.
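
To make the distinction concrete, here is a minimal sketch in Python. The class 
and field names are assumptions made for illustration; they are not taken from 
the Airavata Thrift models or the design document, and the storage URLs are 
placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DataProductType(Enum):
    FILE = "FILE"
    COLLECTION = "COLLECTION"


@dataclass
class DataProduct:
    # Hypothetical field names, chosen only for this sketch.
    product_id: str
    name: str
    product_type: DataProductType
    parent_id: Optional[str] = None  # containing data product, if any
    metadata: dict = field(default_factory=dict)


@dataclass
class ReplicaLocation:
    # Kept separate from DataProduct: the catalog side holds metadata only,
    # while "where and how to access it" lives with the replica.
    product_id: str
    storage_url: str


# A COLLECTION containing one FILE, roughly like a directory and a file.
run = DataProduct("dp-1", "run-42", DataProductType.COLLECTION)
output = DataProduct("dp-2", "output.dat", DataProductType.FILE,
                     parent_id=run.product_id)

# The same data product can have several replicas (placeholder URLs).
replicas = [
    ReplicaLocation("dp-2", "file:///scratch/run-42/output.dat"),
    ReplicaLocation("dp-2", "s3://example-bucket/run-42/output.dat"),
]
```

Note that nothing in DataProduct says where the bytes live; that is the point 
of keeping replica locations as their own record.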

> In relation to the "Data product", when we do "createMetadataSchema(..., name 
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?

No, "createDataProduct(DataProduct dataProduct)" is used to create a data 
product.  createMetadataSchema() creates a named reference to a schema for 
metadata. So in this example, "smilesdb" is a named metadata schema. A named 
metadata schema doesn't mean much until fields are added to it. Once fields are 
added to it, a data product can then be added to the metadata schema via 
addDataProductToMetadataSchema(). This indicates to the data catalog that a 
subset of this data product's metadata adheres to the "smilesdb" metadata 
schema. That is, for every field that is defined for the "smilesdb" metadata 
schema, this data product should have that field in its metadata.

An example may be helpful. Let's say we create a metadata schema named "zenodo" 
[2] and we add some fields to it

- one called "zenodo_doi" with a JSON path of "$.zenodo.doi".
- one called "zenodo_published_date" with a JSON path of 
"$.zenodo.published-date".
- etc.

Then let's say we have some data products. Some of the data products have a 
zenodo field in their JSON metadata:

{
  ...,
  "zenodo": {
    "doi": "...",
    "published-date": "..."
  },
  ...
}

For such data products, we can call addDataProductToMetadataSchema() to add 
them to the "zenodo" metadata schema. Now, for the data products added to the 
"zenodo" metadata schema, the data catalog can support queries on the 
"zenodo_doi" and "zenodo_published_date" fields. We can also, for example, add 
an index on "zenodo_published_date" to support range queries.

A data product's metadata might adhere to more than one metadata schema. Or 
none at all.
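
The zenodo example can be sketched as a small in-memory model in Python. This 
is an illustration under stated assumptions, not the data catalog's 
implementation: the class InMemoryCatalog, the method 
create_metadata_schema_field(), the search() helper, and the choice to reject a 
data product that lacks a schema field are all hypothetical, and the DOIs and 
dates are placeholders. Only the "$.a.b" form of JSON path is handled.

```python
def resolve(metadata, json_path):
    """Resolve a simple dotted path such as "$.zenodo.doi"; None if absent."""
    node = metadata
    for key in json_path.split(".")[1:]:  # drop the leading "$"
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


class InMemoryCatalog:
    def __init__(self):
        self.schemas = {}   # schema name -> {field name: JSON path}
        self.products = {}  # product id -> metadata dict
        self.members = {}   # schema name -> set of product ids

    def create_metadata_schema(self, name):
        self.schemas[name] = {}
        self.members[name] = set()

    def create_metadata_schema_field(self, schema, field_name, json_path):
        self.schemas[schema][field_name] = json_path

    def create_data_product(self, product_id, metadata):
        self.products[product_id] = metadata

    def add_data_product_to_metadata_schema(self, product_id, schema):
        metadata = self.products[product_id]
        # Assumption: enforce that every schema field is present.
        missing = [name for name, path in self.schemas[schema].items()
                   if resolve(metadata, path) is None]
        if missing:
            raise ValueError(f"{product_id} lacks fields {missing}")
        self.members[schema].add(product_id)

    def search(self, schema, field_name, predicate):
        path = self.schemas[schema][field_name]
        return sorted(pid for pid in self.members[schema]
                      if predicate(resolve(self.products[pid], path)))


catalog = InMemoryCatalog()
catalog.create_metadata_schema("zenodo")
catalog.create_metadata_schema_field("zenodo", "zenodo_doi", "$.zenodo.doi")
catalog.create_metadata_schema_field(
    "zenodo", "zenodo_published_date", "$.zenodo.published-date")

catalog.create_data_product(
    "dp-1", {"zenodo": {"doi": "10.0000/example.1", "published-date": "2022-06-01"}})
catalog.create_data_product(
    "dp-2", {"zenodo": {"doi": "10.0000/example.2", "published-date": "2023-01-10"}})
catalog.create_data_product("dp-3", {"instrument": "placeholder"})  # no schema

catalog.add_data_product_to_metadata_schema("dp-1", "zenodo")
catalog.add_data_product_to_metadata_schema("dp-2", "zenodo")

# Range query on the published date (ISO dates compare correctly as strings).
recent = catalog.search("zenodo", "zenodo_published_date",
                        lambda d: d >= "2023-01-01")
```

Here "dp-3" is never added to the "zenodo" schema, so it is simply not visible 
to queries on the zenodo fields.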

> Also, I am curious to know what motivated you to keep the metadata as a json 
> (other than PG's json indexing) ?

Different scientific domains have their own domain-specific metadata that they 
want to be able to store and query. So we need a schemaless mechanism for 
storing such metadata.  For a relational database there are a couple of 
approaches to storing such metadata. One is a table holding key-value pairs; 
another is to store a JSON document with the values. The JSON approach has the 
advantage that the metadata need not be flat but can be hierarchical. We also 
want to support search over the data catalog's metadata, and there are many 
good JSON document-oriented search solutions, such as MongoDB, Solr, and 
Elasticsearch. PostgreSQL's JSON support likewise makes it possible to search 
efficiently over a JSON column.
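
A small Python sketch of the two storage shapes may help. The function name, 
the dotted-key convention for flattening, and the sample values are assumptions 
made for illustration only.

```python
import json

# Placeholder hierarchical metadata for one data product.
metadata = {"zenodo": {"doi": "10.0000/example.1",
                       "published-date": "2022-06-01"}}


def to_key_value_rows(product_id, metadata, prefix=""):
    """Flatten nested metadata into (product_id, key, value) rows."""
    rows = []
    for key, value in metadata.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Hierarchy has to be encoded into the key, e.g. "zenodo.doi".
            rows.extend(to_key_value_rows(product_id, value, f"{full_key}."))
        else:
            rows.append((product_id, full_key, str(value)))
    return rows


# Approach 1: one row per key in a key-value table.
kv_rows = to_key_value_rows("dp-1", metadata)

# Approach 2: one JSON document per data product, nesting preserved as-is.
json_column = json.dumps(metadata)
```

With the key-value shape, the nesting survives only as a naming convention in 
the key, whereas the JSON document round-trips back to the original structure.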

> Will be great if you could convert the design document into a google doc -- 
> it is easy to provide feedback in the google doc format.

That's funny because the attached PDF is an export from the google doc where I 
wrote the design document, so yeah it will be very easy to provide a google 
doc. I'll provide a link in a followup email. I only went with the approach of 
exporting it to PDF and discussing on the mailing list because that seems more 
in line with the Apache way of discussing things in the open on public mailing 
lists. If no one has a problem with it, we can move the feedback to the google 
doc.


> When i try to access 
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.


Sorry about that. Here's the URL: 
https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json

Thanks again for taking time to critique this design, Amila. I appreciate your 
feedback.

Thanks,

Marcus


[1] 
https://github.com/apache/airavata/blob/master/thrift-interface-descriptions/data-models/replica-catalog-models/replica_catalog_models.thrift#L58
[2] https://about.zenodo.org/

> On Jan 18, 2023, at 12:26 PM, Thejaka Amila J Kanewala wrote:
> 
> Hi Marcus,
> 
> Sorry for my lack of knowledge on this.
> 
> Just for my understanding, could you