Hi Amila,

Thanks for taking a look. No apologies necessary; I don't think I covered enough background.

The term "data product" is borrowed from Airavata's replica catalog. [1] In Airavata, a data product is typically a user-provided input file or an application-generated output file. There are two types of data products, "FILE" and "COLLECTION", and data products can contain other data products, roughly mapping to a POSIX directory structure, but not limited to such. For Cybershuttle, we also need input and output file registration, but there are other sources of data that need to be captured, such as instrument-generated data.

A "data product" is distinct from its "replica locations", a distinction also borrowed from Airavata's replica catalog. The data catalog only knows about the metadata of a dataset, not where and how it could be accessed.

> In relation to the "Data product", when we do "createMetadataSchema(..., name
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?

No, "createDataProduct(DataProduct dataProduct)" is used to create a data product. createMetadataSchema() creates a named reference to a schema for metadata. So in this example, "smilesdb" is a named metadata schema. A named metadata schema doesn't mean much until fields are added to it. Once fields are added, a data product can then be added to the metadata schema via addDataProductToMetadataSchema(). This indicates to the data catalog that a subset of this data product's metadata adheres to the "smilesdb" metadata schema. That is, for every field that is defined for the "smilesdb" metadata schema, this data product should have that field in its metadata.

An example may be helpful. Let's say we create a metadata schema named "zenodo" [2] and we add some fields to it:

- one called "zenodo_doi" with a JSON path of "$.zenodo.doi".
- one called "zenodo_published_date" with a JSON path of "$.zenodo.published-date".
- etc.

Then let's say we have some data products.

Some of the data products have a zenodo field in their JSON metadata:

    {
      ...,
      "zenodo": {
        "doi": "...",
        "published-date": "..."
      },
      ...
    }

For such data products, we can call addDataProductToMetadataSchema() to add them to the "zenodo" metadata schema. Now, for the data products added to the "zenodo" metadata schema, the data catalog can support queries on the "zenodo_doi" and "zenodo_published_date" fields. We can also, for example, add an index on "zenodo_published_date" to support range queries.

A data product's metadata might adhere to more than one metadata schema, or to none at all.

> Also, I am curious to know what motivated you to keep the metadata as a json
> (other than PG's json indexing) ?

Different scientific domains have their own domain-specific metadata that they want to be able to store and query on, so we need a schemaless mechanism for storing such metadata. For a relational database there are a couple of approaches to storing such metadata. One is a table holding key-value pairs; another is to store a JSON document with the values. The JSON approach has the advantage that the metadata need not be flat but can be hierarchical. We also want to support search over the data catalog's metadata, and there are many good JSON document-oriented search solutions, such as MongoDB, Solr, and Elasticsearch. And of course PostgreSQL's JSON support also makes it possible to efficiently search over a JSON column there.

> Will be great if you could convert the design document into a google doc --
> it is easy to provide feedback in the google doc format.

That's funny, because the attached PDF is an export from the google doc where I wrote the design document, so yes, it will be very easy to provide a google doc. I'll provide a link in a follow-up email. I only went with the approach of exporting it to PDF and discussing on the mailing list because that seems more in line with the Apache way of discussing things in the open on public mailing lists.
If no one has a problem with it, we can move the feedback to the google doc.

> When i try to access
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.

Sorry about that. Here's the URL:
https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json

Thanks again for taking time to critique this design, Amila. I appreciate your feedback.

Thanks,

Marcus

[1] https://github.com/apache/airavata/blob/master/thrift-interface-descriptions/data-models/replica-catalog-models/replica_catalog_models.thrift#L58
[2] https://about.zenodo.org/

> On Jan 18, 2023, at 12:26 PM, Thejaka Amila J Kanewala
> <thejaka.am...@gmail.com> wrote:
>
> Hi Marcus,
>
> Sorry for my lack of knowledge on this.
>
> Just for my understanding, could you please define what a "data product" is ?
> -- I can see that from the schema diagram it has an id and has a parent-child
> relationship, but I would like to understand functionally what a "data
> product" is.
>
> In relation to the "Data product", when we do "createMetadataSchema(..., name
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?
> Also, I am curious to know what motivated you to keep the metadata as a json
> (other than PG's json indexing) ?
>
> Cosmetic:
> Will be great if you could convert the design document into a google doc --
> it is easy to provide feedback in the google doc format.
> When i try to access
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.
>
> Thanks.
> Best Regards,
> Thejaka Amila Kanewala, PhD
> https://github.com/thejkane/agm
> http://valagamba.net/
>
>
> On Tue, Jan 17, 2023 at 9:42 AM Christie, Marcus Aaron <machr...@iu.edu>
> wrote:
> Hi All,
>
> I've attached a design document for the search API of the redesigned Data
> Catalog and I'm looking for some feedback on it.
>
> Some context: for the Cybershuttle project, we're creating a redesigned data
> catalog to store metadata about directories and files that may come from
> several sources: user provided, instrument generated, generated as output
> from a computation. Metadata about these data products may also come from
> several sources.
>
> The high-level requirements are:
>
> - support searching and filtering for data products using schemaless metadata
> - supports bursts of writes, for example when scanning and registering all of
>   the files in a directory
> - basic CRUD operations on data products, including bulk operations
> - capture parent/child relationships (i.e., directory, sub-directory, file
>   relationships) and allow querying based on these relationships
> - basic CRUD operations on data product's metadata
>
> Most of the basic CRUD operations are omitted from the design document. The
> design document focuses on the search and querying API.
>
> This redesign builds on Airavata's Replica Catalog and the DRMS Resource
> Service [1] in airavata-data-lake.
>
> The main difference in this design is that it uses the built-in JSON querying
> and indexing capabilities of PostgreSQL. The goal is to support whatever
> metadata is available as long as it is in JSON format and make it efficiently
> searchable and filterable. Also, to make the API more developer friendly, the
> API supports querying via SQL (which will be transformed to the actual
> backend query using Apache Calcite).
>
> Your feedback is most welcome.
>
> Sincerely,
>
> Marcus
>
> [1] https://github.com/apache/airavata-data-lake/blob/master/data-resource-management-service/drms-stubs/src/main/proto/resource/DRMSResourceService.proto