Hello all,

I am unsure of where to post my questions, so I apologise for the long 
message; if anyone can point me to a more suitable forum, that would be great.
My questions are largely about schema, storage, and partitioning for Parquet 
files and DuckDB.

I am building an institute-wide omics data visualisation platform (similar in 
spirit to CellxGene), but designed to support independent research groups, and 
I am considering a DuckLake-like approach.

Core stack:

  * Parquet for long-term immutable data storage
  * DuckDB as the query engine; also used to store project metadata
  * Arrow interop for zero-copy where possible (to be fed into R Shiny)

Datasets are contributed by many groups in mixed formats (e.g. CSV, .rds, 
.h5ad), and I am working under a few constraints:

  * inconsistent schemas
  * large datasets (thousands to tens of thousands of genes and cells)
  * interactive queries that require filtering, sorting, subsetting etc. 
without loading the full data into memory

I plan to develop a Parquet conversion and validation pipeline and have 
considered the following tables:

  * genes.parquet             # layered w/ proteins.parquet?
  * counts.parquet            # counts matrices
  * expression.parquet        # analysis (DEG) results
  * cells.parquet             # chunked (e.g. cells 0–4999)
  * embeddings.parquet        # multilayered with PCA, UMAP, t-SNE
  * dataset_metadata.parquet  # unstructured, string format or store in DuckDB?
  * qc.parquet                # data summaries and aggregates?
  * gene_sets.parquet         # optional?
  * QUERY_HASH.parquet        # query results cache?

Questions:

  1. In this context, do users typically normalise data towards a specific 
structure, or allow arbitrary schemas and then map via DuckDB views or lookup 
configs/semantic mapping layers?


  2. Since I will be aggregating/precomputing data during ingestion (e.g. QC 
summaries), is it best to store these as Parquet or as database tables for 
Shiny to access?


  3. Is it also best to store gene sets as cached queries, within the DB, or 
as separate Parquet files?


  4. Given Parquet’s columnar nature:

  * Is it generally better to model expression matrices in long format rather 
than wide format?
  * Are there recommended hybrid approaches (e.g. chunked blocks per gene or 
per cell)?
  * How are multiple layers of the same data typically stored? (i.e. separate 
files, columns, or tables?)

Given the interactive workloads, I am considering DuckDB’s hive-style 
partitioning to improve performance, partitioning the top-level folder to 
encode common query filters (dataset_id, modality/omic_type, contrast, 
organism). Does this make logical sense? For example:

```
…/{DATASET_ID}/parquet_files/
└── organism=human/
    └── modality=transcriptomics/
        └── dataset_id=dataset_001/
            └── contrast=treated_vs_control/
                └── results/
                    ├── expression.parquet
                    └── genes.parquet
```

Many thanks,

Dammy Shittu



Data Research Assistant

Core Informatics Team @ UK Dementia Research Institute (UK DRI)

---------------------------------------------------------------------

e: [email protected]    |     [email protected]

a: UK DRI, UCL Queen Square Institute of Neurology, Queen Square, London WC1N 
3BG

w: https://www.ukdri.ac.uk/centres/ucl



As I work flexible hours, I do not expect replies outside of your own normal 
working hours.
