Hello all,
I am unsure of where to post my questions, so I apologise for the long
message; if anyone can point me to the right forum, that would be great.
My questions are largely about schema, storage, and partitioning for Parquet
files and DuckDB.
I am building an institute-wide omics data visualisation platform (similar in
spirit to CellxGene) but designed to support independent research groups and am
considering a DuckLake-like approach.
Core stack:
* Parquet for long-term immutable data storage
* DuckDB as the query engine, also used to store project metadata
* Arrow interop for zero-copy where possible (to be fed into R Shiny)
Datasets are contributed by many groups in mixed formats (e.g. CSV, .rds,
.h5ad), and I have a few constraints:
* inconsistent schemas
* large datasets (1,000s–10,000s of genes and cells)
* interactive queries that require filtering, sorting, subsetting, etc.
without loading the full data into memory
I plan on developing a conversion-to-Parquet/validation pipeline and have
considered the following tables:
* genes.parquet              # layered w/ proteins.parquet?
* counts.parquet             # counts matrices
* expression.parquet         # analysis (DEG) results
* cells.parquet              # chunked (e.g. cells 0–4999)
* embeddings.parquet         # multilayered with PCA, UMAP, t-SNE
* dataset_metadata.parquet   # unstructured; string format, or store in DuckDB?
* qc.parquet                 # data summaries and aggregates?
* gene_sets.parquet          # optional?
* QUERY_HASH.parquet         # query-results cache?
Questions:
1. In this context, do users typically normalise data towards a specific
structure, or allow arbitrary schemas and then map them via DuckDB views or
lookup configs/semantic mapping layers?
2. Since I will be aggregating/precomputing data during ingestion (e.g. QC
summaries), is it best to store the results as Parquet or as database tables
for Shiny to access?
3. Is it best to store gene sets as cached queries, within the DB, or as
separate Parquet files?
4. Given Parquet's columnar nature:
* Is it generally better to model expression matrices in long format rather
than wide format?
* Are there recommended hybrid approaches (e.g. chunked blocks per gene or
per cell)?
* How are multiple layers of the same data typically stored (i.e. separate
files, columns, or tables)?
Given the interactive workloads, I am considering DuckDB's hive-style
partitioning to improve performance: partitioning the top-level folder to
encode common query filters (dataset_id, modality/omic_type, contrast,
organism). Does this make logical sense? For example:
```
…/{DATASET_ID}/parquet_files/
└── organism=human/
└── modality=transcriptomics/
└── dataset_id=dataset_001/
└── contrast=treated_vs_control/
├── results/
│ ├── expression.parquet
│ ├── genes.parquet
```
Many thanks,
Dammy Shittu
Data Research Assistant
Core Informatics Team @ UK Dementia Research Institute (UK DRI)
---------------------------------------------------------------------
e: [email protected] | [email protected]
a: UK DRI, UCL Queen Square Institute of Neurology, Queen Square, London WC1N
3BG
w: https://www.ukdri.ac.uk/centres/ucl
As I work flexible hours, I do not expect replies outside of your own normal
working hours.