Summary: An interface to vector databases for embeddings Requires: emacs-29.1, plz-0.8, pg-0.56 Website: https://github.com/ahyatt/vecdb Maintainer: Andrew Hyatt <ahy...@gmail.com> Author: Andrew Hyatt <ahy...@gmail.com>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ VECDB: VECTOR SEARCH LIBRARY FOR EMACS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1 Introduction ══════════════ The `vecdb' package provides an interface to a vector database, where vectors are vecdbdings representing pieces of text. These databases enable "semantic search", which is a powerful way to search over meaning. This kind of search needs specialized storage and retrieval. This package doesn't provide end-user functionality on its own; it is designed to be used in other packages that need semantic search. The package does not provide vecdbdings, that can be done with the [llm] package, or any source of vecdbdings. [llm] <https://github.com/ahyatt/llm> 2 Configuring the collection ════════════════════════════ There are two concepts that together define a collection database of vecdbdings: the /provider/, and the /collection/. The provider is what kind of backend we are using, right now either `chroma', or `qdrant'. This is a struct defined by the exact provider you want to use. The collection is, for that provider, what exact database is getting used, with each collection having its own separate data. Collections must be created before being used. The collection is defined by the struct `vecdb-collection' which has a `name' (used to identify the collection), `vector-size', and `payload-fields'. The `vector-size' will be based on the size of the vecdbding vector from your provider. 1536 is what Open AI uses. `payload-fields' is an alist of fields and their types that defines other data fields that are inserted and retrieved when search happens, and can be queried on as well (eventually, in a future iteration of the package). An example, putting it all together, is: ┌──── │ (defvar my-vecdb-provider (make-vecdb-provider :api-key my-qdrant-api-key :url my-qdrant-url)) │ (defvar my-vecdb-collection (make-vecdb-collection :name "my test collection" :vector-size 1536 :payload-fields (('my-id . 'string)))) └──── The provider will be supplied by the end-user, specifying how they want things stored, and any data necessary for that storage and retrieval to function. The collection is typically partially supplied by the application, with the possible exception of vecdbding size, which may be dependent on the exact vecdbding provider they are using. Collections must be created before they can be used with `vecdb-create', and `vecdb-exists' can return whether the collection exists. ┌──── │ (unless (vecdb-exists my-vecdb-provider my-vecdb-collection) │ (vecdb-create my-vecdb-provider my-vecdb-collection)) └──── They can also be deleted with `vecdb-delete'. 3 Adding and replacing data ═══════════════════════════ Before data is queried, it must be added. This is done via a batch operation on a group of data, `vecdb-upsert-items'. This either creates an item in the collection or replaces it, based on the `id' of the item. `id' should be an integer or a string UUID. Seeing as how emacs does not provide a UUID library, probably an integer is the best choice. ┌──── │ (vecdb-upsert-items my-vecdb-provider my-vecdb-collection │ (list (make-vecdb-item │ :id 91 │ :vector [0.1 0.2 0.3 0.4] │ :payload '(:my-id "235913926")))) └──── These can be deleted with `vecdb-delete-item' and retrieved by ID with `vecdb-get-item'. IDs used in `vecdb' *must* be `uint64' values. If you have another ID you need to use to tie it together with other storage, that should go into the `payload'. Also, each item passed in must set the same payload fields. 4 Querying data ═══════════════ Querying the database can be done with `vecdb-search-by-vector', passing it a vector and optionally a number of results to return (10 is the default). ┌──── │ (vecdb-search-by-vector my-vecdb-provider my-vecdb-collection [0.3 0.1 0.5 -0.9] 20) └──── This will return the specifies number of `vecdb-item' structs, with the payloads they were stored with. 5 Providers ═══════════ 5.1 qdrant ────────── [qdrant] is an open source vector database that concentrates mostly on running in the cloud, but can be run locally with a docker container. They provide a free tier for your database in the cloud that may be garbage collected after a period of inactivity. A qdrant provider is defined like: ┌──── │ (defvar my-vecdb-provider (make-vecdb-qdrant-provider :api-key my-qdrant-api-key :url my-qdrant-url)) └──── Substitute `my-qdrant-api-key' with your key, and `my-qdrant-url' is the URL of the server that is used to serve your data. This will be unique to your collection in the cloud, or a local URL for docker. [qdrant] <https://qdrant.tech/> 5.2 chroma ────────── [chroma] is an open source Python-centric vector database. It can run as a server locally, or offers paid services to host in the cloud. Currently this library only supports local running. If running locally, before use, you must run `chroma run' to start the server. The chroma provider has two additional divisions of data above the collection, and these are specified in the provider itself: the /tenant/ and the /database/. These will both default to `"default"', but can be specifed. Because the chroma provider is local, my default, no configuration is needed: ┌──── │ (defvar my-chroma-provider (make-vecdb-chroma-provider)) └──── However, the full set of options, here demonstrating the equivalent settings to the defaults are: ┌──── │ (defvar my-chroma-provider (make-vecdb-chroma-provider │ :binary "chroma" │ :url "http://localhost:8000" │ :tenant "default" │ :database "default")) └──── [chroma] <https://www.trychroma.com/> 5.3 Postgres with pgvector ────────────────────────── The popular database Postgres has an extension that allows it to have vector database functionality, [pgvector]. This needs the `pg-el' library. A provider defines a database, and the collection will define a table with the collection name in that database. For example, ┌──── │ (defvar my-postgres-provider (make-vecdb-psqlprovider :dbname "mydatabase" :username "myuser")) └──── This also takes an optional password as well. For now, this just uses localhost as a default. [pgvector] <https://github.com/pgvector/pgvector>