[I] Pinot dev docs [pinot]

via GitHub Wed, 06 Nov 2024 06:20:25 -0800


gortiz opened a new issue, #14401:
URL: https://github.com/apache/pinot/issues/14401

# Pinot dev docs

I've recently read the [Velox Developer
Guide](https://facebookincubator.github.io/velox/develop.html) and I really
think that it would be super useful to have something like that.
Quoting that page:

> This guide is intended for Velox contributors and developers of
Velox-based applications.

That is exactly what I think we need for Pinot.
We should have a developer guide that centralizes all the
information that a developer needs to contribute to
Pinot.
That includes how to build Pinot, how to run Pinot, how to write tests, how
to write documentation, etc. but also
information on the design choices that were made, explaining key classes and
concepts, etc.

## Current state

We already have some developer documentation for Pinot, but it is spread
across many different places and sometimes
the information is outdated or incomplete.

The most important source of developer documentation is the
[Contribution Guidelines page in User
documentation][user-contribution-guidelines].
This page is focused on the process of contributing to Pinot, but it also
has some information on how to build Pinot.

There are other sources of information like:
* The
[README.md](https://github.com/apache/pinot/blob/master/README.md#building-pinot)
file on the GitHub repository.
* Some PRs or GH issues that have some design discussions in the comments.
* [The design docs in the user
documentation](https://docs.pinot.apache.org/developers/design-documents),
although that page hasn't been updated since 2022.

[user-contribution-guidelines]:
https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines

## Developer documentation we need

There are many things that we need to document for developers.
For example, the following are questions I had to answer in the past months:

* Explaining the datatypes in Pinot. How are they stored? How are they
converted? For example see
[types in
Velox](https://facebookincubator.github.io/velox/develop/types.html).
* Explaining type validation in Pinot. For example, MSQ does it in a
different way than how SSQ does it.
* Explaining key implementation differences between MSQ and SSQ.
* Explaining how queries are parsed, validated, and optimized, how a broker
decides which server executes which parts
of the query, how these plans are sent to the servers, etc.
* Explaining how different join types are implemented in Pinot.
* Explaining that queries need to deal with the fact that segments may not
be refreshed and therefore contain
different indexes than the ones indicated in the latest table config.
Also explaining how we deal with that.

I don't think we would need to write all this documentation from scratch and
explain every detail.
That could change very often and in the worst case it would end up being a
translation of what the code does but in
English.
Instead I think we should explain the key points (ideally with diagrams) and
refer to the important classes and methods
in the codebase.

### Example: Timestamp indexes

Below is roughly a page of important information about timestamp indexes
in Pinot that I learned by solving issues and reading the code.
I would have loved to have this information
synthesized in a single place at the time.
This is the kind of information our developers need and that currently has
no proper home.

<details>
<summary>Click here to see the example</summary>

Timestamp indexes are a key feature in Pinot, but they are very different
from other indexes.
Although they are called indexes in the user documentation, some committers
call them "syntactic sugar" because they
are not indexes in the codebase.
Instead, when the user configures a timestamp _index_ in their TableConfig:
1. A new column is created for each granularity of the timestamp _index_
(one for days, one for months, etc).
2. A range index is created for each of these columns.
3. Whenever a query is received, _the broker_ rewrites the query to use
these columns instead of the original
timestamp column if the query has a filter using one of the
granularities.

Some of these steps are described in the [timestamp page of the user
documentation][timestamp-index], but not all of
them.

[timestamp-index]:https://docs.pinot.apache.org/basics/indexing/timestamp-index
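
As a rough illustration of step 1, the sketch below derives the extra column
names for a timestamp _index_. It is written in Python for brevity and is not
Pinot code: the function names are hypothetical, and the `$column$GRANULARITY`
pattern is inferred from the `$event_time$YEAR` example used later in this
section.

```python
from typing import List


def derived_column_name(column: str, granularity: str) -> str:
    # Hypothetical helper: follows the `$event_time$YEAR` pattern from the text.
    return f"${column}${granularity.upper()}"


def derived_columns(column: str, granularities: List[str]) -> List[str]:
    # One extra column per configured granularity (step 1 of the list above).
    return [derived_column_name(column, g) for g in granularities]


if __name__ == "__main__":
    print(derived_columns("event_time", ["DAY", "MONTH", "YEAR"]))
```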

Specifically, there is one key point on timestamp _indexes_.
All other column indexes optimize queries at the segment level (in the
servers) by changing the way `FilterPlanNode` instances are
transformed into different Operators (in
`FilterPlanNode.constructPhysicalOperator`).
Meanwhile, timestamp _indexes_ optimize queries at the broker level (as
explained above).
The broker analyzes the query to look for all usages of the original column
that can be optimized.
For example if there is a timestamp index on `event_time` that includes the
`YEAR` granularity,
the broker marks in the meta-information that any call to `dateTrunc('YEAR',
event_time)` can be rewritten as
`$event_time$YEAR`.
Brokers do this in
`BaseSingleStageBrokerRequestHandler.handleExpressionOverride()`, setting the
`expressionOverrideMap`
attribute of `QueryConfig`.
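
To make the broker side concrete, here is a minimal Python sketch of what such
an override map could contain. It is an illustration under stated assumptions,
not the actual Pinot implementation: the function name is hypothetical, and
plain strings stand in for whatever expression representation the real code
uses.

```python
from typing import Dict, List


def build_expression_overrides(column: str, granularities: List[str]) -> Dict[str, str]:
    # Hypothetical stand-in for the broker-side map: each dateTrunc call on the
    # original column is mapped to the matching derived column.
    return {
        f"dateTrunc('{g}',{column})": f"${column}${g}"
        for g in granularities
    }


if __name__ == "__main__":
    for expr, target in build_expression_overrides("event_time", ["YEAR", "MONTH"]).items():
        print(expr, "->", target)
```

Note that the map only records which rewrites are possible; whether a rewrite
is actually applied is decided later, per segment.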

Then the server that receives the query verifies that the column
`$event_time$YEAR` exists in the segment
(remember that the segment may not be updated to the latest table config!)
and if it does, it rewrites the query
before `FilterPlanNode.constructPhysicalOperator` is called.
Servers do this when building `TableCache.TableConfigInfo`, which obtains
the information from
`QueryConfig.getExpressionOverrideMap()`.
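
The per-segment check described above can be sketched as follows. Again this
is illustrative Python rather than Pinot code: `apply_overrides` and its
parameters are hypothetical names, and expressions are modeled as plain
strings.

```python
from typing import Dict, List, Set


def apply_overrides(
    expressions: List[str],
    overrides: Dict[str, str],
    segment_columns: Set[str],
) -> List[str]:
    # Rewrite an expression only if the derived column exists in this segment;
    # otherwise keep the original expression so that stale segments still work.
    rewritten = []
    for expr in expressions:
        target = overrides.get(expr)
        if target is not None and target in segment_columns:
            rewritten.append(target)
        else:
            rewritten.append(expr)
    return rewritten


if __name__ == "__main__":
    overrides = {"dateTrunc('YEAR',event_time)": "$event_time$YEAR"}
    query = ["dateTrunc('YEAR',event_time)", "user_id"]
    # Segment refreshed after the timestamp index was configured.
    print(apply_overrides(query, overrides, {"event_time", "user_id", "$event_time$YEAR"}))
    # Segment built before the timestamp index was configured.
    print(apply_overrides(query, overrides, {"event_time", "user_id"}))
```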

At least this is how it works in the single-stage query engine (SSQ).
In the multi-stage query engine (MSQ) the broker doesn't rewrite the query and
therefore timestamp indexes are not used
(https://github.com/apache/pinot/pull/11409 tried to add support for it).

How do we add these new per-granularity columns?
They are added in `TimestampIndexUtils.applyTimestampIndex`, which modifies
the schema and the table config of the
table.

Do we store the modified schema and table config somewhere?
No, it is not stored in Zookeeper nor in the segment metadata.
Therefore it is very important for developers to know there are two kinds of
Schema and TableConfig objects:

* The ones that are not enriched. They are the ones that are persisted and
shown to the users.
* The ones that are used at runtime. They are the ones that have been
enriched with the timestamp indexes.

And given that the type system doesn't help (we don't have EnrichedSchema and
EnrichedTableConfig classes), developers
need to know when they are working with one or the other.
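
To illustrate the distinction, the Python sketch below shows what separate
types for the two kinds of schema could look like. Everything here is
hypothetical: Pinot has no `PersistedSchema` or `EnrichedSchema` classes, and
`enrich` only mimics the effect described above for
`TimestampIndexUtils.applyTimestampIndex`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PersistedSchema:
    # Hypothetical: the schema as stored and shown to users.
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class EnrichedSchema:
    # Hypothetical: the runtime schema, including derived timestamp columns.
    columns: Tuple[str, ...]


def enrich(schema: PersistedSchema, timestamp_indexes: Dict[str, List[str]]) -> EnrichedSchema:
    # Append one derived column per (column, granularity) pair.
    extra = tuple(
        f"${column}${granularity}"
        for column, granularities in timestamp_indexes.items()
        for granularity in granularities
    )
    return EnrichedSchema(schema.columns + extra)


if __name__ == "__main__":
    persisted = PersistedSchema(("event_time", "user_id"))
    print(enrich(persisted, {"event_time": ["DAY", "YEAR"]}))
```

With distinct types like these, a function that requires the runtime view
could declare `EnrichedSchema` in its signature instead of relying on the
caller to know which kind of object it holds.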

</details>

## Proposal

I propose creating a new site for developer documentation.
This site would be written using [MkDocs](https://www.mkdocs.org/) and the
code would be stored in the `docs` folder of
the Pinot GitHub repository.

I already opened https://github.com/apache/pinot/pull/14346, which includes
all the machinery to build the site and
a couple of (I hope useful) pages describing some key aspects of the
lifecycle of a multi-stage query in Pinot.

Creating a new site for the developer documentation is not the only
way to go.
In fact, it may remind you of the famous XKCD comic about standards:

![a new standard!](https://imgs.xkcd.com/comics/standards.png)

But there are reasons to think that the tools we use right now are not the
best for the job:
* The user documentation is written in GitBook, which is not the best tool
to write developer documentation.
Specifically, although GitBook supports markdown, it is biased to be
used through the GitBook website, whose UI
is very confusing (AFAIK no committer likes GitBook).
* The user documentation is focused on how to use Pinot, while the
developer documentation should be focused on how
Pinot works. It is ok to have some overlap and to link between them, but
it should be clear for readers which one
they are reading.
* The user documentation is external to the code repository, which makes
it harder to keep it in sync with the
codebase. For example, if a PR in the code changes a feature, it is
harder for both the committer and the reviewers
to remember to update the documentation in GitBook.
* Google Drive is good for discussing design documents while they are being
written, but it is not a good place to
store them. It is hard to search, hard to link to, hard to keep in sync
with the codebase, etc.

A dev site written in MkDocs solves all these issues.
Being written in markdown, it is easy to write and to review.
Being hosted in the code repository, it is easy to keep in sync with the
codebase.
It is also very easy to publish these pages.
There is a lot of information online on how to publish MkDocs pages in
different places, including GitHub Pages.
The [Apache Foundation has a
page](https://infra.apache.org/asfyaml-mkdocs.html) on how to publish MkDocs
pages in the
ASF infrastructure.

## Closing notes

This is probably not a high priority task, but I think it is a very
important one.
I think that having a centralized place for developer documentation would
make it easier for new contributors to start
contributing to Pinot and for current contributors to understand the
codebase better, which would make them more
productive and reduce the number of bugs.

Writing documentation is not a task that can be done in a single week, but
instead it is a task that should be done
incrementally and continuously.
The easier the tools are to use, the more likely it is that the
documentation will be written and maintained.

I would like to hear your thoughts on this proposal.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
