gortiz opened a new issue, #14401: URL: https://github.com/apache/pinot/issues/14401
# Pinot dev docs

I've recently read the [Velox Developer Guide](https://facebookincubator.github.io/velox/develop.html) and I really think it would be super useful to have something like that for Pinot. Quoting that page:

> This guide is intended for Velox contributors and developers of Velox-based applications.

That is exactly what I think we need for Pinot. We should have a developer guide that centralizes all the information a developer needs to contribute to Pinot. That includes how to build Pinot, how to run Pinot, how to write tests, how to write documentation, etc., but also information on the design choices that were made, explanations of key classes and concepts, etc.

## Current state

We already have some developer documentation for Pinot, but it is spread across many different places and sometimes the information is outdated or incomplete.

The most important source of developer documentation is the [Contribution Guidelines page in User documentation][user-contribution-guidelines]. This page is focused on the process of contributing to Pinot, but it also has some information on how to build Pinot.

There are other sources of information like:

* The [README.md](https://github.com/apache/pinot/blob/master/README.md#building-pinot) file on the GitHub repository.
* Some PRs or GH issues that have some design discussions in the comments.
* [The design docs in the user documentation](https://docs.pinot.apache.org/developers/design-documents), although that page hasn't been updated since 2022.

[user-contribution-guidelines]: https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines

## Developer documentation we need

There are many things that we need to document for developers. For example, the following are questions I had to answer in the past months:

* Explaining the datatypes in Pinot. How are they stored? How are they converted? For example see [types in Velox](https://facebookincubator.github.io/velox/develop/types.html).
* Explaining type validation in Pinot. For example, MSQ does it in a different way than SSQ does.
* Explaining key implementation differences between MSQ and SSQ.
* Explaining how queries are parsed, validated, and optimized, how a broker decides which server executes which parts of the query, how these plans are sent to the servers, etc.
* Explaining how different join types are implemented in Pinot.
* Explaining that queries need to deal with the fact that segments may not be refreshed and may therefore contain different indexes than the ones indicated in the latest table config. Also explaining how we deal with that.

I don't think we would need to write all this documentation from scratch and explain every detail. That could change very often and, in the worst case, it would end up being a translation into English of what the code does. Instead, I think we should explain the key points (ideally with diagrams) and refer to the important classes and methods in the codebase.

### Example: Timestamp indexes

Here I'm going to write around a page of important information I learned about timestamp indexes in Pinot by solving issues and reading the code. I would have loved to have this information synthesized in a single place at the time. This is the kind of information our developers may need and that we currently don't have a place for.

<details>
<summary>Click here to see the example</summary>

Timestamp indexes are a key feature in Pinot, but they are very different from other indexes. Although they are called indexes in the user documentation, some committers call them "syntactic sugar" because they are not indexes in the codebase. Instead, when the user configures a timestamp _index_ in their TableConfig:

1. A new column is created for each granularity of the timestamp _index_ (one for days, one for months, etc.).
2. A range index is created for each of these columns.
3. Whenever a query is received, _the broker_ rewrites the query to use these columns instead of the original timestamp column if the query has a filter using one of the granularities.

Some of these steps are described in the [timestamp page of the user documentation][timestamp-index], but not all of them.

[timestamp-index]: https://docs.pinot.apache.org/basics/indexing/timestamp-index

Specifically, there is one key point about timestamp _indexes_. All other column indexes optimize queries at the segment level (in the servers) by changing the way `FilterPlanNode`s are transformed into different Operators (in `FilterPlanNode.constructPhysicalOperator`). Meanwhile, timestamp _indexes_ optimize queries at the broker level (as explained above).

The broker analyzes the query to look for all usages of the original column that can be optimized. For example, if there is a timestamp index on `event_time` that includes the `YEAR` granularity, the broker marks in the meta-information that any call to `dateTrunc('YEAR', event_time)` can be rewritten as `$event_time$YEAR`. Brokers do this in `BaseSingleStageBrokerRequestHandler.handleExpressionOverride()`, setting the `expressionOverrideMap` attribute of `QueryConfig`.

Then the server that receives the query verifies that the column `$event_time$YEAR` exists in the segment (remember that the segment may not be updated to the latest table config!) and, if it does, rewrites the query before `FilterPlanNode.constructPhysicalOperator` is called. Servers do this when building `TableCache.TableConfigInfo`, which obtains the information from `QueryConfig.getExpressionOverrideMap()`.

At least this is how it works in the single-stage query engine (SSQ). In the multi-stage query engine (MSQ) the broker doesn't rewrite the query and therefore timestamp indexes are not used (for example, https://github.com/apache/pinot/pull/11409 tried to add support for it).

How do we add these new granularity columns?
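
Before answering that, here is a minimal sketch of the two-step rewrite described above, written as a language-agnostic illustration in Python. This is not Pinot code: the function names `derived_column`, `broker_expression_overrides` and `server_rewrite` are hypothetical, the query is treated as a plain string instead of a parsed expression tree, and only the `dateTrunc` pattern mentioned above is handled. It only tries to show the division of work: the broker proposes overrides, and each server applies an override only when the segment actually contains the derived column.

```python
from __future__ import annotations


def derived_column(column: str, granularity: str) -> str:
    """Name of the column materialized for one granularity, e.g. $event_time$YEAR."""
    return f"${column}${granularity.upper()}"


def broker_expression_overrides(timestamp_index: dict[str, list[str]]) -> dict[str, str]:
    """Broker side (hypothetical): map each optimizable expression to its derived column.

    `timestamp_index` maps a timestamp column to its configured granularities.
    """
    overrides = {}
    for column, granularities in timestamp_index.items():
        for granularity in granularities:
            expression = f"dateTrunc('{granularity.upper()}', {column})"
            overrides[expression] = derived_column(column, granularity)
    return overrides


def server_rewrite(query: str, overrides: dict[str, str], segment_columns: set[str]) -> str:
    """Server side (hypothetical): apply an override only if the segment has the column.

    A segment built before the timestamp index was configured lacks the derived
    column, so the original expression is kept for that segment.
    """
    for expression, column in overrides.items():
        if column in segment_columns:
            query = query.replace(expression, column)
    return query


overrides = broker_expression_overrides({"event_time": ["YEAR"]})
query = "SELECT COUNT(*) FROM events WHERE dateTrunc('YEAR', event_time) = 0"

# Segment reloaded with the latest table config: the derived column exists.
print(server_rewrite(query, overrides, {"event_time", "$event_time$YEAR"}))
# Stale segment: the derived column is missing, so the query is left untouched.
print(server_rewrite(query, overrides, {"event_time"}))
```

The point of the per-segment check is that the same query can be rewritten for one segment and left as-is for another, which is exactly the situation developers need to keep in mind.
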
They are added in `TimestampIndexUtils.applyTimestampIndex`, which modifies the schema and the table config of the table.

Do we store the modified schema and table config somewhere? No, they are stored neither in ZooKeeper nor in the segment metadata. Therefore it is very important for developers to know there are two kinds of Schema and TableConfig objects:

* The ones that are not enriched. They are the ones that are persisted and shown to the users.
* The ones that are used at runtime. They are the ones that have been enriched with the timestamp indexes.

And given that the type system doesn't help (we don't have EnrichedSchema and EnrichedTableConfig classes), developers need to know when they are working with one or the other.

</details>

## Proposal

I propose to create a new site for developer documentation. This site would be written using [MkDocs](https://www.mkdocs.org/) and the code would be stored in the `docs` folder of the Pinot GitHub repository.

I already opened https://github.com/apache/pinot/pull/14346, which includes all the machinery to build the site and a couple of (I hope useful) pages describing some key aspects of the lifecycle of a multi-stage query in Pinot.

Having a new site for the developer documentation is not the only way to go. In fact, it may remind you of the famous XKCD comic about standards.

But there are reasons to think that the tools we use right now are not the best for the job:

* The user documentation is written in GitBook, which is not the best tool for writing developer documentation. Specifically, although GitBook supports markdown, it is biased towards being used through the GitBook website, whose UI is very confusing (AFAIK no committer likes GitBook).
* The user documentation is focused on how to use Pinot, while the developer documentation should be focused on how Pinot works. It is ok to have some overlap and to link between them, but it should be clear to readers which one they are reading.
* The user documentation is external to the code repository, which makes it harder to keep it in sync with the codebase. For example, if a PR in the code changes a feature, it is harder for both the committer and the reviewers to remember to update the documentation in GitBook.
* Google Drive is good for discussing design documents while they are being written, but it is not a good place to store them. It is hard to search, hard to link to, hard to keep in sync with the codebase, etc.

This dev site written in MkDocs solves all these issues. Being written in markdown, it is easy to write and to review. Being hosted in the code repository, it is easy to keep in sync with the codebase.

It is also very easy to publish these pages. There is a lot of information online on how to publish MkDocs pages in different places, including GitHub Pages. The [Apache Foundation has a page](https://infra.apache.org/asfyaml-mkdocs.html) on how to publish MkDocs pages in the ASF infrastructure.

## Closing notes

This is probably not a high priority task, but I think it is a very important one. I think that having a centralized place for developer documentation would make it easier for new contributors to start contributing to Pinot and for current contributors to understand the codebase better, which would make them more productive and reduce the number of bugs.

Writing documentation is not a task that can be done in a single week; it is a task that should be done incrementally and continuously. The easier the tools are to use, the more likely it is that the documentation will be written and maintained.

I would like to hear your thoughts on this proposal.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
