gortiz opened a new issue, #14401:
URL: https://github.com/apache/pinot/issues/14401

   # Pinot dev docs
   
   I've recently read the [Velox Developer 
Guide](https://facebookincubator.github.io/velox/develop.html) and I really 
   think that it would be super useful to have something like that.
   Quoting that page:
   
   > This guide is intended for Velox contributors and developers of 
Velox-based applications.
   
   That is exactly what I think we need for Pinot.
   We should have a developer guide that centralizes all the information a developer needs to contribute to Pinot.
   That includes how to build Pinot, how to run it, how to write tests and documentation, etc., but also
   information on the design choices that were made and explanations of key classes and concepts.
   
   ## Current state
   
   We already have some developer documentation for Pinot, but it is spread 
across many different places and sometimes
   the information is outdated or incomplete.
   
   The most important source of developer documentation is the 
   [Contribution Guidelines page in User 
documentation][user-contribution-guidelines].
   This page is focused on the process of contributing to Pinot, but it also 
has some information on how to build Pinot.
   
   There are other sources of information like:
     * The [README.md](https://github.com/apache/pinot/blob/master/README.md#building-pinot) file in the GitHub repository.
     * Some PRs or GH issues that have design discussions in the comments.
     * [The design docs in the user documentation](https://docs.pinot.apache.org/developers/design-documents),
       although that page hasn't been updated since 2022.
   
   [user-contribution-guidelines]: 
https://docs.pinot.apache.org/developers/developers-and-contributors/contribution-guidelines
   
   ## Developer documentation we need
   
   There are many things that we need to document for developers.
   For example, the following are questions I had to answer in the past months:
   
     * Explaining the datatypes in Pinot. How are they stored? How are they 
converted? For example see
       [types in 
Velox](https://facebookincubator.github.io/velox/develop/types.html).
     * Explaining type validation in Pinot. For example, MSQ does it differently from SSQ.
     * Explaining the key implementation differences between MSQ and SSQ.
     * Explaining how queries are parsed, validated, and optimized, how a broker decides which server executes
       which parts of the query, how these plans are sent to the servers, etc.
     * Explaining how different join types are implemented in Pinot.
     * Explaining that queries need to deal with the fact that segments may not 
be refreshed and therefore contain 
       different indexes than the ones indicated in the latest table config. 
Also explaining how we deal with that.
   
   I don't think we would need to write all this documentation from scratch and 
explain every detail.
   That could change very often and in the worst case it would end up being a 
translation of what the code does but in
   English.
   Instead I think we should explain the key points (ideally with diagrams) and 
refer to the important classes and methods 
   in the codebase.
   
   ### Example: Timestamp indexes
   
   Below is about a page of important information I learned about timestamp indexes in Pinot by solving
   issues and reading the code. I would have loved to have this information synthesized in a single place at the time.
   This is the kind of information our developers need and that we currently have no place for.
   
   <details>
     <summary>Click here to see the example</summary>
   
   Timestamp indexes are a key feature in Pinot, but they are very different 
from other indexes.
   Although they are called indexes in the user documentation, some committers 
call them "syntactic sugar" because they
   are not indexes in the codebase.
   Instead, when the user configures a timestamp _index_ in their TableConfig:
     1. A new column is created for each granularity of the timestamp _index_ (one for days, one for months, etc.).
     2. A range index is created for each of these columns.
     3. Whenever a query is received, _the broker_ rewrites the query to use these columns instead of the original
        timestamp column if the query has a filter using one of the granularities.
   
   Some of these steps are described in the [timestamp page of the user 
documentation][timestamp-index], but not all of 
   them.
   
   
[timestamp-index]:https://docs.pinot.apache.org/basics/indexing/timestamp-index
   
   Specifically, there is one key point on timestamp _indexes_.
   All other column indexes optimize queries at the segment level (in the servers) by changing the way
   `FilterPlanNode` instances are transformed into different Operators (in `FilterPlanNode.constructPhysicalOperator`).
   Meanwhile, timestamp _indexes_ optimize queries at the broker level (as 
explained above).
   The broker analyzes the query to look for all usages of the original column 
that can be optimized.
   For example if there is a timestamp index on `event_time` that includes the 
`YEAR` granularity,
   the broker marks in the meta-information that any call to `dateTrunc('YEAR', 
event_time)` can be rewritten as
   `$event_time$YEAR`.
   Brokers do this in `BaseSingleStageBrokerRequestHandler.handleExpressionOverride()`, which sets the
   `expressionOverrideMap` attribute of `QueryConfig`.
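
As a rough sketch of the mapping described above (this is not Pinot's actual code), the override map can be pictured as a lookup from an expression to the derived column that replaces it. The class `BrokerOverrideSketch` and its methods are hypothetical names, plain strings are used only for illustration, and the `$<column>$<granularity>` naming is assumed from the `$event_time$YEAR` example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Pinot.
final class BrokerOverrideSketch {

  // Assumed naming pattern, taken from the `$event_time$YEAR` example.
  static String derivedColumn(String column, String granularity) {
    return "$" + column + "$" + granularity;
  }

  // Maps each `dateTrunc('<granularity>', <column>)` expression to its derived column.
  static Map<String, String> buildOverrideMap(String column, List<String> granularities) {
    Map<String, String> overrides = new LinkedHashMap<>();
    for (String granularity : granularities) {
      String expression = "dateTrunc('" + granularity + "', " + column + ")";
      overrides.put(expression, derivedColumn(column, granularity));
    }
    return overrides;
  }

  public static void main(String[] args) {
    Map<String, String> overrides = buildOverrideMap("event_time", List.of("DAY", "YEAR"));
    overrides.forEach((expression, column) -> System.out.println(expression + " -> " + column));
  }
}
```

The point of the sketch is that the broker only records which expressions may be replaced; whether a replacement is actually applied is decided later, per segment.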
   
   Then the server that receives the query verifies that the column 
`$event_time$YEAR` exists in the segment
   (remember that the segment may not be updated to the latest table config!) 
and if it does, it rewrites the query
   before `FilterPlanNode.constructPhysicalOperator` is called.
   Servers do this when building `TableCache.TableConfigInfo`, which obtains the information from
   `QueryConfig.getExpressionOverrideMap()`.
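
The per-segment check described above can be sketched as follows. This is a simplified illustration, not Pinot's implementation: `ServerOverrideSketch` and `resolve` are hypothetical names, and a set of column names stands in for the real segment metadata.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Pinot.
final class ServerOverrideSketch {

  // Returns the derived column when the segment actually contains it,
  // otherwise falls back to the original expression.
  static String resolve(String expression, Map<String, String> overrides, Set<String> segmentColumns) {
    String replacement = overrides.get(expression);
    if (replacement != null && segmentColumns.contains(replacement)) {
      return replacement;
    }
    return expression;
  }

  public static void main(String[] args) {
    Map<String, String> overrides = Map.of("dateTrunc('YEAR', event_time)", "$event_time$YEAR");
    Set<String> refreshedSegment = Set.of("event_time", "$event_time$YEAR");
    Set<String> staleSegment = Set.of("event_time");

    // Segment built with the latest table config: the derived column is used.
    System.out.println(resolve("dateTrunc('YEAR', event_time)", overrides, refreshedSegment));
    // Segment not refreshed yet: the original expression is kept.
    System.out.println(resolve("dateTrunc('YEAR', event_time)", overrides, staleSegment));
  }
}
```

The fallback branch is what keeps queries correct on segments that were built before the timestamp index was configured.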
   
   At least this is how it works in the single-stage query engine (SSQ).
   In the multi-stage query engine (MSQ) the broker doesn't rewrite the query and therefore timestamp indexes are
   not used (https://github.com/apache/pinot/pull/11409 tried to add support for it).
   
   How do we add these new granularity columns?
   They are added in `TimestampIndexUtils.applyTimestampIndex`, which modifies 
the schema and the table config of the 
   table.
   
   Do we store the modified schema and table config somewhere?
   No, it is stored neither in ZooKeeper nor in the segment metadata.
   Therefore it is very important for developers to know there are two kinds of 
Schema and TableConfig objects:
   
     * The ones that are not enriched. They are the ones that are persisted and 
shown to the users.
     * The ones that are used at runtime. They are the ones that have been 
enriched with the timestamp indexes.
   
   And given that the type system doesn't help (we don't have `EnrichedSchema` and `EnrichedTableConfig` classes),
   developers need to know which one they are working with.
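
To make the point concrete, here is a purely hypothetical sketch of what type-level separation could look like. Pinot does not have these classes; `PersistedSchema`, `EnrichedSchema` and every method below are invented for illustration, and a schema is reduced to a list of column names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Pinot has no such classes.
final class SchemaTypesSketch {

  // The schema as persisted and shown to users.
  static final class PersistedSchema {
    final List<String> columns;

    PersistedSchema(List<String> columns) {
      this.columns = List.copyOf(columns);
    }
  }

  // The runtime schema, which also contains the derived timestamp columns.
  static final class EnrichedSchema {
    final List<String> columns;

    private EnrichedSchema(List<String> columns) {
      this.columns = List.copyOf(columns);
    }

    // The only way to obtain an EnrichedSchema is to enrich a persisted one.
    static EnrichedSchema enrich(PersistedSchema persisted, String column, List<String> granularities) {
      List<String> columns = new ArrayList<>(persisted.columns);
      for (String granularity : granularities) {
        columns.add("$" + column + "$" + granularity);
      }
      return new EnrichedSchema(columns);
    }
  }

  // Accepts only the enriched type, so passing a PersistedSchema would not compile.
  static boolean hasColumn(EnrichedSchema schema, String column) {
    return schema.columns.contains(column);
  }

  public static void main(String[] args) {
    PersistedSchema persisted = new PersistedSchema(List.of("event_time", "user_id"));
    EnrichedSchema enriched = EnrichedSchema.enrich(persisted, "event_time", List.of("YEAR"));
    System.out.println(hasColumn(enriched, "$event_time$YEAR"));
  }
}
```

With separate types, code that expects derived columns cannot be handed the persisted form by accident; today that distinction has to be carried in the developer's head.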
   
   </details>
   
   ## Proposal
   
   I propose creating a new site for developer documentation.
   This site would be written using [MkDocs](https://www.mkdocs.org/) and the 
code would be stored in the `docs` folder of 
   the Pinot GitHub repository.
   
   I already opened https://github.com/apache/pinot/pull/14346, which includes all the machinery to build the site
   and a couple of (I hope useful) pages describing some key aspects of the lifecycle of a multi-stage query in Pinot.
   
   Creating a new site for the developer documentation is not the only way to go.
   In fact, it may remind you of the famous XKCD comic about standards:
   
   ![a new standard!](https://imgs.xkcd.com/comics/standards.png)
   
   But there are reasons to think that the tools we use right now are not the 
best for the job:
     * The user documentation is written in GitBook, which is not the best tool 
to write developer documentation.
       Specifically, although GitBook supports markdown, it is biased to be 
used through the GitBook website, whose UI
       is very confusing (AFAIK no committer likes GitBook).
     * The user documentation is focused on how to use Pinot, while the 
developer documentation should be focused on how
       Pinot works. It is ok to have some overlap and to link between them, but 
it should be clear for readers which one
       they are reading.
     * The user documentation is external to the code repository, which makes 
it harder to keep it in sync with the 
       codebase. For example, if a PR in the code changes a feature, it is 
harder for both the committer and the reviewers
       to remember to update the documentation in GitBook.
     * Google Drive is good for discussing design documents while they are being written, but it is not a good
       place to store them. It is hard to search, hard to link to, hard to keep in sync with the codebase, etc.
   
   This dev site written in MkDocs solves all these issues.
   Being written in markdown, it is easy to write and to review.
   Being hosted in the code repository, it is easy to keep in sync with the codebase.
   It is also very easy to publish these pages.
   There is a lot of information online on how to publish MkDocs pages in different places, including GitHub Pages.
   The [Apache Foundation has a page](https://infra.apache.org/asfyaml-mkdocs.html) on how to publish MkDocs pages
   in the ASF infrastructure.
   
   ## Closing notes
   
   This is probably not a high priority task, but I think it is a very 
important one.
   I think that having a centralized place for developer documentation would 
make it easier for new contributors to start
   contributing to Pinot and for current contributors to understand the 
codebase better, which would make them more
   productive and reduce the number of bugs.
   
   Writing documentation is not a task that can be done in a single week, but 
instead it is a task that should be done
   incrementally and continuously.
   The easier the tools are to use, the more likely it is that the 
documentation will be written and maintained.
   
   I would like to hear your thoughts on this proposal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

