[
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767264#comment-16767264
]
Francois Saint-Jacques edited comment on ARROW-4333 at 2/3/20 6:28 PM:
-----------------------------------------------------------------------
References I suggest you check:
- [The Design and Implementation of Modern Column-Oriented Database Systems,
2012|http://db.csail.mit.edu/pubs/abadi-column-stores.pdf]: This is a long
read, but worth the investment. It gives a broad overview of what columnar
databases are and what makes them fast.
- [MonetDB/X100: Hyper-Pipelining Query Execution,
2005|https://pdfs.semanticscholar.org/2e84/4872e32a4a4e94e229a9a9e70ac47d710252.pdf]:
A foundational paper on how to implement a fast query engine for analytics on
modern hardware.
- [Everything You Always Wanted to Know About Compiled and Vectorized Queries
But Were Afraid to Ask,
2018|http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf]: This is an update to
the paper you linked, which studies Compilation vs Vectorization.
- [Vectorization vs. Compilation in Query Execution,
2011|https://15721.courses.cs.cmu.edu/spring2019/papers/21-vectorization2/p5-sompolski.pdf]
- [Relaxed Operator Fusion for In-Memory Databases: Making Compilation,
Vectorization, and Prefetching Work Together At Last,
2017|https://pdfs.semanticscholar.org/1a21/509b67d3ed06cd7062f2f9b7e5b0b32a32e6.pdf]
- [Make the Most out of Your SIMD Investments: Counter Control Flow Divergence
in Compiled Query Pipelines,
2018|http://db.in.tum.de/~lang/papers/simd_divergence.pdf]: Discusses using
AVX-512 masked instructions.
Anything by:
- [Daniel J. Abadi|https://www.semanticscholar.org/author/Daniel-J.-Abadi/2254232],
part of the team that wrote C-Store, which became Vertica. He has written less
about columnar execution in recent years.
- [Peter Boncz|https://www.semanticscholar.org/author/Peter-A.-Boncz/1687211],
behind MonetDB/Vectorwise.
- [Thomas Neumann|https://www.semanticscholar.org/author/Thomas-Neumann/1706846],
behind [HyPer|https://hyper-db.de/], which was bought by Tableau.
- [Andrew Pavlo|https://www.semanticscholar.org/author/Andrew-Pavlo/1774210],
who teaches database courses at CMU.
Amazing video lectures from courses at CMU. You can ignore most of the storage
layer, concurrency, and transaction material. We're interested in the execution
engine, vectorization, and compilation.
* [https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQKwIAV3es0U6t]
* [https://www.youtube.com/playlist?list=PLSE8ODhjZXjbjOyrcqgE6_lCV6xvzffSN]
* [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY2xvwxuKjZT5qFH0sQga8_]
* [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY0GMWN4X8FIkYNfiu8_Wl9]
The [CMU 15-721 Advanced Database Systems
schedule|https://15721.courses.cs.cmu.edu/spring2019/schedule.html] is usually
a good source of papers.
Relevant code bases:
- [Impala|https://github.com/apache/impala/tree/master/be/src]
- [ClickHouse|https://github.com/yandex/ClickHouse/tree/master/dbms/src]
- [MapD|https://github.com/omnisci/mapd-core/tree/master]
- [Supersonic|https://github.com/google/supersonic]
- [Peloton|https://github.com/cmu-db/peloton]
> [C++] Sketch out design for kernels and "query" execution in compute layer
> --------------------------------------------------------------------------
>
> Key: ARROW-4333
> URL: https://issues.apache.org/jira/browse/ARROW-4333
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Micah Kornfield
> Priority: Major
> Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
> * Thread safety of kernels?
> * When should kernels allocate memory vs. expect preallocated memory? How to
> communicate requirements for a kernel's memory allocation?
> * How to communicate whether a kernel's execution is parallelizable
> across a ChunkedArray? How to determine if the order of execution across a
> ChunkedArray is important?
> * How to communicate when it is safe to re-use the same buffers as input
> and output to the same kernel?
> What does the threading model look like for the higher level of control?
> Where should synchronization happen?