[jira] [Comment Edited] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

Francois Saint-Jacques (Jira) Mon, 03 Feb 2020 10:29:15 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767264#comment-16767264
 ]


Francois Saint-Jacques edited comment on ARROW-4333 at 2/3/20 6:28 PM:
-----------------------------------------------------------------------

References I suggest you check:
 - [The Design and Implementation of Modern Column-Oriented Database Systems, 
2012|http://db.csail.mit.edu/pubs/abadi-column-stores.pdf]: This is a long 
read, but worth the investment. It gives a broad overview of what columnar 
databases are and what makes them fast.
 - [MonetDB/X100: Hyper-Pipelining Query Execution, 
2005|https://pdfs.semanticscholar.org/2e84/4872e32a4a4e94e229a9a9e70ac47d710252.pdf]:
 The foundational paper on how to implement a fast query engine for analytics 
on modern hardware.
 - [Everything You Always Wanted to Know About Compiled and Vectorized Queries 
But Were Afraid to Ask, 
2018|http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf]: This is an update to 
the paper you linked, which studies Compilation vs Vectorization.
 - [Vectorization vs. Compilation in Query Execution, 
2011|https://15721.courses.cs.cmu.edu/spring2019/papers/21-vectorization2/p5-sompolski.pdf]
 - [Relaxed Operator Fusion for In-Memory Databases: Making Compilation, 
Vectorization, and Prefetching Work Together At Last, 
2017|https://pdfs.semanticscholar.org/1a21/509b67d3ed06cd7062f2f9b7e5b0b32a32e6.pdf]
 - [Make the Most out of Your SIMD Investments: Counter Control Flow Divergence 
in Compiled Query Pipelines, 
2018|http://db.in.tum.de/~lang/papers/simd_divergence.pdf]: Discusses using 
AVX-512 masked instructions.

Anything by:
 - [Daniel J. 
Abadi|https://www.semanticscholar.org/author/Daniel-J.-Abadi/2254232]: part of 
the team that wrote C-Store, which became Vertica. He has written less about 
columnar execution in recent years.
 - [Peter Boncz|https://www.semanticscholar.org/author/Peter-A.-Boncz/1687211]: 
behind MonetDB/Vectorwise.
 - [Thomas Neumann|https://www.semanticscholar.org/author/Thomas-Neumann/1706846]: 
behind [HyPer|https://hyper-db.de/], which was acquired by Tableau.
 - [Andrew Pavlo|https://www.semanticscholar.org/author/Andrew-Pavlo/1774210]: 
teaches database courses at CMU.

Amazing video lectures from courses at CMU. You can ignore most of the storage 
layer, concurrency, and transaction content; we're interested in the execution 
engine, vectorization, and compilation.
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQKwIAV3es0U6t]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjbjOyrcqgE6_lCV6xvzffSN]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY2xvwxuKjZT5qFH0sQga8_]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY0GMWN4X8FIkYNfiu8_Wl9]

The [CMU 15-721 Advanced Database Systems 
schedule|https://15721.courses.cs.cmu.edu/spring2019/schedule.html] is usually 
a good source of papers.

Relevant code bases:
 - [Impala|https://github.com/apache/impala/tree/master/be/src]
 - [ClickHouse|https://github.com/yandex/ClickHouse/tree/master/dbms/src]  
 - [MapD|https://github.com/omnisci/mapd-core/tree/master]
 - [Supersonic|https://github.com/google/supersonic]
 - [Peloton|https://github.com/cmu-db/peloton]



> [C++] Sketch out design for kernels and "query" execution in compute layer
> --------------------------------------------------------------------------
>
>                 Key: ARROW-4333
>                 URL: https://issues.apache.org/jira/browse/ARROW-4333
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query 
> execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to 
> communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable 
> across a ChunkedArray?  How to determine if the order of execution across a 
> ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input 
> and output to the same kernel?
> What does the threading model look like for the higher level of control?  
> Where should synchronization happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
