alamb commented on issue #6782:
URL:
https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1630516792
Some ideas about the paper:
# Thesis:
We demonstrate that it is possible to get DuckDB-like performance using
standards like Parquet and Arrow as the interchange format, both inside and
outside of the engine. Previously, the conventional wisdom was that such
performance levels require a tightly integrated engine where the disk format,
in-memory layout, and processing engine are engineered in tandem to work well
together.
While the engineering effort required for such an engine is large, it is
achievable by leveraging the open source model and Apache governance model to
pool resources among users. Given the availability of fast, standards-based,
interoperable vectorized engines like DataFusion, we predict a Cambrian
explosion of new analytic systems that would not have been possible if each
had to create its own engine.
# Compare / Contrast Similar systems:
Velox (focuses on the execution engine side)
Apache Calcite (focuses on SQL and the frontend)
DataFusion has all the pieces of the toolkit (SQL frontend, logical plans,
and execution plans)
Also, Rust
DataFusion uses Arrow as the interchange format between operators, though
within operators other, non-standard formats are used (such as the Arrow
Row Format).