John Omernik created DRILL-5471:
-----------------------------------
Summary: Provide better documentation around Parquet, Options and
Integration with Arrow
Key: DRILL-5471
URL: https://issues.apache.org/jira/browse/DRILL-5471
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.10.0
Reporter: John Omernik
Apache Drill makes heavy use of the Apache Parquet file format. This is a
great thing. However, with the advent of Apache Arrow, and JIRAs like
https://issues.apache.org/jira/browse/DRILL-4455, understanding how Drill
integrates with the projects it depends on (Parquet/Arrow) is both important
and very opaque to end users.
What do I mean by this? Well, that Arrow JIRA is interesting: it looks like
there is a benefit to getting Drill and Arrow on the same path. Yet asking the
community "Is there interest in this?" is a very difficult proposition. I would
love to chime in on this topic, but I don't understand enough of what is
happening to make an informed comment. This is true of Arrow, and it's true of
Parquet.
For Parquet, there are two readers included in Apache Drill. There are a number
of options for encoding in the writer, and there are settings for row group
sizes, compression, etc. How do these all play out? For end users who are
trying to read Parquet files created with older versions of Parquet, or with
the versions of Parquet used by Spark, Impala, Hive, etc., how can we give them
better guidance on things to try in order to improve performance or
troubleshoot errors in queries?
Yes, there are lots of JIRAs and/or code comments around these projects.
However, it would help to have better documentation of where we are now with
some of these critical projects (Calcite as well?). Are we using releases of
those projects? Have we written Drill's own version (like the Parquet reader)?
Are we on forks of other projects? Do we have project goals? E.g., do we
believe it would be a good project goal to work toward using a standardized
Parquet writer, but still use our reader? What about the Arrow integration?
What benefits would an end user see?
For some of these major components, describing what the current challenges are,
what the potential future states could be, and what those future states could
bring the end user could help generate interest among users, or even lead them
to contribute to moving the future state forward. In addition, a page or pages
on roadmaps, features, tweaks, etc. on the documentation website could link to
relevant JIRAs and provide a way to track progress.