John Omernik created DRILL-5471:
-----------------------------------

             Summary: Provide better documentation around Parquet, Options and 
Integration with Arrow
                 Key: DRILL-5471
                 URL: https://issues.apache.org/jira/browse/DRILL-5471
             Project: Apache Drill
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.10.0
            Reporter: John Omernik


Apache Drill makes heavy use of the Apache Parquet file format, which is a 
great thing.  However, with the advent of Apache Arrow, and JIRAs like 
https://issues.apache.org/jira/browse/DRILL-4455, understanding how Drill 
integrates with the projects it depends on (Parquet, Arrow) is important, yet 
that integration is very opaque to end users.

What do I mean by this? That Arrow JIRA is interesting: it looks like there 
is a benefit to getting Drill and Arrow on the same path. Yet asking the 
community "Is there interest in this?" is a very difficult proposition. I would 
love to chime in on this topic, but I don't understand enough of what is 
happening to make an informed comment.  This is true of Arrow, and it's true of 
Parquet.

For Parquet, there are two readers included in Apache Drill. There are a number 
of options for encoding in the writer, and there are settings for row group 
sizes, compression, etc.  How do these all play out?  For end users who are 
trying to read Parquet files created with older versions of Parquet, or with 
the versions of Parquet used by Spark, Impala, Hive, etc., how can we better 
give them things to try in order to get better performance or troubleshoot 
errors in queries?

Yes, there are lots of JIRAs and code comments around these projects. However, 
it would help to have better documentation of where we are now with some of 
these critical projects (Calcite as well?).  Are we using releases of those 
projects? Have we written Drill's own version (like a Parquet reader?), or are 
we on forks of other projects?  Do we have project goals? E.g., do we believe 
it would be a good project goal to work toward using a standardized Parquet 
writer, but still use our own reader? What about the Arrow integration?  What 
benefits would an end user see?

For some of these major components, describing what the current challenges are, 
what the potential future states could be, and what those future states could 
bring the end user could help users take an interest in, or even contribute to, 
moving the future state forward.  In addition, a page or pages on roadmaps, 
features, tweaks, etc. on the Documentation website could link to relevant 
JIRAs and provide a way to track progress.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
