FYI: I started an Apache wiki page with an outline of existing docs and current thoughts/status about how to disentangle the parts that overlap with CDH-specific or CM-specific portions:
https://cwiki.apache.org/confluence/display/IMPALA/Documentation

People who are interested in that side of the project should probably put a watch on that page.

I guess we should also think about things related to performance / planning / sizing. A considerable amount of content from the “Impala Performance Cookbook” is reproduced in the docs. Is all of that appropriate to donate to Apache? Or are things that are mainly “observations from / advice to the FCE group” more appropriate to keep on the Cloudera side? We’ll probably need to look at that in closer detail. I’m talking about pages like:

http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_cluster_sizing.html
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_schema_design.html
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_perf_skew.html
http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/impala_scalability.html

For the moment, my expectation is that the doc material that is donated to Apache will continue to be reproduced (perhaps with CDH-specific bonus content) in the main Cloudera docs. There is some degree of customer expectation that we’ve set, and I don’t see a good way to remove the material entirely without causing a lot of disruption. E.g. readers and searchers would find many instances of older Impala docs on cloudera.com, but not the latest ones. I don’t think we could physically remove the Impala doc content during the life of the CDH 5.x series; I don’t know if it would be practical right at the launch of CDH (N>5). That’s something we can revisit as we develop more concrete ideas.

John

> On Jan 4, 2016, at 6:55 AM, Tom White <[email protected]> wrote:
>
> On Thu, Dec 31, 2015 at 5:17 AM, John Russell <[email protected]> wrote:
>>
>> I would say there's a fair bit of decision-making and followup work having
>> to do with documentation.
>
> Agreed. This doesn't need to be tied up with the podling report
> though, so I've changed the subject line to reflect that.
>
>>
>> For example, the current Impala docs that are embedded within the Cloudera
>> doc library cover a wide range of subjects:
>>
>> - "How to use Impala with <component XYZ>". For example, Impala with
>> Sentry, Impala with HBase, Impala with S3, Impala with Isilon... Some
>> components are Hadoop-based, others are more specific to what's shipped or
>> integrated with CDH. I feel like we should have a spreadsheet because these
>> seem like decisions to make on a case-by-case basis.
>>
>> - "How to do <task XYZ> with Impala". Performance tuning, troubleshooting,
>> deployment planning. Same kinds of considerations as the previous bullet.
>> Many of these aren't strictly part of core Impala features, rather they're
>> things that could have been delivered via blog posts, O'Reilly books, etc.
>> Again, there could be some amount of identifying / deciding / untangling to
>> produce the right subset to go in Apache-oriented docs.
>>
>> - "How to do <task XYZ> with Impala in Cloudera Manager". That seems like
>> an easy call to say, that kind of stuff doesn't get donated to Apache
>> because it's CDH-specific. That kind of content though is intermixed with
>> "how to do <task XYZ> _without_ Cloudera Manager" so it would be some work
>> to untangle instructions like that.
>>
>> - "CREATE TABLE" and similar language reference stuff. Doesn't every SQL
>> engine in the open source arena come with a language reference of one sort
>> or another... So I assume there has to be something either donated or
>> created from scratch along those lines. (Although my open source experience
>> is with MySQL, where the docs are under a more restrictive license than the
>> software, so I don't have exact precedents to go by.)
>>
>> Assuming that some amount of existing CDH doc is donated, then for purposes
>> of building, accepting contributions, etc.
>> do we need to convert the content
>> to some particular format or use some specific build system? The doc
>> content that I'm talking about is currently in XML, with a DTD (DITA) that
>> can be built using an all-open-source toolchain. The format and toolchain
>> might be a little more heavyweight than on a lot of other Apache projects.
>
> There's no mandated documentation system for projects at Apache, so
> using DITA shouldn't be a problem, especially since it can be built
> using an open source toolchain, as you point out. Having some
> instructions on how to build the docs would be useful if they don't
> already exist.
>
>>
>> The main advantage of the current format for the Impala doc library is ease
>> of reuse. So there's the question of whether Apache-donated stuff doc like
>> language reference then _only_ exists in the context of the project site, or
>> gets reused within the doc library on cloudera.com. There are pros and cons
>> either way. Even if we centralize future docs on the impala.io site, so
>> there isn't a new instance corresponding to each new CDH x.y release, there
>> are still all the older instances of those pages from CDH 4.x, CDH 5.x,
>> Impala 1.x, and Impala 2.x docs on cloudera.com.
>>
>> I've been cogitating over these considerations the last few weeks, but no
>> approach has really jumped out at me as a slam dunk:
>>
>> a) Rip as much existing doc out of the Cloudera library as possible, convert
>> to the most contributor-friendly format, decouple entirely from the CDH
>> library?
>> b) Donate core Impala feature docs only, keep the XML format the same,
>> encourage verbatim reuse of doc content across CDH and other distributions
>> that include Impala?
>
> I would vote for a combination of a and b - donate all
> non-CDH-specific doc, and keep the existing XML format (DITA).
>
> Cheers,
> Tom
>
>> c) Some middle ground?
>> For example, it would be possible to mix and match
>> the current XML doc format with user-contributed content in Markdown format.
>>
>> Thanks,
>> John
>>
>>> On Dec 30, 2015, at 3:07 PM, Henry Robinson <[email protected]> wrote:
>>>
>>> Hi all -
>>>
>>> Here's a draft of our inaugural podling report. Per the usual guidelines,
>>> Impala has to submit three monthly reports to the Incubator PPMC, after
>>> which we report every quarter. The purpose of the report is to expose the
>>> current state of the graduation effort to the Incubator, and to flag any
>>> problems that require Incubator attention.
>>>
>>> I hope this report also sheds a little light on what is needed to be done
>>> to move Impala's development in its entirety to the ASF and its
>>> infrastructure. We are looking forward to making quick progress on some of
>>> these items in 2016.
>>>
>>> If anyone has any further comments or edits they'd like to make, please
>>> respond to this thread. I am on a short timeline as I fly internationally
>>> tomorrow and will be out of contact for about ten days, so I plan to post
>>> this to the Incubator wiki tomorrow morning. Any edits can then be made
>>> there.
>>>
>>> Thanks,
>>> Henry
>>>
>>> --------------------
>>> Impala
>>>
>>> Impala is a high-performance C++ and Java SQL query engine for data stored
>>> in Apache Hadoop-based clusters.
>>>
>>> Impala has been incubating since 2015-12-03.
>>>
>>> Three most important issues to address in the move towards graduation:
>>>
>>> 1. Resolve any issues around use of Gerrit as code-review tool.
>>> 2. Movement of existing JIRA / Git / wiki / e-mail resources to Apache
>>> equivalents
>>> 3. Initial release as incubating project.
>>>
>>> Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
>>> aware of?
>>>
>>> None.
>>>
>>> How has the community developed since the last report?
>>>
>>> Slowly - Impala is still in the very early stages of incubation, and
>>> performing the mechanical tasks of code movement and infrastructure setup
>>> is our first priority. The holiday period in the United States has slowed
>>> this effort slightly, but we look forward to picking up pace in early 2016.
>>> There have been no additions to the committer or PMC lists since incubation
>>> began.
>>>
>>> How has the project developed since the last report?
>>>
>>> We have performed some of the basic initial tasks for incubation -
>>> establishing wiki pages, Git repositories and accounts for the initial
>>> committer set. Our next steps are:
>>>
>>> 1. Finalize the SGA from Cloudera
>>> 2. Move existing @cloudera.org e-mail aliases to their
>>> @impala.incubator.apache.org equivalents.
>>> 3. Move source code from Cloudera git repository to Apache git repo.
>>> 4. Improve out-of-box build and test experience so that community can
>>> easily evaluate release artifacts.
>>> 5. Migrate cloudera.org JIRA tickets to issues.apache.org.
>>>
>>>
>>> Date of last release:
>>>
>>> NA
>>>
>>> When were the last committers or PMC members elected?
>>>
>>> At the time of the Incubation vote, 2015-12-03.
>>
