John Russell has posted comments on this change. Change subject: [DOCS] Major update to Impala + Kudu page ......................................................................
Patch Set 6: (36 comments) http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_invalidate_metadata.xml File docs/topics/impala_invalidate_metadata.xml: PS4, Line 245: <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> : <p conref="../shared/impala_common.xml#common/kudu_metadata_intro"/> > Where do we talk about Kudu table metadata? We need to mention that source OK, I'm going to implement that with a #include-style mechanism to use the same wording under both REFRESH and INVALIDATE METADATA. The actual text will go in docs/shared/impala_common.xml. http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml File docs/topics/impala_kudu.xml: PS4, Line 39: d by the Apache Kudu componen > stored by Apache Kudu OK. I still phrase it as "the Apache Kudu component" just because trademark rules say to always use the trademark as an adjective. E.g. "the ___ database system", "the ___ product", "the ___ storage engine", etc. PS4, Line 98: ultiple Kudu hosts > should say "separated by commas" Done PS4, Line 105: TBLPROPERTIES > should say which table property to set Done PS4, Line 147: physicall > I'd say these are physical Done PS4, Line 149: tablet servers is independent of the number of DataNodes in the cluster. > One question that someone might have is whether tablet servers and data nod I will mention that it's possible but not required. That's something I got from Juan's training session. Is there some nuance where, e.g. a join of a Kudu table with a Parquet table would benefit from tablet server and impalad running on the same host? PS4, Line 156: : One consideration for the cluster topology is that the number of replicas for a Kudu table : must be odd. : </p> > I'd scrap this whole <p> Leave the first sentence or not? There was a fair bit of discussion of tablets and tablet servers in the JD and Juan training sessions. I feel like I'm undercovering it from the Impala side. Although I'm happy to leave that for a future update, or link over to the Kudu docs for such detail. PS4, Line 200: umns, whose values are combined and > worth noting that it's the _tuple_ that has to be unique. EG you can have P Done PS4, Line 201: be unique and cannot contain any : <codeph>NULL</codeph> v > Remove. There are only partitioned Kudu tables. Done PS4, Line 221: On the logical side, the uniqueness constraint allows you to avoid duplicate data in a table. : For example, if an <codeph>INSERT</codeph> operation fails partway through, only some of the : new rows might be present in the table. You can re-run the same <codeph>INSERT</codeph>, and : only the missing rows will be added. Or if data in the table is stale, you can run an : <codeph>UPSERT</codeph> statement that brings > Why blending the semantics of upsert and insert with the description of pri This is a subject I see come up again and again in Kudu discussions, so I might need to include it in multiple places -- when introducing the notion of the PK uniqueness constraint, in the reference pages for INSERT / UPSERT, and in discussions of ETL procedures. Why don't I leave it here for the time being and see what user feedback is like. PS4, Line 233: codeph> clauses and <codeph>NOT NULL</codeph> > How about the nullability restrictions on non-primary key columns. Where ar I'll take out the "on the primary key columns" wording. PS4, Line 247: onbody> : : <p> : For the general syntax of the <codeph>CREATE TABLE</codeph> : st > I am not sure I understand the purpose of this paragraph here. This either I'll remove it from here. PS4, Line 269: : <p outputclass="toc inpage"> : See the following sections for details about each column attribute. : </p> : : </conbody> > Wouldn't it make more sense to first describe the Kudu-specific CREATE TABL I've rearranged this area to be under a "DDL" subheading, and added a couple of links back to the CREATE TABLE page. It'll be a work item in the future to fine-tune how much syntax gets repeated here, or how much detail and examples goes into the SQL Ref pages for DDL and DML statements. PS4, Line 299: : <p> > Even though this is the place to formally define the syntax for PRIMARY KEY Done PS4, Line 368: : The contents of the primary key columns cannot be changed by an : <codeph>UPDATE</codeph> or <co > Remove. Why don't I rephrase it a little. Juan's presentation mentioned how performance could suffer if there were more then 5-6 columns in the primary key. PS4, Line 383: ept> : : <concept id="kudu_not_null_attribute"> : > Only primary key columns can be used in the PARTITION BY clause. This is a Done PS4, Line 399: <p> : For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph> : values, because it is not practical to enforce a <q>not null</q> constraint on HDFS : data files that could be prepared using external tools and ETL processes. : </p> > Move below L411. Done PS4, Line 407: : For example, a table containing geographic information m > I would rephrase it to something like "If an application requires a field t Done Line 438: </p> > Maybe we should extend this paragraph and say that for those reasons, it is Done PS4, Line 449: : <concept id="kudu_default_attribute"> : > Do you suggest that this is not executed because we know it doesn't return This deduction was disproved by the EXPLAIN plan, so removing this example. PS4, Line 460: function calls. : </p> : > that's true, but I think it's good practice to specify NOT NULL regardless, I won't hint at future features, but why don't I say the code is more self-describing with NOT NULL included. PS4, Line 531: : <!-- GROUP_VARINT is internal use only, not documenting that although > I would mention the encodings and maybe some guidelines on which one to use Got some more details on these from Dan Burkert. For the real heavy lifting of definitions and usage, I'll link to the Kudu docs. I see the Impala parser accepts GROUP_VARINT but that gave me an error applying it to TINYINT or BIGINT columns. Does it serve any purpose right now? Line 553: <li> > I'd disagree with this -- in many cases encodings have a net speedup - ther Done PS4, Line 567: > : The following example shows the Impala keywords representing the encoding types. : (The Impala keywords match the symbolic names used within Kudu.) : For usage guidelines on the different kinds of encoding, see : <xref href="https://kudu.apache.org/docs/schema_design.html" scope="external" format="html">the Kudu documentation</xref>. : The <codeph>DESCRIBE</codeph> output > Every compression alg has a tradeoff of compression ratio and cpu overhead I'll ping him offline rather than add him as a reviewer and dump the whole thing on him. PS4, Line 592: ------+-----------------+ > that's slightly unexpected. I think a better example for dictionary would b Done PS4, Line 597: | AUTO_ENCODING | > seems somewhat contrived. Prefix encoding isn't super useful for most cases OK, I'll take out this sentence entirely. PS4, Line 601: : | c7 | string | false | true | PREFIX_ENCODING | : +------+---------+-------------+----------+-----------------+ : </codeblock> : > hrm, I'm not sure I'd draw such a distinction between LZ4 and SNAPPY. In fa Done PS4, Line 636: : <p> > I think I'd not mention cfiles here, but say that there is an underlying un Done PS4, Line 642: therefore this column is a good candidate for dictionary encoding. The : <codeph>post_id</codeph> column contains an ascending sequence of integers, where : several leading bits are likely to be all zeroes, therefore this column is a good : > nope, not this block. Done PS4, Line 648: they employ the <codeph>COMPRESSION</codeph> attribute instead. The ideal compression : codec in each case would require some experimentation to determine how much space : savings it provided and how much CPU overhead it added, based on real-world data. : </p> : : <codeblock> : CREATE TABLE blog_posts : ( : user_id STRING ENCODING DICT_ENCODING, : post_id BIGINT ENCODING BIT_SHUFFLE, : subject STRING ENCODING PLAIN_ENCODING, : body STRING COMPRESSION LZ4, : spanish_translation STRING COMPRESSION SNAPPY, : esperanto_translation STRING COMPRESSION ZLIB, : PRIMARY KEY (user_id, post_id) : ) PARTITION > I think it's probably better to just defer the user to the Kudu docs for bl Where's the right page to link to? I find most references to block size on client API reference pages like https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduColumnStorageAttributes.html which don't have much usage advice. PS4, Line 690: > PARTITION Done PS4, Line 691: LE performance_for_benchmark_xyz : ( : id BIGINT PRIMARY KEY, > Why not putting the formal syntax instead of trying to describe the differe I'm trying to outline the concepts behind Kudu partitioning before sending readers off to the CREATE TABLE or ALTER TABLE pages for the full syntax. http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_refresh.xml File docs/topics/impala_refresh.xml: Line 337: <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> > AFAIK no I'm going to reuse the same discussion of when to run REFRESH or INVALIDATE METADATA, and why they're needed less frequently, as I put under INVALIDATE METADATA. http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_show.xml File docs/topics/impala_show.xml: Line 1240: partition covers the same range as originally specified. > is this example supposed to be here? This is the spot for examples of this statement for tables of all types. I'll clarify with some intro text, and move the earlier Kudu example down here too. http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml File docs/topics/impala_tables.xml: PS4, Line 293: Tables using the Apache Kudu storage system are treated specially, because Kudu manages its data independently of HDFS files. : Some information about the table is stored in the metastore database for use by Impala. Other table metadata is : managed internally by Kudu. > Where do we talk about external vs managed Kudu tables? Details to be added. I have been working purely with Impala-created tables so my knowledge about external tables is mainly from the Kudu training sessions. PS4, Line 308: <codeph># Rows</codeph> (although this number is not currently computed), <codeph>Start Key</codeph>, <codeph>Stop Key</codeph>, <codeph>Leader Replica</codeph>, and <codeph># Replicas</codeph>. : > worth noting that #rows is inaccurate Done -- To view, visit http://gerrit.cloudera.org:8080/5649 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell <[email protected]> Gerrit-Reviewer: Ambreen Kazi <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: John Russell <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
