[Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page

John Russell (Code Review) Thu, 19 Jan 2017 22:51:03 -0800

John Russell has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................



Patch Set 6:

(36 comments)

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_invalidate_metadata.xml
File docs/topics/impala_invalidate_metadata.xml:

PS4, Line 245: <p rev="kudu" 
conref="../shared/impala_common.xml#common/kudu_blurb"/>
             :     <p 
conref="../shared/impala_common.xml#common/kudu_metadata_intro"/>
> Where do we talk about Kudu table metadata? We need to mention that source 
OK, I'm going to implement that with a #include-style mechanism to use the same 
wording under both REFRESH and INVALIDATE METADATA. The actual text will go in 
docs/shared/impala_common.xml.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS4, Line 39: d by the Apache Kudu componen
> stored by Apache Kudu
OK. I still phrase it as "the Apache Kudu component" just because trademark 
rules say to always use the trademark as an adjective. E.g. "the ___ database 
system", "the ___ product", "the ___ storage engine", etc.


PS4, Line 98: ultiple Kudu hosts 
> should say "separated by commas"
Done


PS4, Line 105: TBLPROPERTIES
> should say which table property to set
Done


PS4, Line 147: physicall
> I'd say these are physical
Done


PS4, Line 149: tablet servers is independent of the number of DataNodes in the 
cluster.
> One question that someone might have is whether tablet servers and data nod
I will mention that it's possible but not required. That's something I got from 
Juan's training session. Is there some nuance where, e.g. a join of a Kudu 
table with a Parquet table would benefit from tablet server and impalad running 
on the same host?


PS4, Line 156: 
             :           One consideration for the cluster topology is that the 
number of replicas for a Kudu table
             :           must be odd.
             :         </p>
> I'd scrap this whole <p>
Leave the first sentence or not? There was a fair bit of discussion of tablets 
and tablet servers in the JD and Juan training sessions. I feel like I'm 
undercovering it from the Impala side. Although I'm happy to leave that for a 
future update, or link over to the Kudu docs for such detail.


PS4, Line 200: umns, whose values are combined and
> worth noting that it's the _tuple_ that has to be unique. EG you can have P
Done


PS4, Line 201:  be unique and cannot contain any
             :           <codeph>NULL</codeph> v
> Remove. There are only partitioned Kudu tables.
Done


PS4, Line 221: On the logical side, the uniqueness constraint allows you to 
avoid duplicate data in a table.
             :               For example, if an <codeph>INSERT</codeph> 
operation fails partway through, only some of the
             :               new rows might be present in the table. You can 
re-run the same <codeph>INSERT</codeph>, and
             :               only the missing rows will be added. Or if data in 
the table is stale, you can run an
             :               <codeph>UPSERT</codeph> statement that brings 
> Why blending the semantics of upsert and insert with the description of pri
This is a subject I see come up again and again in Kudu discussions, so I might 
need to include it in multiple places -- when introducing the notion of the PK 
uniqueness constraint, in the reference pages for INSERT / UPSERT, and in 
discussions of ETL procedures. Why don't I leave it here for the time being and 
see what user feedback is like.


PS4, Line 233: codeph> clauses and <codeph>NOT NULL</codeph>
> How about the nullability restrictions on non-primary key columns. Where ar
I'll take out the "on the primary key columns" wording.


PS4, Line 247: onbody>
             : 
             :         <p>
             :           For the general syntax of the <codeph>CREATE 
TABLE</codeph>
             :           st
> I am not sure I understand the purpose of this paragraph here. This either 
I'll remove it from here.


PS4, Line 269: 
             :         <p outputclass="toc inpage">
             :           See the following sections for details about each 
column attribute.
             :         </p>
             : 
             :       </conbody>
> Wouldn't it make more sense to first describe the Kudu-specific CREATE TABL
I've rearranged this area to be under a "DDL" subheading, and added a couple of 
links back to the CREATE TABLE page. It'll be a work item in the future to 
fine-tune how much syntax gets repeated here, or how much detail and examples 
goes into the SQL Ref pages for DDL and DML statements.


PS4, Line 299: 
             :           <p>
> Even though this is the place to formally define the syntax for PRIMARY KEY
Done


PS4, Line 368: 
             :             The contents of the primary key columns cannot be 
changed by an
             :             <codeph>UPDATE</codeph> or <co
> Remove.
Why don't I rephrase it a little. Juan's presentation mentioned how performance 
could suffer if there were more then 5-6 columns in the primary key.


PS4, Line 383: ept>
             : 
             :       <concept id="kudu_not_null_attribute">
             : 
> Only primary key columns can be used in the PARTITION BY clause. This is a 
Done


PS4, Line 399: <p>
             :             For non-Kudu tables, Impala allows any column to 
contain <codeph>NULL</codeph>
             :             values, because it is not practical to enforce a 
<q>not null</q> constraint on HDFS
             :             data files that could be prepared using external 
tools and ETL processes.
             :           </p>
> Move below L411.
Done


PS4, Line 407: 
             :             For example, a table containing geographic 
information m
> I would rephrase it to something like "If an application requires a field t
Done


Line 438:           </p>
> Maybe we should extend this paragraph and say that for those reasons, it is
Done


PS4, Line 449: 
             :       <concept id="kudu_default_attribute">
             : 
> Do you suggest that this is not executed because we know it doesn't return 
This deduction was disproved by the EXPLAIN plan, so removing this example.


PS4, Line 460:             function calls.
             :           </p>
             : 
> that's true, but I think it's good practice to specify NOT NULL regardless,
I won't hint at future features, but why don't I say the code is more 
self-describing with NOT NULL included.


PS4, Line 531: 
             :               <!-- GROUP_VARINT is internal use only, not 
documenting that although 
> I would mention the encodings and maybe some guidelines on which one to use
Got some more details on these from Dan Burkert. For the real heavy lifting of 
definitions and usage, I'll link to the Kudu docs.

I see the Impala parser accepts GROUP_VARINT but that gave me an error applying 
it to TINYINT or BIGINT columns. Does it serve any purpose right now?


Line 553:               <li>
> I'd disagree with this -- in many cases encodings have a net speedup - ther
Done


PS4, Line 567: >
             :             The following example shows the Impala keywords 
representing the encoding types.
             :             (The Impala keywords match the symbolic names used 
within Kudu.)
             :             For usage guidelines on the different kinds of 
encoding, see
             :             <xref 
href="https://kudu.apache.org/docs/schema_design.html"; scope="external" 
format="html">the Kudu documentation</xref>.
             :             The <codeph>DESCRIBE</codeph> output
> Every compression alg has a tradeoff of compression ratio and cpu overhead 
I'll ping him offline rather than add him as a reviewer and dump the whole 
thing on him.


PS4, Line 592: ------+-----------------+
> that's slightly unexpected. I think a better example for dictionary would b
Done


PS4, Line 597:    | AUTO_ENCODING   |
> seems somewhat contrived. Prefix encoding isn't super useful for most cases
OK, I'll take out this sentence entirely.


PS4, Line 601: 
             : | c7   | string  | false       | true     | PREFIX_ENCODING |
             : +------+---------+-------------+----------+-----------------+
             : </codeblock>
             : 
> hrm, I'm not sure I'd draw such a distinction between LZ4 and SNAPPY. In fa
Done


PS4, Line 636: 
             :           <p>
> I think I'd not mention cfiles here, but say that there is an underlying un
Done


PS4, Line 642:   therefore this column is a good candidate for dictionary 
encoding. The
             :             <codeph>post_id</codeph> column contains an 
ascending sequence of integers, where
             :             several leading bits are likely to be all zeroes, 
therefore this column is a good
             :            
> nope, not this block.
Done


PS4, Line 648:   they employ the <codeph>COMPRESSION</codeph> attribute 
instead. The ideal compression
             :             codec in each case would require some 
experimentation to determine how much space
             :             savings it provided and how much CPU overhead it 
added, based on real-world data.
             :           </p>
             : 
             : <codeblock>
             : CREATE TABLE blog_posts
             : (
             :   user_id STRING ENCODING DICT_ENCODING,
             :   post_id BIGINT ENCODING BIT_SHUFFLE,
             :   subject STRING ENCODING PLAIN_ENCODING,
             :   body STRING COMPRESSION LZ4,
             :   spanish_translation STRING COMPRESSION SNAPPY,
             :   esperanto_translation STRING COMPRESSION ZLIB,
             :   PRIMARY KEY (user_id, post_id)
             : ) PARTITION
> I think it's probably better to just defer the user to the Kudu docs for bl
Where's the right page to link to? I find most references to block size on 
client API reference pages like

https://kudu.apache.org/cpp-client-api/classkudu_1_1client_1_1KuduColumnStorageAttributes.html

which don't have much usage advice.


PS4, Line 690: 
> PARTITION
Done


PS4, Line 691: LE performance_for_benchmark_xyz
             : (
             :   id BIGINT PRIMARY KEY,
> Why not putting the formal syntax instead of trying to describe the differe
I'm trying to outline the concepts behind Kudu partitioning before sending 
readers off to the CREATE TABLE or ALTER TABLE pages for the full syntax.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_refresh.xml
File docs/topics/impala_refresh.xml:

Line 337:     <p rev="kudu" 
conref="../shared/impala_common.xml#common/kudu_blurb"/>
> AFAIK no
I'm going to reuse the same discussion of when to run REFRESH or INVALIDATE 
METADATA, and why they're needed less frequently, as I put under INVALIDATE 
METADATA.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_show.xml
File docs/topics/impala_show.xml:

Line 1240:         partition covers the same range as originally specified.
> is this example supposed to be here?
This is the spot for examples of this statement for tables of all types. I'll 
clarify with some intro text, and move the earlier Kudu example down here too.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS4, Line 293: Tables using the Apache Kudu storage system are treated 
specially, because Kudu manages its data independently of HDFS files.
             :         Some information about the table is stored in the 
metastore database for use by Impala. Other table metadata is
             :         managed internally by Kudu.
> Where do we talk about external vs managed Kudu tables?
Details to be added. I have been working purely with Impala-created tables so 
my knowledge about external tables is mainly from the Kudu training sessions.


PS4, Line 308:         <codeph># Rows</codeph> (although this number is not 
currently computed), <codeph>Start Key</codeph>, <codeph>Stop Key</codeph>, 
<codeph>Leader Replica</codeph>, and <codeph># Replicas</codeph>.
             :   
> worth noting that #rows is inaccurate
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Ambreen Kazi <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>
Gerrit-Reviewer: John Russell <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page

Reply via email to