[Impala-ASF-CR] Major update to Impala + Kudu page

Todd Lipcon (Code Review) Tue, 10 Jan 2017 18:57:34 -0800

Todd Lipcon has posted comments on this change.

Change subject: Major update to Impala + Kudu page
......................................................................



Patch Set 4:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_alter_table.xml
File docs/topics/impala_alter_table.xml:

PS4, Line 887: Any dropped range must not contain
             :       any data values in that range.
I don't believe this is true -- dropping a range is an efficient way to 
bulk-delete a bunch of data for workloads like rolling-window time series


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_compute_stats.xml
File docs/topics/impala_compute_stats.xml:

Line 55: <!-- Is kudu_partition_spec applicable here? -->
nope, because afaik we don't have partition-level stats for kudu (they aren't 
partitions as far as HMS is concerned)


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_create_table.xml
File docs/topics/impala_create_table.xml:

PS4, Line 107: expression
'expression' here makes it sound like we support computed defaults. Perhaps 
it's better to say 'constant' or 'value'?


Line 122:   RANGE
this should be RANGE (<varname>pk_col</varname> [, ...]) right?


PS4, Line 125: constant
perhaps say 'tuple' or something instead? for composite PKs you need something 
like: PARTITION ('foo', 1) < VALUES


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_grant.xml
File docs/topics/impala_grant.xml:

Line 134:       Access to Kudu tables must be granted to roles as usual.
is it worth noting here that grant/revoke from Impala has no bearing on access 
via the direct Kudu APIs?


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS4, Line 39: use the Apache Kudu component
stored by Apache Kudu


PS4, Line 98: ultiple Kudu hosts.
should say "separated by commas"


PS4, Line 105: TBLPROPERTIES
should say which table property to set


PS4, Line 147: logically
I'd say these are physical


PS4, Line 155: Kudu tablet servers
no, just the number of replicas of a given table must be odd. You could have 4 
tablet servers, but replication count 3


PS4, Line 156:  When you set up a new cluster, the number of tablet servers 
might not
             :           exactly match the number of DataNodes. If you add or 
remove one host from an existing
             :           cluster, be careful not to change the number of tablet 
servers from 3 to 2, or from 99
             :           to 100, and so on.
I'd scrap this whole <p>


PS4, Line 200: cannot contain any duplicate values
worth noting that it's the _tuple_ that has to be unique. EG you can have 
PRIMARY KEY (a, b) and have values (1,1), (1,2), (2, 1), (2,2). I don't think 
that's clear here.


PS4, Line 460:           <p>
             :             Because primary key columns cannot contain any 
<codeph>NULL</codeph> values, you can
             :             omit the <codeph>NOT NULL</codeph> clause for the 
primary key columns.
that's true, but I think it's good practice to specify NOT NULL regardless, in 
case we ever add the ability for a nullable PK.


Line 553:             them with the default <codeph>PLAIN</codeph> encoding.
I'd disagree with this -- in many cases encodings have a net speedup - there's 
some small CPU cost but the compression ratio can be very good. Especially if 
you have something like a timestamp value or transaction ID in your PK, 
bitshuffle will compress it quite well and have a negligible effect on 
performance of a key lookup, since the cost of decoding one page is nothing 
compared to the cost of the random disk IO, etc.


PS4, Line 592: user_id</codeph> values come from a specific set of strings
that's slightly unexpected. I think a better example for dictionary would be 
something like 'ship_status STRING" from tpch, or 'state' or 'country'


PS4, Line 597: subjects might start with the same first words
seems somewhat contrived. Prefix encoding isn't super useful for most cases so 
maybe not worth documenting. (we use it internally for storing the reified 
composite key index where it is quite effective)


PS4, Line 601: The original text
             :             from the <codeph>body</codeph> column is the most 
frequently accessed, therefore it
             :             uses the compression with the least CPU overhead to 
expand. The
             :             <codeph>spanish_translation</codeph> is accessed 
less frequently, therefore it uses
             :             a compression format that produces a smaller result 
but takes more CPU cycles to
hrm, I'm not sure I'd draw such a distinction between LZ4 and SNAPPY. In fact 
LZ4 has some extensions which we aren't yet using (but should be at some point) 
which will probably be able to get it to be more dense than snappy as well as 
being faster, at which point we'd stop encouraging snappy


PS4, Line 636: into units known as
             :             <term>cfiles</term>. The cfiles for each column can 
have a specified block size
I think I'd not mention cfiles here, but say that there is an underlying unit 
of IO which is the "block size".


PS4, Line 642:   Certain internal characteristics of the data storage use the 
block as a unit of
             :             measurement. For example, when a column uses 
dictionary encoding, each block
             :             contains its own dictionary representing only the 
column values in that block.
             :           <
nope, not this block.


PS4, Line 648:   By default, the block size is 256 KiB. The minimum block size 
is... The maximum
             :             block size is...
             :           </p>
             : 
             :           <p>
             :             A small block size is good for responsiveness during 
frequent DML operations. Less data is rearranged
             :             as rows are inserted, deleted, or updated.
             :           </p>
             : 
             :           <p>
             :             A large block size is good for fast query 
performance for large scan operations. When large volumes
             :             of data are retrieved by a query, it is more 
efficient to have a small number of decode and uncompress
             :             operations that each apply to a large number of rows.
             :           </p>
             : 
             : <codeblock>
I think it's probably better to just defer the user to the Kudu docs for block 
size instead of duplicating advice here, given it's a relatively advanced 
feature, and for most Impala cases the default will be fine.


PS4, Line 751: 60
this is cluster-dependent, and based on a Kudu configuration. Would not 
document a max, but just recommend something like 10 partitions per server in 
your cluster for large tables.


PS4, Line 788: -- 50 buckets for IDs beginning with a lowercase letter
             : -- plus 50 buckets for IDs beginning with an uppercase letter.
             : -
might be worth a note in the text above that this is multiplicative with any 
hash partition clauses


PS4, Line 874:           <p>
             :             When a range is removed, no data can exist in the 
table within that range. If some
             :             rows do have column values within the removed range, 
the operation fails.
             :           </p>
I don't believe this is true (or at least it's not our intention that it is)


PS4, Line 884:         <p>
             :           Kudu tables can also use a combination of hash and 
range partitioning.
             :         </p>
fill this out?


PS4, Line 908:           <p>
             :             To see the distribution of data in a Kudu table 
across the underlying buckets and
             :             partitions, use the <codeph>SHOW TABLE 
STATS</codeph> statement.
             :           </p>
I don't think this will show per-tablet sizes currently


PS4, Line 977:         <p>
             :           The database name on the Impala side is or isn't 
encoded into the underlying Kudu
             :           table name.
             :         </p>
not true anymore. now it's my_db::my_table


PS4, Line 1007:         Because Kudu manages its own storage layer is optimized 
for smaller block sizes than
              :         HDFS, and performs its own housekeeping to keep data 
evenly distributed, it is not
              :  
some missing words in this sentence


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_refresh.xml
File docs/topics/impala_refresh.xml:

Line 337:     <!-- Anything to say for REFRESH with Kudu? -->
AFAIK no


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_show.xml
File docs/topics/impala_show.xml:

Line 1240: <codeblock rev="1.4.0">[localhost:21000] &gt; show partitions census;
is this example supposed to be here?


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS4, Line 308:         <codeph># Rows</codeph>, <codeph>Start Key</codeph>, 
<codeph>Stop Key</codeph>, <codeph>Leader Replica</codeph>, and <codeph># 
Replicas</codeph>.
             :   
worth noting that #rows is inaccurate


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <[email protected]>
Gerrit-Reviewer: Ambreen Kazi <[email protected]>
Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]>
Gerrit-Reviewer: John Russell <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] Major update to Impala + Kudu page

Reply via email to