Todd Lipcon has posted comments on this change. Change subject: Major update to Impala + Kudu page ......................................................................
Patch Set 4: (31 comments) http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_alter_table.xml File docs/topics/impala_alter_table.xml: PS4, Line 887: Any dropped range must not contain : any data values in that range. I don't believe this is true -- dropping a range is an efficient way to bulk-delete a bunch of data for workloads like rolling-window time series http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_compute_stats.xml File docs/topics/impala_compute_stats.xml: Line 55: <!-- Is kudu_partition_spec applicable here? --> nope, because afaik we don't have partition-level stats for kudu (they aren't partitions as far as HMS is concerned) http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_create_table.xml File docs/topics/impala_create_table.xml: PS4, Line 107: expression 'expression' here makes it sound like we support computed defaults. Perhaps it's better to say 'constant' or 'value'? Line 122: RANGE this should be RANGE (<varname>pk_col</varname> [, ...]) right? PS4, Line 125: constant perhaps say 'tuple' or something instead? for composite PKs you need something like: PARTITION ('foo', 1) < VALUES http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_grant.xml File docs/topics/impala_grant.xml: Line 134: Access to Kudu tables must be granted to roles as usual. is it worth noting here that grant/revoke from Impala has no bearing on access via the direct Kudu APIs? http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml File docs/topics/impala_kudu.xml: PS4, Line 39: use the Apache Kudu component stored by Apache Kudu PS4, Line 98: ultiple Kudu hosts. should say "separated by commas" PS4, Line 105: TBLPROPERTIES should say which table property to set PS4, Line 147: logically I'd say these are physical PS4, Line 155: Kudu tablet servers no, just the number of replicas of a given table must be odd. You could have 4 tablet servers, but replication count 3 PS4, Line 156: When you set up a new cluster, the number of tablet servers might not : exactly match the number of DataNodes. If you add or remove one host from an existing : cluster, be careful not to change the number of tablet servers from 3 to 2, or from 99 : to 100, and so on. I'd scrap this whole <p> PS4, Line 200: cannot contain any duplicate values worth noting that it's the _tuple_ that has to be unique. EG you can have PRIMARY KEY (a, b) and have values (1,1), (1,2), (2, 1), (2,2). I don't think that's clear here. PS4, Line 460: <p> : Because primary key columns cannot contain any <codeph>NULL</codeph> values, you can : omit the <codeph>NOT NULL</codeph> clause for the primary key columns. that's true, but I think it's good practice to specify NOT NULL regardless, in case we ever add the ability for a nullable PK. Line 553: them with the default <codeph>PLAIN</codeph> encoding. I'd disagree with this -- in many cases encodings have a net speedup - there's some small CPU cost but the compression ratio can be very good. Especially if you have something like a timestamp value or transaction ID in your PK, bitshuffle will compress it quite well and have a negligible effect on performance of a key lookup, since the cost of decoding one page is nothing compared to the cost of the random disk IO, etc. PS4, Line 592: user_id</codeph> values come from a specific set of strings that's slightly unexpected. I think a better example for dictionary would be something like 'ship_status STRING" from tpch, or 'state' or 'country' PS4, Line 597: subjects might start with the same first words seems somewhat contrived. Prefix encoding isn't super useful for most cases so maybe not worth documenting. (we use it internally for storing the reified composite key index where it is quite effective) PS4, Line 601: The original text : from the <codeph>body</codeph> column is the most frequently accessed, therefore it : uses the compression with the least CPU overhead to expand. The : <codeph>spanish_translation</codeph> is accessed less frequently, therefore it uses : a compression format that produces a smaller result but takes more CPU cycles to hrm, I'm not sure I'd draw such a distinction between LZ4 and SNAPPY. In fact LZ4 has some extensions which we aren't yet using (but should be at some point) which will probably be able to get it to be more dense than snappy as well as being faster, at which point we'd stop encouraging snappy PS4, Line 636: into units known as : <term>cfiles</term>. The cfiles for each column can have a specified block size I think I'd not mention cfiles here, but say that there is an underlying unit of IO which is the "block size". PS4, Line 642: Certain internal characteristics of the data storage use the block as a unit of : measurement. For example, when a column uses dictionary encoding, each block : contains its own dictionary representing only the column values in that block. : < nope, not this block. PS4, Line 648: By default, the block size is 256 KiB. The minimum block size is... The maximum : block size is... : </p> : : <p> : A small block size is good for responsiveness during frequent DML operations. Less data is rearranged : as rows are inserted, deleted, or updated. : </p> : : <p> : A large block size is good for fast query performance for large scan operations. When large volumes : of data are retrieved by a query, it is more efficient to have a small number of decode and uncompress : operations that each apply to a large number of rows. : </p> : : <codeblock> I think it's probably better to just defer the user to the Kudu docs for block size instead of duplicating advice here, given it's a relatively advanced feature, and for most Impala cases the default will be fine. PS4, Line 751: 60 this is cluster-dependent, and based on a Kudu configuration. Would not document a max, but just recommend something like 10 partitions per server in your cluster for large tables. PS4, Line 788: -- 50 buckets for IDs beginning with a lowercase letter : -- plus 50 buckets for IDs beginning with an uppercase letter. : - might be worth a note in the text above that this is multiplicative with any hash partition clauses PS4, Line 874: <p> : When a range is removed, no data can exist in the table within that range. If some : rows do have column values within the removed range, the operation fails. : </p> I don't believe this is true (or at least it's not our intention that it is) PS4, Line 884: <p> : Kudu tables can also use a combination of hash and range partitioning. : </p> fill this out? PS4, Line 908: <p> : To see the distribution of data in a Kudu table across the underlying buckets and : partitions, use the <codeph>SHOW TABLE STATS</codeph> statement. : </p> I don't think this will show per-tablet sizes currently PS4, Line 977: <p> : The database name on the Impala side is or isn't encoded into the underlying Kudu : table name. : </p> not true anymore. now it's my_db::my_table PS4, Line 1007: Because Kudu manages its own storage layer is optimized for smaller block sizes than : HDFS, and performs its own housekeeping to keep data evenly distributed, it is not : some missing words in this sentence http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_refresh.xml File docs/topics/impala_refresh.xml: Line 337: <!-- Anything to say for REFRESH with Kudu? --> AFAIK no http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_show.xml File docs/topics/impala_show.xml: Line 1240: <codeblock rev="1.4.0">[localhost:21000] > show partitions census; is this example supposed to be here? http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml File docs/topics/impala_tables.xml: PS4, Line 308: <codeph># Rows</codeph>, <codeph>Start Key</codeph>, <codeph>Stop Key</codeph>, <codeph>Leader Replica</codeph>, and <codeph># Replicas</codeph>. : worth noting that #rows is inaccurate -- To view, visit http://gerrit.cloudera.org:8080/5649 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: John Russell <[email protected]> Gerrit-Reviewer: Ambreen Kazi <[email protected]> Gerrit-Reviewer: Dimitris Tsirogiannis <[email protected]> Gerrit-Reviewer: Jean-Daniel Cryans <[email protected]> Gerrit-Reviewer: John Russell <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
