[ https://issues.apache.org/jira/browse/PHOENIX-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823287#comment-15823287 ]
Josh Elser commented on PHOENIX-3218:
-------------------------------------

{noformat}
especially as it affects the underlying HBase row keys
{noformat}

(look-back) "Apache HBase" for the first reference, please.

{noformat}
If you do lots of random gets, make sure you use SSDs, or that your working set fits into RAM (either OS cache or the HBase block cache), or performance will be truly terrible
{noformat}

I feel like this is one of those statements that will end up causing a lot of "fud" to undo. HBase is *still* performing (essentially) {{log(n)}} lookups to find the key, which is "fast". I feel like this would just promote the impression that Phoenix doesn't work well for random reads without SSDs or high-memory systems, which is wrong. Could you re-word this as a recommendation that random-read workloads benefit from fast disks/large block caches more than sequential-read workloads do?

{noformat}
* Use multiple indexes to provide fast access to common queries.
* Create global indexes. This will affect write speed depending on the number of columns included in an index because each index writes to its own separate table.
{noformat}

Switch the ordering here to make more sense. Point 1 is that indexes are good and that you should use them. Point 2 is that sometimes having multiple indexes is a good thing.

{noformat}
* When specifying machines for HBase, do not skimp on cores; HBase needs them.
{noformat}

How can this be made into a more concrete recommendation? Do you have any recommendations to make WRT types of disk and amount of memory available?

{noformat}
* Create additional indexes to support common query patterns, including all fields that need to be retrieved.
{noformat}

A bit duplicative of the above section. Perhaps reword this to be more focused on ensuring indexes exist for columns that are heavily accessed but don't exist solely in the primary key constraint? That isn't entirely correct either, but maybe closer..
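On the index bullets above, it might help the guide to show what these look like in practice. A minimal sketch of creating multiple global indexes to serve different query patterns; the {{EVENTS}} table and its columns are hypothetical names for illustration, not from the guide:

{code:sql}
-- Global index on a frequently-filtered column;
-- maintained by Phoenix in its own separate table.
CREATE INDEX events_host_idx ON events (host);

-- A second index supporting a different common query pattern.
CREATE INDEX events_date_idx ON events (event_date);
{code}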
{noformat}
if a region server goes down
{noformat}

Recommend: s/goes down/fails/

{noformat}
Set the `UPDATE_CACHE_FREQUENCY` [option](http://phoenix.apache.org/language/index.html#options) to 15 minutes or so if your metadata doesn't change very often
{noformat}

Don't guess, make a concrete recommendation. If 15 minutes isn't a good recommendation, let's come up with a good number. Should "metadata" be "table schema"? Does it also include table properties (such as immutable_rows)?

{noformat}
On AWS, you'll need to manually start the job
{noformat}

Why the mention of "On AWS"? This is the same for an on-prem cluster with async indexes, no?

{noformat}
facilitates skip-scanning
{noformat}

Link to the docs on skip-scans.

{noformat}
For example, if you need indexes to stay in sync with data tables even if machines go down and writes fail, then you should consider your data transactional
{noformat}

This is misleading, as Phoenix maintains referential integrity in the face of RS failure. A better use case for transactions would be cross-row updates to a data table.

{noformat}
* Schema Design
* Indexes
* Explain Plans and Hints
* Queries
{noformat}

Links to the documentation?

{noformat}
Each row has a key, a byte-array by which rows in HBase are sorted to make queries faster. All table accesses are via the row key (the table's primary key)
{noformat}

It would be better to be very explicit in the use of terminology to avoid confusion. e.g. "An HBase row is a collection of many key-value pairs in which the rowkey attribute of the keys are equal. Data in an HBase table is sorted by the rowkey." Also s/row key/rowkey/.

{noformat}
If some columns are accessed more frequently than others, use column families to separate the frequently-accessed columns from rarely-accessed columns. This improves performance because HBase reads only the column families specified in the query.
{noformat}

Link to the docs on how to do this.
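On the column-family point above, the guide could also include a short DDL sketch alongside the doc link. Phoenix assigns a column to a family by prefixing the column name with the family name; the table and column names here are hypothetical, for illustration only:

{code:sql}
-- Frequently-accessed columns in family A, rarely-accessed ones in family B,
-- so point lookups on the hot columns read only family A.
CREATE TABLE metrics (
    host         VARCHAR NOT NULL,
    a.last_value DOUBLE,   -- hot column family
    b.raw_payload VARCHAR, -- cold column family
    CONSTRAINT pk PRIMARY KEY (host)
);
{code}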
{noformat}
stores a copy of some or all of the data in the main table
{noformat}

Stores a *pivoted* copy.

{noformat}
See also: https://phoenix.apache.org/secondary_indexing.html
{noformat}

Linkify.

{noformat}
don't require you to change your queries at all—they just make them run faster
{noformat}

Suggest: "don't require changes to existing queries -- queries simply run faster"

{noformat}
The sweet spot is generally a handful of secondary indexes
{noformat}

Suggest removing the colloquialism; it doesn't match the rest of the tone of the document.

{noformat}
Depending on your needs, consider creating *covered* indexes or *functional* indexes, or both.
{noformat}

Link to docs on covered and functional indexes, please.

{noformat}
If you regularly scan large data sets from spinning disk, you're best off with GZIP (but watch write speed)
{noformat}

Numbers/reference material to back this up?

{noformat}
For Gets it is quite important to have your data set cached, and you should use the HBase block cache.
{noformat}

Flip-flopping between "scans"/"gets" and "range queries"/"point lookups". I think using the latter terminology universally would be better.

{noformat}
When using `UPSERT` to write a large number of records, turn off autocommit and batch records. Start with a batch size of 1000 and adjust as needed. Here's some pseudocode showing one way to commit records in batches:
{noformat}

Recommend putting a caveat here that Phoenix's use of {{commit()}} to control batches of data written to HBase is "non-standard" in terms of JDBC. The {{executeBatch()}} API calls would be the standard way to batch updates to the database for other JDBC drivers. Would recommend that we at least acknowledge that Phoenix is doing it "differently".

{noformat}
Otherwise, replication triples the cost of each write
{noformat}

This is inaccurate, misleading at best.
Whether the data is written to a secondary index or a local index, the underlying data is *still* stored at the configured HDFS replication rate (3x by default). The performance gain is that the RegionServer is not updating another Region for the data-table update.

{noformat}
When deleting a large data set, turn on autoCommit before issuing the `DELETE` query so that the client does not need to remember the row keys of all the keys as they are deleted.
{noformat}

The reasoning behind this one isn't clear to me. Batching DELETEs would have the same benefit as batching UPSERTs, no? (I may just be missing an implementation detail here..)

The explain section is *fantastic*. Great job there.

Overall, this is a very nice write-up you've put together, [~pconrad]! I think with a little bit of tweaking, this will be an often-referenced document.

> First draft of Phoenix Tuning Guide
> -----------------------------------
>
>                 Key: PHOENIX-3218
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3218
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Peter Conrad
>         Attachments: Phoenix-Tuning-Guide-20170110.md, Phoenix-Tuning-Guide.md, Phoenix-Tuning-Guide.md
>
>
> Here's a first draft of a Tuning Guide for Phoenix performance.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)