[ 
https://issues.apache.org/jira/browse/PHOENIX-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823287#comment-15823287
 ] 

Josh Elser commented on PHOENIX-3218:
-------------------------------------

{noformat}
especially as it affects the underlying HBase row keys
{noformat}

(look-back) "Apache HBase" for the first reference, please.

{noformat}
If you do lots of random gets, make sure you use SSDs, or that your working set 
fits into RAM (either OS cache or the HBase block cache), or performance will 
be truly terrible
{noformat}

I feel like this is one of those statements that will end up causing a lot of 
"fud" to undo. HBase is *still* performing (essentially) {{log(n)}} lookups to 
find the key, which is "fast". I feel like this would just promote the 
impression that Phoenix doesn't work well for random reads without SSDs or 
high-memory systems, which is wrong. Could you re-word this as a 
recommendation that random-read workloads benefit more from fast disks and 
large block caches than sequential-read workloads do?

{noformat}
* Use multiple indexes to provide fast access to common queries.
* Create global indexes. This will affect write speed depending on the number 
of columns included in an index because each index writes to its own separate 
table.
{noformat}

Switch the ordering here to make more sense. Point 1 is that indexes are good 
and that you should use them. Point 2 is that sometimes having multiple indexes 
is a good thing.

{noformat}
    * When specifying machines for HBase, do not skimp on cores; HBase needs 
them.
{noformat}

How can this be made into a more concrete recommendation? Do you have any 
recommendations to make WRT types of disk and amount of memory available?

{noformat}
* Create additional indexes to support common query patterns, including all 
fields that need to be retrieved.
{noformat}

A bit duplicative of the above section. Perhaps reword this to focus on 
ensuring indexes exist for heavily-accessed columns that aren't part of the 
primary key constraint? That isn't entirely correct either, but maybe closer..

{noformat}
if a region server goes down
{noformat}

Recommend: s/goes down/fails/

{noformat}
Set the `UPDATE_CACHE_FREQUENCY` 
[option](http://phoenix.apache.org/language/index.html#options) to 15 minutes 
or so if your metadata doesn't change very often
{noformat}

Don't guess, make a concrete recommendation. If 15 minutes isn't a good 
recommendation, let's come up with a good number. Should "metadata" be "table 
schema"? Does it also include table properties (such as immutable_rows)?
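
If we do land on a number, showing the syntax concretely would help too. A 
hypothetical sketch (table and column names invented; note the value is in 
milliseconds, so 15 minutes is 900000):

```sql
-- Hypothetical example: set UPDATE_CACHE_FREQUENCY at table creation.
-- The value is in milliseconds; 900000 ms = 15 minutes.
CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts   TIMESTAMP NOT NULL,
    val  DOUBLE
    CONSTRAINT pk PRIMARY KEY (host, ts)
) UPDATE_CACHE_FREQUENCY = 900000;

-- It can also be changed on an existing table:
ALTER TABLE metrics SET UPDATE_CACHE_FREQUENCY = 900000;
```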

{noformat}
On AWS, you'll need to manually start the job
{noformat}

Why the mention of "On AWS"? This is the same for an on-prem cluster with async 
indexes, no?

{noformat}
facilitates skip-scanning
{noformat}

Link to the docs on skip-scans.
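
While linking, a one-line illustration might help readers connect the idea to 
a query shape. A hypothetical example (schema invented) where an {{IN}} on the 
leading primary-key column gives the optimizer multiple key ranges to skip 
between:

```sql
-- Hypothetical: with PRIMARY KEY (customer_id, order_date), an IN list on
-- the leading key column can be served by a skip scan over just the
-- matching key ranges instead of a full scan.
SELECT order_date, total
FROM orders
WHERE customer_id IN ('C001', 'C042', 'C197');
```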

{noformat}
For example, if you need indexes to stay in sync with data tables even if 
machines go down and writes fail, then you should consider your data 
transactional
{noformat}

This is misleading as Phoenix maintains referential integrity in the face of RS 
failure. A better use-case for transactions would be for cross-row updates to a 
data-table.
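
To make that concrete in the guide, a sketch along these lines (hypothetical 
schema) would motivate transactions better, i.e. two rows that must change 
together:

```sql
-- Hypothetical: a transfer touches two rows of the same data table and
-- must be atomic, so the table is declared transactional.
CREATE TABLE accounts (
    id      VARCHAR NOT NULL PRIMARY KEY,
    balance DECIMAL
) TRANSACTIONAL = true;

-- With autocommit off, both upserts commit together:
-- either both rows change or neither does.
UPSERT INTO accounts (id, balance) VALUES ('alice', 70.0);
UPSERT INTO accounts (id, balance) VALUES ('bob', 30.0);
```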

{noformat}
* Schema Design
* Indexes
* Explain Plans and Hints
* Queries
{noformat}

Links to the documentation?

{noformat}
Each row has a key, a byte-array by which rows in HBase are sorted to make 
queries faster. All table accesses are via the row key (the table's primary key)
{noformat}

It would be better to be very explicit in the use of terminology to avoid 
confusion. e.g. "An HBase row is a collection of many key-value pairs in which 
the rowkey attribute of the keys are equal. Data in an HBase table is sorted by 
the rowkey." Also s/row key/rowkey/.

{noformat}
If some columns are accessed more frequently than others, use column families 
to separate the frequently-accessed columns from rarely-accessed columns. This 
improves performance because HBase reads only the column families specified in 
the query.
{noformat}

Link to the docs on how to do this.
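
Along with the link, the DDL is short enough to show inline; in Phoenix a 
column family is just a prefix on the column name (names here are invented):

```sql
-- Hypothetical: frequently-read columns in family A, rarely-read audit
-- data in family B. A query touching only A.* columns never reads the
-- store files for family B.
CREATE TABLE user_profile (
    id          BIGINT NOT NULL PRIMARY KEY,
    A.name      VARCHAR,
    A.email     VARCHAR,
    B.audit_log VARCHAR
);
```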

{noformat}
stores a copy of some or all of the data in the main table
{noformat}

Stores a pivoted copy

{noformat}
See also: 
https://phoenix.apache.org/secondary_indexing.html
{noformat}

Linkify

{noformat}
don't require you to change your queries at all—they just make them run faster
{noformat}

Suggest: "don't require change to existing queries -- queries simply run faster"

{noformat}
The sweet spot is generally a handful of secondary indexes
{noformat}

Suggest, remove the colloquialism. Doesn't match the rest of the tone of the 
document.

{noformat}
Depending on your needs, consider creating *covered* indexes or *functional* 
indexes, or both.
{noformat}

Link to docs on covered and functional indexes, please
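
A quick syntax sketch for each would also anchor the terms (table/column names 
invented):

```sql
-- Hypothetical covered index: INCLUDE copies the extra columns into the
-- index table so the query never touches the data table.
CREATE INDEX idx_status ON orders (status) INCLUDE (total, order_date);

-- Hypothetical functional index: the index is kept on the result of an
-- expression rather than a raw column.
CREATE INDEX idx_upper_name ON users (UPPER(last_name));
```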

{noformat}
If you regularly scan large data sets from spinning disk, you're best off with 
GZIP (but watch write speed)
{noformat}

Numbers/reference-material to back this up?

{noformat}
For Gets it is quite important to have your data set cached, and you should use 
the HBase block cache. 
{noformat}

Flip-flopping between "scans"/"gets" and "range queries"/"point lookups". I 
think using the latter terminology universally would be better.

{noformat}
When using `UPSERT` to write a large number of records, turn off autocommit and 
batch records. Start with a batch size of 1000 and adjust as needed. Here's 
some pseudocode showing one way to commit records in batches:
{noformat}

Recommend putting a caveat here that the use of {{commit()}} by Phoenix to 
control batches of data written to HBase is "non-standard" in terms of 
JDBC. The {{executeBatch()}} API calls would be the standard way to batch 
updates to the database for other JDBC drivers. Would recommend that we at 
least acknowledge that Phoenix is doing it "differently".

{noformat}
Otherwise, replication triples the cost of each write
{noformat}

This is inaccurate, misleading at best. Whether the data is written to a 
secondary index or a local index, the underlying data is *still* stored at the 
configured HDFS replication rate (3x by default). The performance gain is that 
the RegionServer is not updating another Region for the data-table update.

{noformat}
When deleting a large data set, turn on autoCommit before issuing the `DELETE` 
query so that the client does not need to remember the row keys of all the keys 
as they are deleted.
{noformat}

Reasoning behind this one isn't clear to me. Batching DELETEs would have the 
same benefit of batching UPSERTs, no? (I may just be missing an implementation 
detail here..)

The explain section is *fantastic*. Great job there. Overall, this is a very 
nice write-up you've put together, [~pconrad]! I think with a little bit of 
tweaking, this will be an often-referenced document.

> First draft of Phoenix Tuning Guide
> -----------------------------------
>
>                 Key: PHOENIX-3218
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3218
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Peter Conrad
>         Attachments: Phoenix-Tuning-Guide-20170110.md, 
> Phoenix-Tuning-Guide.md, Phoenix-Tuning-Guide.md
>
>
> Here's a first draft of a Tuning Guide for Phoenix performance. 


