Agree with Nick's points, but let me add one suggestion: a
tunable/configurable sampling rate. In many cases, sampling e.g. 1% of
queries gives sufficient coverage while pruning 99% of the actual load
from the query log.
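As a rough sketch of what a client-side sampling decision might look like (the class and the `phoenix.log.samplerate` property name here are hypothetical illustrations, not actual Phoenix APIs):

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of per-query sampling for the query log.
// A rate of 0.01 would log roughly 1% of queries.
public class QueryLogSampler {
    private final double sampleRate;

    public QueryLogSampler(double sampleRate) {
        // e.g. read from a config property such as "phoenix.log.samplerate"
        // (illustrative name only)
        this.sampleRate = sampleRate;
    }

    /** Decide, per query, whether it should be written to the query log. */
    public boolean shouldLog() {
        // nextDouble() returns a value in [0.0, 1.0), so a rate of 1.0
        // logs everything and 0.0 logs nothing
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }
}
```

The key property is that the decision is made once per query on the client, so the pruned 99% of queries never incur any serialization cost at all.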

Let me also underline the compliance angle. If everything is always
logged, the query log itself will require very strong controls, in which
case it is important that it integrates well with access control features
so it can be locked down. Better still is what Nick suggests: the ability
to turn off logging of things like the values supplied for bound
parameters.
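To make that concrete, here is a minimal sketch of what redacting bound parameter values before serialization might look like (class and placeholder names are hypothetical, not part of Phoenix):

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: when parameter logging is disabled, bound values
// are replaced with a placeholder before the row is written to the log.
public class BoundParamRedactor {
    private final boolean logParamValues;

    public BoundParamRedactor(boolean logParamValues) {
        this.logParamValues = logParamValues;
    }

    /** Render bound parameters for the query log, redacting if disabled. */
    public List<String> forLog(List<Object> boundParams) {
        return boundParams.stream()
                .map(p -> logParamValues ? String.valueOf(p) : "<redacted>")
                .collect(Collectors.toList());
    }
}
```

With this shape, the SQL text (with `?` placeholders) can still be logged for analysis while the sensitive values themselves never leave the client.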



On Fri, Mar 2, 2018 at 8:41 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:

> I'm a big fan of this idea. There was a brief discussion on the topic over
> on PHOENIX-2715.
>
> My first concern is that the collected information is huge -- easily far
> larger than the user data for a busy cluster. For instance, with a couple
> tens of GB of stored user data, the guidepost width at its default of
> 100 MB, and salting enabled on a table with an "innocent" value of 10 or
> 20, the collection of RPCs can easily grow into the hundreds for simple
> queries. Even if you catalog just the "logical" RPCs -- the HBase Client
> API calls that Phoenix plans, rather than the underlying HBase Client
> RPCs -- this will be quite large. The guideposts themselves for such a
> table would be on the order of 30 MB.
>
> My next concern is about the sensitive query parameters being stored. It's
> entirely reasonable to expect a table to store sensitive information that
> should not be exposed to operations.
>
> Thus, my suggestions:
> * minimize the unbounded nature of this table by truncating all columns
> to some max length -- perhaps 5k or 10k.
> * enable a default TTL on the schema; 7 days seems like a good starting
> point.
> * consider controlling which columns are populated via some operational
> mechanism, using logger levels as a model, with INFO as the default
> setting. Which data is stored at each level? Maybe timestamp, SQL, and
> explain plan at INFO; DEBUG adds bound parameters and scan metrics;
> TRACE adds RPCs, timing, and snapshot metadata.
>
> Thanks,
> Nick
>
> On Mon, Feb 26, 2018 at 1:57 PM, Josh Elser <els...@apache.org> wrote:
>
> > Hiya,
> >
> > I wanted to share this little design doc with you about some feature work
> > we've been thinking about. The following is a Google doc in which anyone
> > should be allowed to comment. Feel free to comment there, or here on the
> > thread.
> >
> > https://s.apache.org/phoenix-query-log
> >
> > The high-level goal is to create a construct in which Phoenix clients
> > will automatically serialize information about the queries they run to
> > a table for retrospective analysis. Ideally, this information would be
> > stored in a Phoenix table. We want this data to help answer questions
> > like:
> >
> > * What queries are running against my system?
> > * What specific queries started between 5:35 AM and 6:20 AM two days
> > ago?
> > * What queries is user "bob" running?
> > * Are my users' queries effectively using the indexes in the system?
> >
> > Anti-goals include:
> >
> > * Cluster impact (computation/memory) usage of a query
> > * Query performance may be slowed to ensure all data is serialized
> > * A third-party service dedicated to ensuring query info is serialized
> > (in the event of client failure)
> >
> > Take a look at the document and let us know what you think please. I'm
> > happy to try to explain this in greater detail.
> >
> > - Josh (on behalf of myself, Ankit, Rajeshbabu, and Sergey)
> >
>



-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
