A couple more points which I think you alluded to, Josh, but I would still
like to call out:
1) Writing of these query logs to a phoenix table should be best effort
i.e. a query definitely shouldn't fail because we encountered an issue
while writing its log
2) Writing of query logs should happen in a manner that is async to the
flow of the query i.e. a query shouldn't incur the cost of the write
happening to the query log table

Doing 2) will help out with 1)

On Fri, Mar 2, 2018 at 2:28 PM, Josh Elser <els...@apache.org> wrote:

> Thanks Nick and Andrew! These are great points.
> * A TTL out of the box is a must. That's such a good suggestion
> * Sensitivity of data being stored is also a tricky-serious issue to
> consider. We'll want to lock the table down and be able to state very
> clearly what data may show up in it.
> * I like the "levels" of detail that will be persisted. It will help break
> up the development work (e.g. first impl can just be the INFO details), and
> prevents concern of runtime impact.
> * Sampling is a no-brainer for "always-on" situations. I like that too.
> I'll work on taking these (and others) and updating the gdoc tonight.
> Thanks again for your feedback!
> On 3/2/18 1:50 PM, Andrew Purtell wrote:
>> Agree with Nick's points but let me augment with an additional suggestion:
>> Tunable/configurable threshold for sampling. In many cases it's sufficient
>> to sample e.g. 1% of queries to get sufficient coverage and this would
>> prune 99% of actual load from the query log.
>> Also let me underline that compliance requirements will require either
>> super strong controls of the query log if everything is always logged, in
>> which case it is important that it works well with access control features
>> to lock it down; or better what Nick suggests where we can turn off things
>> like logging the values supplied for bound parameters.
>> On Fri, Mar 2, 2018 at 8:41 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>> I'm a big fan of this idea. There was a brief discussion on the topic over
>>> on PHOENIX-2715.
>>> My first concern is that the collected information is huge -- easily far
>>> larger than the user data for a busy cluster. For instance, a couple 10's
>>> of GB stored user data, guideposts set to default 100mb, enable salting
>>> on
>>> a table with an "innocent" value of 10 or 20 and the collection of RPCs
>>> can
>>> easily grow into the hundreds for simple queries. Even if you catalog
>>> just
>>> the "logical" RPC's - HBase Client API calls that Phoenix plans rather
>>> than
>>> the underlying HBase Client RPCs - this will be quite large. The
>>> guidepost
>>> themselves for such a table would be on the order of 30mb.
>>> My next concern is about the sensitive query parameters being stored.
>>> It's
>>> entirely reasonable to expect a table to store sensitive information that
>>> should not be exposed to operations.
>>> Thus, my suggestions:
>>> * minimize the unbounded nature of this table by truncating all columns
>>> to
>>> some max length -- perhaps 5k or 10k.
>>> * enable a default TTL on the schema. 7 days seems like a good starting
>>> point.
>>> * consider controlling which columns are populated via some operational
>>> mechanism. Use Logger level as an example, with INFO the default setting.
>>> Which data is stored at this level? Then at DEBUG, then TRACE. Maybe
>>> timestamp, SQL, and explain are at INFO. DEBUG adds bound parameters and
>>> scan metrics. TRACE adds RPCs and timing, snapshot metadata.
>>> Thanks,
>>> Nick
>>> On Mon, Feb 26, 2018 at 1:57 PM, Josh Elser <els...@apache.org> wrote:
>>> Hiya,
>>>> I wanted to share this little design doc with you about some feature
>>>> work
>>>> we've been thinking about. The following is a Google doc in which anyone
>>>> should be allowed to comment. Feel free to comment there, or here on the
>>>> thread.
>>>> https://s.apache.org/phoenix-query-log
>>>> The high-level goal is to create a construct in which Phoenix clients
>>> will
>>>> automatically serialize information about the queries they run to a
>>>> table
>>>> for retrospective analysis. Ideally, this information would be stored in
>>> a
>>>> Phoenix table. We want this data to help answer questions like:
>>>> * What queries are running against my system
>>>> * What specific queries started between 535AM and 620AM two days ago
>>>> * What queries are user "bob" running
>>>> * Are my user's queries effectively using the indexes in the system
>>>> Anti-goals for include:
>>>> * Cluster impact (computation/memory) usage of a query
>>>> * Query performance may be slowed to ensure all data is serialized
>>>> * A third-party service dedicated to ensuring query info is serialized
>>> (in
>>>> the event of client failure)
>>>> Take a look at the document and let us know what you think please. I'm
>>>> happy to try to explain this in greater detail.
>>>> - Josh (on behalf of myself, Ankit, Rajeshbabu, and Sergey)

Reply via email to