[
https://issues.apache.org/jira/browse/HTRACE-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295831#comment-14295831
]
Colin Patrick McCabe commented on HTRACE-92:
--------------------------------------------
I always assumed that we would implement a special bit in the HBase / HDFS
request protobuf that hard-disables tracing, even if the sampler says it should
trace. Then we would set that bit for requests generated by trace sinks
themselves. I don't think we actually got around to doing this in HDFS... I've
been working more on the htraced trace sink lately, which doesn't have
this "recursion" issue since it is a separate system.
Did you guys get a chance to check the code for {{TraceScope}} leaks? I am
really curious what the root cause of this issue is.
> Thread local storing the currentSpan is never cleared
> -----------------------------------------------------
>
> Key: HTRACE-92
> URL: https://issues.apache.org/jira/browse/HTRACE-92
> Project: HTrace
> Issue Type: Bug
> Affects Versions: 3.1.0
> Reporter: Samarth Jain
> Attachments: Screen Shot 2015-01-27 at 11.20.36 PM.png
>
>
> In Apache Phoenix, we use HTrace to provide request-level trace information.
> The trace information (traceid, parentid, spanid, description), among other
> columns, is stored in a Phoenix table.
> Spans that are MilliSpans get persisted to the Phoenix table this way:
> MilliSpans -> PhoenixTraceMetricsSource -> PhoenixMetricsSink -> Phoenix Table
> While inserting the traces into the Phoenix table, we make sure that the
> upsert happening in the sink goes through a connection that doesn't have
> tracing on. This way, any spans created by executing these upsert statements
> on the client side are NullSpans.
> On the server side too, when these batched-up upsert statements are executed
> as batchMutate operations, they are not expected to have tracing on, i.e. the
> current spans should be null. However, we noticed that these current spans
> are not null, which results in an infinite loop. An example of such an
> infinite loop is the following:
> batchMutate -> FSHLog.append -> check tracing on (current span is not null)
> -> yes -> create MilliSpan -> do operation -> stop span -> publish to metrics
> source -> Phoenix metrics sink -> upsert statement -> batchMutate ->
> FSHLog.append ...
> My local cluster in fact dies because of this infinite loop!
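> To make the tracing check in that loop concrete, this is roughly the pattern
> on the append path (a sketch against the htrace 3.1.0 API, not the actual
> FSHLog code):
> {code}
> import org.apache.htrace.Trace;
> import org.apache.htrace.TraceScope;
>
> public class AppendSketch {
>   // If the thread-local current span is non-null, Trace.isTracing() returns
>   // true and a new span is created, which is later handed to the metrics
>   // source / Phoenix sink. A stale, already-closed span left behind on a
>   // pooled thread still passes this check, so the sink's own upsert
>   // re-enters the same path.
>   public void append(Runnable edit) {
>     TraceScope scope = null;
>     if (Trace.isTracing()) {                     // current span is not null
>       scope = Trace.startSpan("FSHLog.append");  // create span
>     }
>     try {
>       edit.run();                                // do operation
>     } finally {
>       if (scope != null) {
>         scope.close();                           // stop span -> publish
>       }
>     }
>   }
> }
> {code}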
> On examining the thread locals of the threads in the RPC thread pool, I saw
> that there were threads whose current spans had been closed at least an hour
> earlier. See the attached screenshot.
> The screenshot was taken at 11:20 PM and the thread had a current span whose
> stop time was 10:17 PM.
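> The same condition can be spotted from code; a rough debug sketch, assuming
> the 3.1.0 {{Span}} API ({{isRunning()}}, {{getStopTimeMillis()}},
> {{getDescription()}}):
> {code}
> import org.apache.htrace.Span;
> import org.apache.htrace.Trace;
>
> public class StaleSpanCheck {
>   /** Logs a warning when the calling thread still carries a span that has
>    *  already been stopped, like the one in the attached screenshot. */
>   public static void warnIfStale() {
>     Span current = Trace.currentSpan();
>     if (current != null && !current.isRunning()) {
>       System.err.println(Thread.currentThread().getName()
>           + " still holds closed span \"" + current.getDescription()
>           + "\" stopped at " + current.getStopTimeMillis());
>     }
>   }
> }
> {code}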
> This brought up a couple of design issues/limitations in the HTrace API:
> 1) There is no good way to set (reset) the value of the thread-local current
> span to null. This is a huge issue, especially if we are reusing threads from
> a thread pool (see the sketch after the code block below).
> 2) Should we allow creating spans if the parent span is not running anymore,
> i.e. Span.isRunning() is false? In the example shown in the screenshot, the
> current span stored in the thread local is already closed.
> Essentially making isTracing() something like:
> {code}
> boolean isTracing() {
>   return currentSpan.get() != null && currentSpan.get().isRunning();
> }
> {code}
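> For issue 1, what we are really asking for is something like the sketch
> below; {{clearCurrentSpan()}} is hypothetical (no such method exists in the
> 3.1.0 API), it just shows where a reset hook would be called:
> {code}
> public class SpanLeakGuard {
>   /** Hypothetical operation: reset the thread-local current span to null.
>    *  Nothing like this exists in HTrace 3.1.0; the body is a placeholder. */
>   static void clearCurrentSpan() {
>     // would null out the Tracer's thread-local current span here
>   }
>
>   /** Wrap a pooled task so a leaked span cannot survive into the next task
>    *  that runs on the same thread. */
>   public static Runnable wrap(final Runnable task) {
>     return new Runnable() {
>       @Override
>       public void run() {
>         try {
>           task.run();
>         } finally {
>           clearCurrentSpan();
>         }
>       }
>     };
>   }
> }
> {code}
> A thread pool would then submit {{SpanLeakGuard.wrap(task)}} instead of
> {{task}}.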
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)