[
https://issues.apache.org/jira/browse/PIG-3956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cheolsoo Park updated PIG-3956:
-------------------------------
Fix Version/s: 0.14.0
Assignee: Cheolsoo Park
Status: Patch Available (was: Open)
> UDF profile is often misleading
> -------------------------------
>
> Key: PIG-3956
> URL: https://issues.apache.org/jira/browse/PIG-3956
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-3956-1.patch
>
>
> Pig provides UDF profiling via the {{pig.udf.profile}} perperty. How it works
> is that it samples every 100th UDF invocation and increments a counter by 100
> x elapsed time.
> {code}
> private final static int TIMING_FREQ = 100;
> ...
> PigStatusReporter.getInstance().incrCounter(counterGroup, TIMING_COUNTER,
> Math.round((System.nanoTime() - startNanos) / 1000) * TIMING_FREQ);
> {code}
> The problem is that this assumes every invocation takes about the same time,
> which doesn't hold true oftentimes.
> Here is an example. TIMING_FREQ = 100.
> {code}
> set pig.udf.profile true;
> set opt.fetch false;
> l = LOAD 'foo' AS (value:chararray);
> f = FOREACH l GENERATE LOWER(value);
> DUMP f;
> {code}
> This job gives me the following counters-
> {code}
> org.apache.pig.builtin.LOWER - approx_invocations: 6,144,600
> org.apache.pig.builtin.LOWER - approx_microsecs: 130,913,800
> {code}
> However, if I change TIMING_FREQ to 1, I get the followings-
> {code}
> org.apache.pig.builtin.LOWER - approx_invocations: 6,144,248
> org.apache.pig.builtin.LOWER - approx_microsecs: 19,711,842
> {code}
> As can be seen, approx_microsecs is 6x larger when sampling every 100th
> invocation as compared to when sampling every invocation.
--
This message was sent by Atlassian JIRA
(v6.2#6252)