[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638486#comment-14638486
 ] 

kexianda commented on PIG-4634:
-------------------------------

Hi [~xuefuz],

latest patch(PIG-4634_2.patch) is attached. Please help review the code. Thanks.

1. I threw the exception just following the behavior of function 
SparkJobStats.collectStats().  I agree with you, it is too harsh. Now, logging 
message instead of throwing an exception. The exception is also removed in 
SparkJobStats.collectStats().

2. From my understanding, we should only aggregate the OutputMetrics of the 
Result stages, and ignore the OutputMetrics of ShuffleMap stages. The 
JobMetricsListener just collect all the taskMetrics of the stages. We don't 
know the taskMetrics is from ShuffleMap stage or Result stage. 
There is a tricky detail: the OutputMetrics is null in the ShuffleMapTask. And 
previous patch file use this tricky detail. It is not intuitive, hard to 
maintain.
To improve the robustness, the ID of Result stages are collected in a set in 
JobMetricsListener.java. Then, in getRecordCount(), only the Result stages's 
OutputMetrics are aggregated, the OutputMetrics of ShuffleMap stages are 
ignored.

3. How to test this?
Here are the two cases for testing this:
TestPigRunner.simpleTest()
TestPigRunner.simpleTest2() 
ant -Dhadoopversion=23 -Dexectype=spark -Dtestcase=TestPigRunner test

Regards, 
Xianda

> Fix records count issues in output statistics
> ---------------------------------------------
>
>                 Key: PIG-4634
>                 URL: https://issues.apache.org/jira/browse/PIG-4634
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: kexianda
>            Assignee: kexianda
>             Fix For: spark-branch
>
>         Attachments: PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2()  in TestPigRunner failed, caused by 
> following issues:
> 1. pig context in SparkPigStats isn't initialized.
> 2. the records count logic hasn't been implemented.
> 3. getOutpugAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to