[ 
https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024218#comment-16024218
 ] 

liyunzhang_intel commented on PIG-5135:
---------------------------------------

[~szita]:
bq.I've checked this, it seems that assertEquals(30, 
inputStats.get(0).getBytes()); is fine, but assertEquals(18, 
inputStats.get(1).getBytes()); is not true, Spark returns -1 here. The plan 
generated for spark consists of 4 jobs, last one being the responsible for 
replicated join. This latter does 3 loads, and thus SparkPigStats handle this 
as -1. (Even after adding together all the bytes from all load ops in this job 
I got different result than 18.) I guess compression is also at work here on 
the tmp file part generation that further alters the number of bytes being read.
org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3
{code}
#--------------------------------------------------
# Spark Plan                                  
#--------------------------------------------------

Spark node scope-53
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-54
|
|---A: New For Each(false,false,false)[bag] - scope-10
    |   |
    |   Cast[int] - scope-2
    |   |
    |   |---Project[bytearray][0] - scope-1
    |   |
    |   Cast[int] - scope-5
    |   |
    |   |---Project[bytearray][1] - scope-4
    |   |
    |   Cast[int] - scope-8
    |   |
    |   |---Project[bytearray][2] - scope-7
    |
    |---A: 
Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) 
- scope-0--------

Spark node scope-55
Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-56
|
|---C: Filter[bag] - scope-14
    |   |
    |   Less Than or Equal[boolean] - scope-17
    |   |
    |   |---Project[int][1] - scope-15
    |   |
    |   |---Constant(5) - scope-16
    |
    
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-10--------

Spark node scope-57
C: 
Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage)
 - scope-21
|
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-14--------

Spark node scope-65
D: 
Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage)
 - scope-52
|
|---D: FRJoinSpark[tuple] - scope-44
    |   |
    |   Project[int][0] - scope-41
    |   |
    |   Project[int][0] - scope-42
    |   |
    |   Project[int][0] - scope-43
    |
    
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage)
 - scope-58
    |
    |---BroadcastSpark - scope-63
    |   |
    |   |---B: Filter[bag] - scope-26
    |       |   |
    |       |   Equal To[boolean] - scope-29
    |       |   |
    |       |   |---Project[int][0] - scope-27
    |       |   |
    |       |   |---Constant(3) - scope-28
    |       |
    |       
|---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage)
 - scope-60
    |
    |---BroadcastSpark - scope-64
        |
        |---A1: New For Each(false,false,false)[bag] - scope-40
            |   |
            |   Cast[int] - scope-32
            |   |
            |   |---Project[bytearray][0] - scope-31
            |   |
            |   Cast[int] - scope-35
            |   |
            |   |---Project[bytearray][1] - scope-34
            |   |
            |   Cast[int] - scope-38
            |   |
            |   |---Project[bytearray][2] - scope-37
            |
            |---A1: 
Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) 
- scope-30--------
{code}
 assertEquals(30, inputStats.get(0).getBytes()) is correct in spark mode,
 assertEquals(18, inputStats.get(1).getBytes()) is wrong in spark mode as the 
there are 3 loads in {{Spark node scope-65}}.  
[{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93]
 returns 49( guess this is the sum of 
three loads({{input2}},{{tmp1818797386}},{{tmp-546700946}}). But current 
[{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91]
 is -1 because 
[{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92]
 is false.


Let's modify the code like
{code}

      // Since Tez does has only one load per job its values are correct
            // the result of inputStats in spark mode is also correct
          if (!Util.isMapredExecType(cluster.getExecType())) {
            assertEquals(30, inputStats.get(0).getBytes());
          }

          //TODO PIG-5240:Fix TestPigRunner#simpleMultiQueryTest3 in spark mode 
for wrong inputStats
          if (!Util.isMapredExecType(cluster.getExecType()) && 
!Util.isSparkExecType(cluster.getExecType())) {
            assertEquals(18, inputStats.get(1).getBytes());
          }
{code}


> HDFS bytes read stats are always 0 in Spark mode
> ------------------------------------------------
>
>                 Key: PIG-5135
>                 URL: https://issues.apache.org/jira/browse/PIG-5135
>             Project: Pig
>          Issue Type: Bug
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Adam Szita
>             Fix For: spark-branch
>
>         Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch
>
>
> I discovered this while running TestOrcStoragePushdown unit test in Spark 
> mode where the test depends on the value of this stat.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to