[ https://issues.apache.org/jira/browse/PIG-5135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024218#comment-16024218 ]
liyunzhang_intel commented on PIG-5135: --------------------------------------- [~szita]: bq.I've checked this, it seems that assertEquals(30, inputStats.get(0).getBytes()); is fine, but assertEquals(18, inputStats.get(1).getBytes()); is not true, Spark returns -1 here. The plan generated for spark consists of 4 jobs, last one being the responsible for replicated join. This latter does 3 loads, and thus SparkPigStats handle this as -1. (Even after adding together all the bytes from all load ops in this job I got different result than 18.) I guess compression is also at work here on the tmp file part generation that further alters the number of bytes being read. org.apache.pig.test.TestPigRunner#simpleMultiQueryTest3 {code} #-------------------------------------------------- # Spark Plan #-------------------------------------------------- Spark node scope-53 Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-54 | |---A: New For Each(false,false,false)[bag] - scope-10 | | | Cast[int] - scope-2 | | | |---Project[bytearray][0] - scope-1 | | | Cast[int] - scope-5 | | | |---Project[bytearray][1] - scope-4 | | | Cast[int] - scope-8 | | | |---Project[bytearray][2] - scope-7 | |---A: Load(hdfs://localhost:58892/user/root/input:org.apache.pig.builtin.PigStorage) - scope-0-------- Spark node scope-55 Store(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-56 | |---C: Filter[bag] - scope-14 | | | Less Than or Equal[boolean] - scope-17 | | | |---Project[int][1] - scope-15 | | | |---Constant(5) - scope-16 | |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-10-------- Spark node scope-57 C: Store(hdfs://localhost:58892/user/root/output:org.apache.pig.builtin.PigStorage) - scope-21 | |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-14-------- Spark node scope-65 D: Store(hdfs://localhost:58892/user/root/output2:org.apache.pig.builtin.PigStorage) - scope-52 | |---D: FRJoinSpark[tuple] - scope-44 | | | Project[int][0] - scope-41 | | | Project[int][0] - scope-42 | | | Project[int][0] - scope-43 | |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp-546700946:org.apache.pig.impl.io.InterStorage) - scope-58 | |---BroadcastSpark - scope-63 | | | |---B: Filter[bag] - scope-26 | | | | | Equal To[boolean] - scope-29 | | | | | |---Project[int][0] - scope-27 | | | | | |---Constant(3) - scope-28 | | | |---Load(hdfs://localhost:58892/tmp/temp-1660154197/tmp1818797386:org.apache.pig.impl.io.InterStorage) - scope-60 | |---BroadcastSpark - scope-64 | |---A1: New For Each(false,false,false)[bag] - scope-40 | | | Cast[int] - scope-32 | | | |---Project[bytearray][0] - scope-31 | | | Cast[int] - scope-35 | | | |---Project[bytearray][1] - scope-34 | | | Cast[int] - scope-38 | | | |---Project[bytearray][2] - scope-37 | |---A1: Load(hdfs://localhost:58892/user/root/input2:org.apache.pig.builtin.PigStorage) - scope-30-------- {code} assertEquals(30, inputStats.get(0).getBytes()) is correct in spark mode, assertEquals(18, inputStats.get(1).getBytes()) is wrong in spark mode as the there are 3 loads in {{Spark node scope-65}}. [{{stats.get("BytesRead")}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L93] returns 49( guess this is the sum of three loads({{input2}},{{tmp1818797386}},{{tmp-546700946}}). But current [{{bytesRead}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L91] is -1 because [{{singleInput}}|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L92] is false. Let's modify the code like {code} // Since Tez does has only one load per job its values are correct // the result of inputStats in spark mode is also correct if (!Util.isMapredExecType(cluster.getExecType())) { assertEquals(30, inputStats.get(0).getBytes()); } //TODO PIG-5240:Fix TestPigRunner#simpleMultiQueryTest3 in spark mode for wrong inputStats if (!Util.isMapredExecType(cluster.getExecType()) && !Util.isSparkExecType(cluster.getExecType())) { assertEquals(18, inputStats.get(1).getBytes()); } {code} > HDFS bytes read stats are always 0 in Spark mode > ------------------------------------------------ > > Key: PIG-5135 > URL: https://issues.apache.org/jira/browse/PIG-5135 > Project: Pig > Issue Type: Bug > Components: spark > Reporter: liyunzhang_intel > Assignee: Adam Szita > Fix For: spark-branch > > Attachments: PIG-5135.0.patch, PIG-5135.1.patch, PIG-5135.2.patch > > > I discovered this while running TestOrcStoragePushdown unit test in Spark > mode where the test depends on the value of this stat. -- This message was sent by Atlassian JIRA (v6.3.15#6346)