Rohini Palaniswamy commented on PIG-3891:

   - CHANGES.txt will be modified when committing. There is no need to change 
it as part of the patch.
   - Please revert the changes to ExecType and TezMiniCluster. We can't change 
public static members to package-private, as users already depend on them. 
Once PIG-4923 goes in, we can add TEZ and SPARK there. 
   - In TestMRJobStats, can you change "The returned output size is expected to 
be the same as the file size" to "The returned output size is expected to be 
sum of file sizes in the sub-directories"?
   - We try to avoid if (Tez) else (MR) conditions in tests as much as 
possible. For the testOutputStats test in TestMultiStorage, can we just do the 
following asserts with hardcoded values instead of getting the values from the 
MR and Tez counters? That way the test is more solid. Also, please add a FILTER 
statement for out2 to filter out a couple of records so that bytes and records 
are not the same as 
+        Map<String, Long> multiStoreCounters = dagStats.getMultiStoreCounters();
+        PigStats stats = job.getStatistics();
+        assertEquals(HardCodedValueHere, stats.getBytesWritten());
+        List<OutputStats> outputStats = SimplePigStats.get().getOutputStats();
+        assertEquals(2, outputStats.size()); // 2 split conditions
+        assertEquals(HardCodedValueHere, outputStats.get(0).getBytes());
+        assertEquals(HardCodedValueHere, outputStats.get(1).getBytes());
+        assertEquals(HardCodedValueHere, outputStats.get(0).getRecords());
+        assertEquals(HardCodedValueHere, outputStats.get(1).getRecords());
+        assertEquals(9L, multiStoreCounters.get("Output records in 
+        assertEquals(9L, multiStoreCounters.get("Output records in 

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -----------------------------------------------------------------------------
>                 Key: PIG-3891
>                 URL: https://issues.apache.org/jira/browse/PIG-3891
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Nandor Kollar
>         Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch
> FileBasedOutputSizeReader only includes files in the top-level output 
> directory. So if files are stored under subdirectories (e.g. with 
> MultiStorage), it does not report the bytes written correctly. 
> 0.11 shows the correct number of total bytes written, so this is a 
> regression. A quick look at the code shows that 
> JobStats.addOneOutputStats() in 0.11 also does not iterate recursively, and 
> its code is the same as in FileBasedOutputSizeReader. Need to investigate 
> where the correct value comes from in 0.11 and fix it in 0.12.1/0.13.
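
To illustrate the difference the description points at, here is a minimal 
sketch that contrasts a top-level-only size sum with a recursive one. It is not 
the Pig code and not the patch: it uses java.nio.file on a local directory as a 
stand-in for the Hadoop FileSystem API, and the class and method names 
(OutputSizeSketch, topLevelSize, recursiveSize) are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical illustration only; local files stand in for HDFS output.
final class OutputSizeSketch {
    private OutputSizeSketch() {}

    /** Sums only the regular files directly under dir (sub-directories are ignored). */
    static long topLevelSize(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.filter(Files::isRegularFile)
                           .mapToLong(OutputSizeSketch::sizeOf)
                           .sum();
        }
    }

    /** Sums every regular file under dir, descending into sub-directories. */
    static long recursiveSize(Path dir) throws IOException {
        try (Stream<Path> all = Files.walk(dir)) {
            return all.filter(Files::isRegularFile)
                      .mapToLong(OutputSizeSketch::sizeOf)
                      .sum();
        }
    }

    // Files.size throws a checked exception, so wrap it for use in a stream.
    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For an output directory whose files all live in per-key sub-directories, as 
with MultiStorage, topLevelSize would return 0 while recursiveSize would return 
the sum of the file sizes in the sub-directories.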
