I'm working on a patch that replaces this with the Hadoop FileSystem StatisticsData, which will both give an accurate count and let us collect metrics while the task is still in progress. One hitch is that it relies on https://issues.apache.org/jira/browse/HADOOP-10688, so we still might want a fallback for versions of Hadoop that don't have this API.
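Roughly, the idea is something like this (untested sketch in Scala; getThreadStatistics()/getBytesRead() are the accessors HADOOP-10688 adds, so I look them up reflectively and just return None on older Hadoop):

    import scala.collection.JavaConverters._
    import scala.util.Try

    import org.apache.hadoop.fs.FileSystem

    // Sum bytesRead across the per-thread statistics of every registered
    // FileSystem. Returns None when the thread-level API isn't available.
    def threadBytesRead(): Option[Long] = Try {
      FileSystem.getAllStatistics.asScala.map { stats =>
        val data = stats.getClass.getMethod("getThreadStatistics").invoke(stats)
        data.getClass.getMethod("getBytesRead").invoke(data).asInstanceOf[Long]
      }.sum
    }.toOption

The task would snapshot this value when it starts and report the delta as it runs, which is what lets us update the metrics mid-task.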
On Sat, Jul 26, 2014 at 10:47 AM, Reynold Xin <r...@databricks.com> wrote:
> There is one piece of information that'd be useful to know, which is the
> source of the input. Even in the presence of an IOException, the input
> metrics still specifies the task is reading from Hadoop.
>
> However, I'm slightly confused by this -- I think usually we'd want to
> report the number of bytes read, rather than the total input size. For
> example, if there is a limit (only read the first 5 records), the actual
> number of bytes read is much smaller than the total split size.
>
> Kay, am I mis-interpreting this?
>
>
> On Sat, Jul 26, 2014 at 7:42 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Hi,
> > Starting at line 203:
> >
> >     try {
> >       /* bytesRead may not exactly equal the bytes read by a task: split
> >        * boundaries aren't always at record boundaries, so tasks may need
> >        * to read into other splits to complete a record. */
> >       inputMetrics.bytesRead = split.inputSplit.value.getLength()
> >     } catch {
> >       case e: java.io.IOException =>
> >         logWarning("Unable to get input size to set InputMetrics for task", e)
> >     }
> >     context.taskMetrics.inputMetrics = Some(inputMetrics)
> >
> > If there is IOException, context.taskMetrics.inputMetrics is set by
> > wrapping inputMetrics - as if there wasn't any error.
> >
> > I wonder if the above code should distinguish the error condition.
> >
> > Cheers
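On Ted's question about distinguishing the error condition: one option (just a sketch against the snippet quoted above, not something I've tested) would be to move the assignment of taskMetrics.inputMetrics inside the try, so an IOException leaves it unset rather than reporting a length we never obtained:

    try {
      /* bytesRead may not exactly equal the bytes read by a task: split
       * boundaries aren't always at record boundaries, so tasks may need
       * to read into other splits to complete a record. */
      inputMetrics.bytesRead = split.inputSplit.value.getLength()
      // Only attach the metrics if we actually got a size.
      context.taskMetrics.inputMetrics = Some(inputMetrics)
    } catch {
      case e: java.io.IOException =>
        logWarning("Unable to get input size to set InputMetrics for task", e)
    }

That said, the StatisticsData approach above would sidestep both issues, since it measures the bytes actually read rather than the split length.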