Classes implementing InputFormat implement public List<InputSplit> getSplits(JobContext job), which returns a List of InputSplits. For FileInputFormat, the splits have a path, a start, and an end.
1) When is this method called, in which JVM, and on which machine? Is it called only once?
2) Does the number of map tasks correspond to the number of splits returned by getSplits?
3) InputFormat also declares a method RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context). Is this executed within the JVM of the Mapper on the slave machine, and does the RecordReader run within that JVM?
4) The default RecordReaders read a file from the start position to the end position, emitting values in the order read. With such a reader, assuming it is reading lines of text, is it reasonable to assume that the values the mapper receives are in the same order they were found in the file? Would it, for example, be possible for WordCount to see a word that was hyphenated at the end of one line and append the first word of the next line it sees (ignoring the case where the word is at the end of a split)?
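
To make the hyphenation case concrete, here is a minimal sketch of the kind of stateful mapper I have in mind. It is plain Java, not the Hadoop API: the class name HyphenJoinSketch and its map and finish methods are hypothetical stand-ins for a Mapper's map() and cleanup(). It only gives the right answer if lines are delivered in the same order they appear in the file, which is exactly the assumption I am asking about.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a WordCount mapper; plain Java, not the Hadoop API.
final class HyphenJoinSketch {
    // Fragment carried over from a previous line that ended in a hyphen.
    private String pending = "";
    private final List<String> words = new ArrayList<>();

    // Plays the role of map(): called once per line, in the order the reader emits them.
    void map(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return; // blank line: keep any pending fragment for the next line
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // Glue a carried-over fragment onto the first token of this line.
            String token = pending + tokens[i];
            pending = "";
            boolean lastOnLine = (i == tokens.length - 1);
            if (lastOnLine && token.endsWith("-") && token.length() > 1) {
                // Word is split across lines: hold it back without the hyphen.
                pending = token.substring(0, token.length() - 1);
            } else {
                words.add(token);
            }
        }
    }

    // Plays the role of cleanup(): flush a fragment left over at the end of the split.
    List<String> finish() {
        if (!pending.isEmpty()) {
            words.add(pending + "-");
            pending = "";
        }
        return words;
    }

    public static void main(String[] args) {
        HyphenJoinSketch mapper = new HyphenJoinSketch();
        mapper.map("a word that was hyphen-");
        mapper.map("ated at the end");
        System.out.println(mapper.finish());
    }
}
```

If the two lines reached the mapper in the opposite order, or went to different mappers, the fragment "hyphen" would be joined to the wrong token or flushed on its own, so the approach depends entirely on the ordering guarantee.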