Hello everyone, I am new to Hadoop and need some help. I saw on the Hadoop wiki that there is room for projects that improve Hadoop and MapReduce performance on the available benchmarks (sort, etc.).
In a distributed file system, client-side caching can be used. In such systems, whenever a file is accessed, the client has to validate the copy in its local cache against the server's file system. Instead of waiting idle for the server to answer this query, the client can start executing the requested operations on the data it already has in the cache. If the server responds that the client holds the most recently modified version of the file, the client simply continues processing. Otherwise it rolls back to its previous state and starts again with the newer version of the file. Whenever the cached copy is still current, this saves time and CPU cycles that would otherwise be spent waiting.

I think this could be applied to Hadoop as well. Say we are sorting a file with MapReduce. The client asks the server for the modification time of the file and, in the meantime, starts sorting the copy it has in the cache. When the server responds, the client compares the modification time against its cached copy and proceeds accordingly.

Could anyone please discuss whether this can be done in Hadoop? Is it already implemented, or is anyone else working on it? If this is not the right place for this discussion, could you direct me to a better one?

Thank you,
Shruti
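
P.S. To make the idea concrete, here is a rough Python sketch of what I have in mind. It is not Hadoop code and does not use any Hadoop API: `Server`, `FileVersion`, and `speculative_sort` are names I made up for illustration, and I am assuming that comparing modification times is enough to tell whether the cached copy is current.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class FileVersion:
    lines: list   # file contents, one entry per line
    mtime: int    # modification time, e.g. seconds since the epoch


class Server:
    """Hypothetical stand-in for the file system holding the authoritative copy."""

    def __init__(self, current: FileVersion):
        self.current = current

    def get_mtime(self) -> int:
        return self.current.mtime

    def fetch(self) -> FileVersion:
        return self.current


def speculative_sort(server: Server, cached: FileVersion):
    """Sort the cached copy while the server is asked whether it is current.

    Returns (sorted_lines, cache_was_valid).
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Ask for the modification time in the background...
        pending_mtime = pool.submit(server.get_mtime)
        # ...and sort the cached copy instead of waiting for the answer.
        speculative = sorted(cached.lines)
        server_mtime = pending_mtime.result()

    if server_mtime == cached.mtime:
        # Cache was current: the speculative result can be used as is.
        return speculative, True

    # Cache was stale: discard the speculative result and redo the work
    # on the newer version of the file.
    fresh = server.fetch()
    return sorted(fresh.lines), False
```

With a current cache the call returns the speculative result directly. With a stale cache the speculative sort is wasted work, so the approach only pays off when the cache is valid most of the time.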