Hi Apexers,

My use case needs to read records line by line from large HDFS files in parallel.
Looking at the source code for ReaderContext.LineReaderContext in Malhar and the related documentation at http://docs.datatorrent.com/operators/block_reader/, it seems to be a close match to my requirement. A few questions regarding it:

1. To avoid hitting the disk frequently, should I set the LineReaderContext.bufferSize property to some high value like 32 MB or 128 MB? (A rough sketch of how I intend to configure this is below my signature.)

2. I saw the following code comment in LineReaderContext.readEntity():

//Implemented a buffered reader instead of using java's BufferedReader because it was reading much ahead of block boundary
//and faced issues with duplicate records. Controlling the buffer size didn't help either.

May I know which buffer sizes were tried, and what issue was faced with duplicate records because of reading ahead of the block boundary?

---

Similarly, my other use case needs to read fixed-length records from large HDFS files in parallel. I found ReaderContext.FixedBytesReaderContext in Malhar, which relates to this requirement. As I understand it, the FixedBytesReaderContext.length property controls the entity length, but there is no direct control over the buffer size in this case. Questions regarding this:

1. Does it mean that FixedBytesReaderContext (as per its current implementation) will invoke a separate read call for each entity?

2. Is there any scope for performance improvement by reading a large chunk of data in a single read and then splitting it into multiple records (entities)? (The second sketch below my signature shows what I am planning here.)

~ Yogi
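
P.S. To make question 1 of the line-reader part concrete, this is roughly what I have in mind. It is only a sketch: the class name LargeBufferLineReader is mine, I picked FSSliceReader as the base simply because it is a concrete block reader, and I am going by my reading of the source that LineReaderContext exposes setBufferSize() and that the reader's readerContext field is protected. Please correct me if this is not the intended way to wire it up.

import org.apache.hadoop.fs.FSDataInputStream;

import com.datatorrent.lib.io.block.FSSliceReader;
import com.datatorrent.lib.io.block.ReaderContext;

/**
 * Sketch only: a block reader whose reader context is a LineReaderContext
 * with a much larger buffer, so each underlying read pulls a bigger chunk
 * from HDFS instead of hitting the stream frequently.
 */
public class LargeBufferLineReader extends FSSliceReader
{
  public LargeBufferLineReader()
  {
    // Swap the default reader context for a line reader context with a
    // 32 MB buffer (the value I am asking about in question 1).
    ReaderContext.LineReaderContext<FSDataInputStream> lineContext =
        new ReaderContext.LineReaderContext<FSDataInputStream>();
    lineContext.setBufferSize(32 * 1024 * 1024);
    this.readerContext = lineContext;
  }
}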
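
P.P.S. For the fixed-length use case, this is the configuration I am planning. Again only a sketch: FixedLengthRecordReader is my own name, 100 bytes is a made-up record length, and I am assuming from the source that FixedBytesReaderContext.setLength() is the right knob.

import org.apache.hadoop.fs.FSDataInputStream;

import com.datatorrent.lib.io.block.FSSliceReader;
import com.datatorrent.lib.io.block.ReaderContext;

/**
 * Sketch only: FSSliceReader already uses FixedBytesReaderContext, so the
 * only thing changed here is the record length. My concern (questions above)
 * is whether every entity then becomes a separate 100-byte read on the stream.
 */
public class FixedLengthRecordReader extends FSSliceReader
{
  public FixedLengthRecordReader()
  {
    ReaderContext.FixedBytesReaderContext<FSDataInputStream> fixedContext =
        new ReaderContext.FixedBytesReaderContext<FSDataInputStream>();
    fixedContext.setLength(100); // hypothetical fixed record length in bytes
    this.readerContext = fixedContext;
  }
}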
