Hi Apexers,

My use-case needs to read records line-by-line from large HDFS files in
parallel.

Looking at the source code for ReaderContext.LineReaderContext in Malhar and
the related documentation at
http://docs.datatorrent.com/operators/block_reader/, it seems to be a close
match to my requirement.

A few questions regarding this:

1. To avoid hitting the disk frequently, should I set the
LineReaderContext.bufferSize property to some high value like 32 MB or
128 MB? (A rough sketch of how I plan to set this follows after these
questions.)

2. I saw the following code comment in LineReaderContext.readEntity():

//Implemented a buffered reader instead of using java's BufferedReader because it was reading much ahead of block boundary
//and faced issues with duplicate records. Controlling the buffer size didn't help either.

Could you share some more details on which buffer sizes were tried, and
what exactly was the issue with duplicate records caused by reading ahead
of the block boundary?
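
Regarding question 1, this is roughly how I plan to wire in the larger
buffer. It is just a sketch: I am assuming the bufferSize property follows
the usual bean convention and exposes a setBufferSize setter, and that the
context is then handed to the block reader operator -- please correct me if
that is not the intended usage.

import org.apache.hadoop.fs.FSDataInputStream;
import com.datatorrent.lib.io.block.ReaderContext;

public class LineReaderSetup
{
  // Sketch: build a LineReaderContext with a 32 MB read buffer so that each
  // block is served by far fewer disk reads than a small default buffer.
  public static ReaderContext.LineReaderContext<FSDataInputStream> newLineContext()
  {
    ReaderContext.LineReaderContext<FSDataInputStream> context =
        new ReaderContext.LineReaderContext<>();
    // Assumption: the bufferSize property has a standard setter.
    context.setBufferSize(32 * 1024 * 1024);
    return context;
  }
}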

---

Similarly, my other use case needs to read fixed-length records from large
HDFS files in parallel.

I found ReaderContext.FixedBytesReaderContext in Malhar, which relates to
this requirement.

As I understand it, the FixedBytesReaderContext.length property controls the
entity length, but there is no direct control over the buffer size in this
case.

Question regarding this:

Does this mean that FixedBytesReaderContext (in its current implementation)
will issue a separate read call for each entity? Is there any scope for a
performance improvement by reading a large chunk of data in a single read
and then splitting it into multiple records (entities)?
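
For reference, this is the kind of chunked read I have in mind. It is a
plain Java sketch, not Malhar code; the record length, chunk size and
stream handling are only illustrative.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ChunkedFixedLengthReader
{
  /**
   * Reads one large chunk from the stream and slices it into fixed-length
   * records, instead of issuing a separate read call per record.
   */
  public static List<byte[]> readChunk(InputStream in, int recordLength, int recordsPerChunk)
      throws IOException
  {
    byte[] chunk = new byte[recordLength * recordsPerChunk];

    // Fill as much of the chunk as the stream provides in this pass.
    int filled = 0;
    while (filled < chunk.length) {
      int n = in.read(chunk, filled, chunk.length - filled);
      if (n == -1) {
        break; // end of stream (or block)
      }
      filled += n;
    }

    // Split the buffered bytes into complete fixed-length records; a trailing
    // partial record (if any) would have to be carried over to the next chunk.
    List<byte[]> records = new ArrayList<>();
    for (int offset = 0; offset + recordLength <= filled; offset += recordLength) {
      byte[] record = new byte[recordLength];
      System.arraycopy(chunk, offset, record, 0, recordLength);
      records.add(record);
    }
    return records;
  }
}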

~ Yogi
