I have a text file that doesn't contain any newline characters. The records are separated by a special character (e.g. '$'). If I push a single 5 GB file to HDFS, how will the framework identify the boundaries on which the file should be split?
What options do I have in such a scenario so that I can run MapReduce jobs?

1. Replace the record separator with a newline? (Not very convincing, as I have newlines in the data.)
2. Create 64 MB chunks by some preprocessing? (I would love to know if this can be avoided.)
3. I can definitely write my own custom loader (InputFormat) for my MapReduce jobs, but even then, is it possible to reach across HDFS nodes if the blocks are not aligned with record boundaries? (A rough sketch of what I mean is in the P.S. below.)

Thanks,
Prasenjit
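P.S. For option 3, here is roughly what I'm imagining. If I understand correctly, newer Hadoop versions let TextInputFormat take a custom record delimiter via the textinputformat.record.delimiter property, and its record reader reads past the end of a split to finish the last record, which might already solve the boundary problem. A minimal, untested sketch (the class name DollarDelimitedJob and the identity-mapper setup are just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DollarDelimitedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Use '$' instead of '\n' as the record delimiter. As I understand it,
            // the record reader reads past the end of its split to complete the
            // last record, so a record straddling an HDFS block boundary should
            // still come back whole.
            conf.set("textinputformat.record.delimiter", "$");

            Job job = Job.getInstance(conf, "dollar-delimited");
            job.setJarByClass(DollarDelimitedJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(Mapper.class); // identity mapper, just to test splitting
            job.setNumReduceTasks(0);         // map-only: emit one output line per record
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If the record reader really does handle a custom delimiter the same way it handles newlines, this would avoid the preprocessing in option 2 entirely. Can anyone confirm?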