When you store data in HDFS it is split into 64 MB blocks automatically. You can use this, together with the split-size settings, to control the number of mappers you get by specifying the size in bytes:

FileInputFormat.setMaxInputSplitSize(job, 2097152);
FileInputFormat.setMinInputSplitSize(job, 1048576);

Each mapper can then read its records and split them into fields, for example:

String[] fields = line.split(",");
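Roughly, a mapper built on that split could look like the following (the class name, output types, and emitted fields are only placeholders for illustration, not something from this thread):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Placeholder mapper: splits each comma-separated record and emits
// the first field as the key and the second as the value.
public class FieldSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length >= 2) {
      context.write(new Text(fields[0]), new Text(fields[1]));
    }
  }
}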
On Thu, Jun 14, 2012 at 10:56 AM, Harsh J <ha...@cloudera.com> wrote:
> You may use TextInputFormat with the "textinputformat.record.delimiter"
> config set to the character you use. This feature is available in the
> Apache Hadoop 2.0.0 release (and perhaps in other distributions that
> carry backports).
>
> In case you don't have a Hadoop cluster with this feature
> (MAPREDUCE-2254), you can read up on how \n is handled and handle your
> files in the same way (swapping \n in LineReader with your character,
> essentially what the above feature does):
> http://wiki.apache.org/hadoop/HadoopMapReduce (see the Map section for
> the logic)
>
> Does this help?
>
> On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
> <prasen....@gmail.com> wrote:
> > I have a text file which doesn't have any newline characters. The
> > records are separated by a special character (e.g. $). If I push a
> > single 5 GB file to HDFS, how will it identify the boundaries on
> > which the file should be split?
> >
> > What are the options I have in such a scenario so that I can run
> > mapreduce jobs:
> >
> > 1. Replace the record separator with a newline? (Not very convincing,
> > as I have newlines in the data.)
> >
> > 2. Create 64 MB chunks by some preprocessing? (Would love to know if
> > it can be avoided.)
> >
> > 3. I can definitely write my custom loader for my mapreduce jobs, but
> > even then, is it possible to reach out across HDFS nodes if the files
> > are not aligned with record boundaries?
> >
> > Thanks,
> > Prasenjit
> >
> > --
> > Sent from my mobile device
>
> --
> Harsh J

--
Thanks & Regards
Sachin Aggarwal
7760502772
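For reference, the record-delimiter approach Harsh describes above could be wired up roughly like this in a driver (the class name and paths are placeholders; assumes Hadoop 2.0.0+ or a distribution carrying MAPREDUCE-2254):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DollarDelimitedJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Treat '$' as the record delimiter instead of '\n'
    // (requires MAPREDUCE-2254, i.e. Hadoop 2.0.0 or a backport).
    conf.set("textinputformat.record.delimiter", "$");

    Job job = Job.getInstance(conf, "dollar-delimited");
    job.setJarByClass(DollarDelimitedJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    // ... set mapper/reducer and output key/value classes as usual ...
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}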