When you store data in HDFS, it is automatically split into 64 MB blocks (the default block size).

Use these settings to control the number of mappers you get by bounding the input split size (values are in bytes):

    FileInputFormat.setMaxInputSplitSize(job, 2097152);   // 2 MB max split
    FileInputFormat.setMinInputSplitSize(job, 1048576);   // 1 MB min split

Now, in your mapper, you can read each record and split it into fields, for example:
    String[] fields = line.split(",");
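
For example, a minimal mapper sketch along these lines (the comma-separated fields and the emitted key are illustrative assumptions, not something from the thread):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each call to map() receives one record as the value; fields are assumed comma-separated.
    public class RecordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // emit the first field as the key with a count of 1
            context.write(new Text(fields[0]), ONE);
        }
    }

Note that lowering the max split size below the HDFS block size only creates more, smaller splits for the mappers; it does not change how the file is stored.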

On Thu, Jun 14, 2012 at 10:56 AM, Harsh J <ha...@cloudera.com> wrote:

> You may use TextInputFormat with "textinputformat.record.delimiter"
> config set to the character you use. This feature is available in the
> Apache Hadoop 2.0.0 release (and perhaps in other distributions that
> carry backports).
>
> In case you don't have a Hadoop cluster with this feature
> (MAPREDUCE-2254), you can read up on how \n is handled and handle your
> files in the same way (swapping \n in LineReader with your character,
> essentially what the above feature does):
> http://wiki.apache.org/hadoop/HadoopMapReduce (See the Map section for
> the logic)
>
> Does this help?
>
> On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
> <prasen....@gmail.com> wrote:
> > I have a textfile which doesn't have any newline characters. The
> > records are separated by a special character ( e.g. $ ). if I push a
> > single file of 5 GB to hdfs, how will it identify the boundaries on
> > which the files should be split ?
> >
> > What are the options I have in such a scenario so that I can run
> > mapreduce jobs:
> >
> > 1. Replace record-separator with new line ? ( Not very convincing as I
> > have newline in the data )
> >
> > 2. Create 64MB chunks by some preprocessing ? ( Would love to know if
> > it can be avoided )
> >
> > 3. I can definitely write my custom loader for my mapreduce jobs, but
> > even then is it possible to reach out across hdfs nodes if the files
> > are not aligned with record boundaries ?
> >
> > Thanks,
> > Prasenjit
> >
> > --
> > Sent from my mobile device
>
>
>
> --
> Harsh J
>
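
For reference, wiring up the delimiter approach Harsh describes above would look roughly like the sketch below (assuming Hadoop 2.0.0+ with MAPREDUCE-2254; the "$" delimiter, class name, and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DollarDelimitedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // make TextInputFormat break records on "$" instead of "\n"
            conf.set("textinputformat.record.delimiter", "$");

            Job job = Job.getInstance(conf, "dollar-delimited records");
            job.setJarByClass(DollarDelimitedJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // set mapper/reducer classes and output key/value types here as usual
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With this, each "$"-terminated chunk arrives as the value of one map() call, and "\n" inside a record is treated as ordinary data.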



-- 

Thanks & Regards

Sachin Aggarwal
7760502772
