You may use TextInputFormat with the "textinputformat.record.delimiter"
config property set to the delimiter character you use. This feature is
available in the Apache Hadoop 2.0.0 release (and perhaps in other
distributions that carry backports).
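
For example, a minimal map-only driver (an untested sketch against the
2.x new API; the class names, the length-counting mapper, and the paths
are placeholders of mine, not from any release):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DollarDelimitedJob {

  // Illustrative mapper: emits each '$'-separated record with its byte
  // length, just to show that map() now sees whole records (newlines
  // inside a record included).
  public static class RecordLengthMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(record, new IntWritable(record.getLength()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The key bit: make TextInputFormat delimit records on '$' instead
    // of '\n'. Split boundaries are handled for you, as with newlines.
    conf.set("textinputformat.record.delimiter", "$");

    Job job = Job.getInstance(conf, "dollar-delimited-records");
    job.setJarByClass(DollarDelimitedJob.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(RecordLengthMapper.class);
    job.setNumReduceTasks(0); // map-only, for illustration
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the property must be set on the Configuration before the Job
is created, since Job copies the conf at construction time.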

In case you don't have a Hadoop cluster with this feature (it was added
via MAPREDUCE-2254), you can read up on how \n is handled and handle
your files the same way (swapping \n in LineReader for your own
character is essentially what the above feature does):
http://wiki.apache.org/hadoop/HadoopMapReduce (see the Map section for
the split-boundary logic)
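
The crux of that logic is the boundary rule: a reader whose split does
not start at byte 0 backs up one byte and throws away one record (those
bytes belong to the previous split's reader, which reads past its own
end to finish its last record). Below is a tiny Hadoop-free toy of just
that rule, with '$' as the delimiter; class and method names are mine,
purely illustrative, and the real thing lives in LineRecordReader and
LineReader:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SplitBoundarySketch {

  // Reads one '$'-terminated record (delimiter consumed, not returned);
  // returns null at end of stream.
  static String readRecord(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = != -1) {
      if (b == '$') return sb.toString();
      sb.append((char) b);
    }
    return sb.length() > 0 ? sb.toString() : null;
  }

  // Prints the records owned by the half-open byte range [start, end).
  static void readSplit(byte[] data, int start, int end) throws IOException {
    int pos = start;
    // Rule 1: a non-first reader backs up one byte and discards one
    // record. The discarded bytes are the tail of the previous split's
    // last record (or just the previous delimiter, if a record happens
    // to begin exactly at 'start').
    if (start != 0) {
      pos = start - 1;
    }
    InputStream in = new ByteArrayInputStream(data, pos, data.length - pos);
    if (start != 0) {
      pos += readRecord(in).length() + 1;
    }
    // Rule 2: keep emitting while the next record starts before 'end';
    // the last record may run past 'end' into bytes stored on another
    // node, and HDFS streams those to the reader transparently.
    while (pos < end) {
      String rec = readRecord(in);
      if (rec == null) break;
      System.out.println("[" + start + "," + end + ") -> " + rec);
      pos += rec.length() + 1;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "rec1$rec2$rec3$rec4$".getBytes();
    // Three artificial splits that ignore record boundaries entirely:
    readSplit(data, 0, 7);   // -> rec1, rec2
    readSplit(data, 7, 14);  // -> rec3
    readSplit(data, 14, 20); // -> rec4
  }
}

Each of the three artificial splits emits every record exactly once,
even though none of the split points fall on a record boundary. That
also answers your question 3: yes, a reader can pull the trailing bytes
of its last record from whatever node holds the next block.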

Does this help?

On Thu, Jun 14, 2012 at 6:41 AM, prasenjit mukherjee
<prasen....@gmail.com> wrote:
> I have a text file which doesn't have any newline characters. The
> records are separated by a special character (e.g. $). If I push a
> single 5 GB file to HDFS, how will it identify the boundaries on
> which the file should be split?
>
> What are the options I have in such a scenario so that I can run
> MapReduce jobs:
>
> 1. Replace the record separator with a newline? (Not very convincing,
> as I have newlines in the data.)
>
> 2. Create 64 MB chunks by some preprocessing? (Would love to know if
> it can be avoided.)
>
> 3. I can definitely write my own custom loader for my MapReduce jobs,
> but even then, is it possible to reach out across HDFS nodes if the
> files are not aligned with record boundaries?
>
> Thanks,
> Prasenjit
>
> --
> Sent from my mobile device



-- 
Harsh J
