Re: hadoop infrastructure questions (production environment)

Allen Wittenauer Wed, 09 Feb 2011 10:34:29 -0800

On Feb 8, 2011, at 7:45 AM, Oleg Ruchovets wrote:

> Hi, we are going to production and have some questions to ask:
> 
>   We are using the 0.20_append version (as I understand it, this is an HBase
> 0.90 requirement).
> 
> 
>   1) Currently we have to process 50GB of text files per day; it can grow to
> 150GB.
>          -- What is the best hadoop file size for our load, and is there a
> suggested disk block size for that size?


        What is the retention policy?  You'll likely want something bigger than
the default 64 MB block size, though, if the plan is "indefinitely" and you
process these files at every job run.
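
To put rough numbers on the block size question, here is a small Python sketch
of how many HDFS blocks one day's input occupies at a few block sizes. It
assumes the day's data is stored as one large file (many small files would each
occupy at least one block). The 128 MB and 256 MB candidates and the function
names are illustrative assumptions, not recommendations from this thread.

```python
MB = 1024 * 1024
GB = 1024 * MB


def blocks_for(total_bytes: int, block_size: int) -> int:
    """Number of blocks needed to hold total_bytes, rounding up."""
    return (total_bytes + block_size - 1) // block_size


def daily_block_table(daily_gb_values=(50, 150), block_mb_values=(64, 128, 256)):
    """Map (daily GB, block size in MB) to the resulting block count.

    The candidate block sizes are illustrative assumptions.
    """
    table = {}
    for gb in daily_gb_values:
        for mb in block_mb_values:
            table[(gb, mb)] = blocks_for(gb * GB, mb * MB)
    return table


if __name__ == "__main__":
    for (gb, mb), count in sorted(daily_block_table().items()):
        print(f"{gb:>4} GB/day at {mb:>3} MB blocks: {count} blocks")
```

Each block is an object the NameNode has to track, and for splittable input
roughly one map task, so with long retention a larger block size keeps both
counts down.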

>          -- We worked using gz and I saw that for every file 1 map task
> was assigned.

        Correct.  gzip files are not splittable.
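
One way to see why is to try decoding a gzip stream from somewhere other than
its first byte. The sketch below uses only the Python standard library, not
Hadoop; the helper names and the sample data are made up for illustration.

```python
import gzip
import zlib


def can_start_reading_at(compressed: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can begin decoding at `offset`."""
    decoder = zlib.decompressobj(31)  # wbits=31: expect a gzip header
    try:
        decoder.decompress(compressed[offset:])
        return True
    except zlib.error:
        return False


def sample_gzip_blob(lines: int = 10000) -> bytes:
    """Build a gzip-compressed blob of made-up text records."""
    text = b"".join(b"record %d from a text log\n" % i for i in range(lines))
    return gzip.compress(text)


if __name__ == "__main__":
    blob = sample_gzip_blob()
    print("from start: ", can_start_reading_at(blob, 0))
    print("from middle:", can_start_reading_at(blob, len(blob) // 2))
```

A reader dropped into the middle of the stream has neither the header nor the
decoder state built up from the preceding bytes, so the whole .gz file has to
go to a single map task.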

>                  What is the best practice: to work with gz files and save
> disk space, or to work without compression?

        I'd recommend converting them to something else (such as a 
SequenceFile) that can be block compressed.

>                  Let's say we want to get performance benefits and disk
> space is less critical.

        Then uncompress them. :D


>   2)  Currently, when adding a machine to the grid, we need to manually
> maintain all files and configurations.
>         Is it possible to auto-deploy hadoop servers without the need to
> manually define each one on all nodes?

        We basically use bcfg2 to manage different sets of configuration files
(NN (NameNode), JT (JobTracker), DN/TT (DataNode/TaskTracker), and gateway).
When a node is brought up, bcfg2 detects what kind of machine it is from the
disk configuration and applies the appropriate configurations.
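
The general idea of picking a configuration set from hardware facts can be
sketched in a few lines of Python. This is not bcfg2 code and not the actual
rules used here: the fact names (data_disks, mirrored), the thresholds, and the
function name are all hypothetical, chosen only to show the shape of the
decision.

```python
from typing import Callable, Dict, List, Tuple

# A rule pairs a test on the host's disk facts with the config set to apply.
Rule = Tuple[Callable[[Dict], bool], str]

# Hypothetical rules: the thresholds are invented for illustration only.
EXAMPLE_RULES: List[Rule] = [
    (lambda d: d["data_disks"] >= 4, "DN/TT"),
    (lambda d: d["mirrored"] and d["data_disks"] == 2, "NN"),
    (lambda d: d["mirrored"] and d["data_disks"] == 1, "JT"),
]


def classify(facts: Dict, rules: List[Rule] = EXAMPLE_RULES,
             default: str = "gateway") -> str:
    """Return the config set of the first rule that matches, else the default."""
    for matches, role in rules:
        if matches(facts):
            return role
    return default


if __name__ == "__main__":
    new_host = {"data_disks": 6, "mirrored": False}  # made-up probe result
    print(classify(new_host))
```

Keeping the rules as data means a new hardware type only needs a new entry in
the list, not a change to the classification logic.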

        The only files that really need to change when you add or remove
nodes are the ones on the NN and JT.  The other nodes are oblivious to
changes elsewhere in the grid.


>   3) Can we change masters without reinstalling the entire grid?

        Yes.  Depending upon how you manage the service movement (see a 
previous discussion on this last week), at most you'll need to bounce the DN 
and TT processes.
