Re: hadoop infrastructure questions (production environment)

Steve Loughran Wed, 09 Feb 2011 02:38:29 -0800

On 08/02/11 15:45, Oleg Ruchovets wrote:

Hi, we are going to production and have some questions to ask:


    We are using the 0.20_append version (as I understand it, this is an HBase 0.90 requirement).


    1) Currently we have to process 50GB of text files per day; it can grow to 150GB.
           -- What is the best Hadoop file size for our load, and is there a suggested disk block size for that size?

Depends on the number of machines and their performance. The smaller the blocks, the more maps can be assigned blocks and run in parallel, but smaller blocks put more load on the namenode and job tracker.
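
To put rough numbers on that trade-off, here is a minimal sketch of the arithmetic. It assumes splittable input laid out as if it were one large file (real directories of many files will have a partial last block per file), and the block sizes of 64MB (the 0.20 default), 128MB and 256MB are only illustrative choices, not recommendations.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def blocks_for(total_bytes, block_size_bytes):
    """Number of HDFS blocks needed for total_bytes (ceiling division).

    For splittable input this is also roughly the number of map tasks,
    and the number of block entries the namenode has to track.
    """
    return -(-total_bytes // block_size_bytes)


def daily_estimate(daily_gb, block_mb):
    """Blocks (roughly maps) created per day for a given daily volume."""
    return blocks_for(daily_gb * GB, block_mb * MB)


if __name__ == "__main__":
    for daily_gb in (50, 150):          # volumes from the question
        for block_mb in (64, 128, 256):  # illustrative block sizes
            n = daily_estimate(daily_gb, block_mb)
            print("%4d GB/day at %3d MB blocks: about %d blocks/maps"
                  % (daily_gb, block_mb, n))
```

Halving the block size doubles both the available map parallelism and the number of blocks the namenode has to hold in memory, which is the tension described above.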

           -- We worked with gz and I saw that 1 map task was assigned for every file.
                   What is the best practice: to work with gz files and save disk space, or to work without archiving?

Hadoop sequence files can be compressed on a per-block basis. It's not as efficient as gz, but it reduces your storage and network load, and the files stay splittable, so you are not limited to one map per file.
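
The one-map-per-file behaviour seen with gz comes from the format itself: a gzip stream can only be decoded from its start, so a task cannot begin reading at an arbitrary block boundary. A minimal sketch in plain Python (no Hadoop involved; the sample records and the helper name `can_decode_from` are made up for illustration):

```python
import gzip
import zlib


def can_decode_from(blob, offset):
    """Return True if a gzip decoder can start reading blob at offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad or missing header);
        # zlib.error covers a corrupt deflate stream.
        return False


# Made-up sample data standing in for one day's text log.
records = b"".join(b"record %d some log text\n" % i for i in range(20000))
blob = gzip.compress(records)

if __name__ == "__main__":
    print("decode from start: ", can_decode_from(blob, 0))
    print("decode from middle:", can_decode_from(blob, len(blob) // 2))
```

Decoding from the start works, while starting in the middle should fail because there is no header or decoder state to resume from. Per-block compression avoids this by compressing each block independently.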

                   Let's say we want to get performance benefits and disk space is less critical.

    2) Currently, when adding an additional machine to the grid, we need to manually maintain all files and configurations.
          Is it possible to auto-deploy Hadoop servers without the need to manually define each one on all nodes?

That's the only way people do it in production clusters: you use Configuration Management (CM) tools. Which one you use is your choice, but do use one.
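
As a rough illustration of what a CM tool's templating step does (not a replacement for one), the sketch below renders the same two site files for every node from a single definition of the master hostnames. The property names `fs.default.name` and `mapred.job.tracker` are the 0.20 ones; the hostnames, the ports 8020 and 8021, and the function names are assumptions made up for the example.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a Hadoop *-site.xml document from a dict of name -> value."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in sorted(properties.items()):
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(value))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def cluster_configs(namenode_host, jobtracker_host):
    """Build the files every node needs; ports here are assumed values."""
    return {
        "core-site.xml": render_site_xml(
            {"fs.default.name": "hdfs://%s:8020" % namenode_host}),
        "mapred-site.xml": render_site_xml(
            {"mapred.job.tracker": "%s:8021" % jobtracker_host}),
    }


if __name__ == "__main__":
    # Hypothetical hostnames; a CM tool would push these files to all nodes.
    configs = cluster_configs("nn.example.com", "jt.example.com")
    for filename, body in sorted(configs.items()):
        print("==> %s" % filename)
        print(body)
```

Because the master hostnames live in one place, adding a node or moving a master becomes a matter of regenerating and pushing the files rather than editing each machine by hand.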



    3) Can we change masters without reinstalling the entire grid?

-if you can push out a new configuration and restart the workers, you can move the master nodes to any machine in the cluster after a failure.

-if you want to leave the NN and JT hostnames the same but change IP addresses, you need to restart all the workers, and make sure the DNS entries of the master nodes are set to expire rapidly so the OS doesn't cache them for long.

-if you have machines set up with the same hostname and IP addresses, then you can bring them up as the masters; just have the namenode recover the edit log.
