Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Karthik Kumar
Hi,

Thanks a lot for your timely help. Your valuable answers helped us to
understand what kind of hardware to use when it comes to huge data.

With Regards,
Karthik

On 7/6/11, Steve Loughran  wrote:
> On 06/07/11 13:18, Michel Segel wrote:
>> Wasn't the answer 42?  ;-P
>
>
> 42 = 40 + NN + 2ary NN, assuming the JT runs on the 2ary or on one of the
> worker nodes
>
>> Looking at your calc...
>> You forgot to factor in the number of slots per node.
>> So the number is only a fraction. Assume 10 slots per node. (10 because it
>> makes the math easier.)
>   
> I thought something was wrong. Then I thought of the server revenue and
> decided not to look that hard.
>


-- 
With Regards,
Karthik


Re: Can't start the namenode

2011-07-06 Thread Mark Kerzner
I kind of found the problem. If I open the logs directory, I see that this
log file is created by the hdfs user

-rw-r--r-- 1 hdfs hdfs 1399 Jul  6 21:48
hadoop-hadoop-namenode-myservername.log

whereas the rest of the logs are created by root, which has no problem
doing this.

I can adjust permissions on the logs directory, but I would expect this
to happen automatically.

On Wed, Jul 6, 2011 at 11:38 PM, Mark Kerzner  wrote:

> Hi,
>
> when I am trying to start a namenode in pseudo-mode
>
> sudo /etc/init.d/hadoop-0.20-namenode start
>
>
> I get a permission error
>
>
> java.io.FileNotFoundException: 
> /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log (Permission 
> denied)
>
>
> However, it does create another log file in the same directory
>
>
> ls /usr/lib/hadoop-0.20/logs
>
> hadoop-hadoop-namenode-myservername.out
>
>
> I am using CDH3, what am I doing wrong?
>
>
> Thank you,
>
> Mark
>
>


Can't start the namenode

2011-07-06 Thread Mark Kerzner
Hi,

when I am trying to start a namenode in pseudo-mode

sudo /etc/init.d/hadoop-0.20-namenode start


I get a permission error


java.io.FileNotFoundException:
/usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log
(Permission denied)


However, it does create another log file in the same directory


ls /usr/lib/hadoop-0.20/logs

hadoop-hadoop-namenode-myservername.out


I am using CDH3, what am I doing wrong?


Thank you,

Mark


Re: tar or hadoop archive

2011-07-06 Thread Manhee Jo
Do you know how to set the number of map/reduce tasks to something other than
1 during hadoop archiving?
I've tried -Dmapred.map.tasks=2 (we are using 0.19.2, actually :( ) but in
vain.


thanks,
manhee

- Original Message - 
From: "Joey Echeverria" 

To: 
Sent: Tuesday, June 28, 2011 8:46 AM
Subject: Re: tar or hadoop archive



Yes, you can see a picture describing HAR files in this old blog post:

http://www.cloudera.com/blog/2009/02/the-small-files-problem/

-Joey

On Mon, Jun 27, 2011 at 4:36 PM, Rita  wrote:

So, it builds an index of the files?



On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria  
wrote:



The advantage of a Hadoop archive file is that it lets you access the
files stored in it directly. For example, if you archived three files
(a.txt, b.txt, c.txt) in an archive called foo.har, you could cat one
of the three files using the hadoop command line:

hadoop fs -cat har:///user/joey/out/foo.har/a.txt

You can also copy files out of the archive or use files in the archive
as input to map reduce jobs.
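
Since har:// is just another FileSystem scheme, the same file can also be
read from Java -- a small untested sketch, reusing the made-up example path
from the command above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HarCatSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // same example path as the hadoop fs -cat command above
    Path p = new Path("har:///user/joey/out/foo.har/a.txt");
    // resolves through HarFileSystem, layered over the default filesystem
    FileSystem fs = p.getFileSystem(conf);
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}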

-Joey

On Mon, Jun 27, 2011 at 3:06 AM, Rita  wrote:
> We use hadoop/hdfs to archive data. I archive a lot of files by creating
> one large tar file and then placing it into hdfs. Is it better to use hadoop
> archive for this or is it essentially the same thing?
>
> --
> --- Get your facts first, then you can distort them as you please.--
>



--
Joseph Echeverria
Cloudera, Inc.
443.305.9434





--
--- Get your facts first, then you can distort them as you please.--





--
Joseph Echeverria
Cloudera, Inc.
443.305.9434






Setting up users for CDH3

2011-07-06 Thread Mark Kerzner
-- Forwarded message --
From: "Mark Kerzner" 
Date: Jul 6, 2011 6:13 PM
Subject: Setting up users for CDH3
To: "CDH Users" 

Hi,

what are the best practices to set up user accounts for CDH3? I know that
CDH creates separate accounts for hdfs, mapred, etc. Do I need to set up
password-less ssh for hdfs? For mapred? Are there tools that automate that?

Thank you,
Mark


Re: Job Priority Hadoop 0.20.203

2011-07-06 Thread Allen Wittenauer

On Jul 6, 2011, at 5:22 AM, Nitin Khandelwal wrote:

> Hi,
> 
> I am using Hadoop 0.20.203 with the new API (mapreduce package). I want to
> use JobPriority, but unfortunately there is no option to set it in Job
> (the option is there in 0.21.0). Can somebody please tell me if there is a
> workaround to set job priority?


 Job priority is slowly (read: unofficially) on its way to being deprecated,
if one goes by the fact that the capacity scheduler now completely ignores it
in 0.20.203. I, too, am sad about this.

Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 13:18, Michel Segel wrote:

Wasn't the answer 42?  ;-P



42 = 40 + NN + 2ary NN, assuming the JT runs on the 2ary or on one of the 
worker nodes



Looking at your calc...
You forgot to factor in the number of slots per node.
So the number is only a fraction. Assume 10 slots per node. (10 because it 
makes the math easier.)


I thought something was wrong. Then I thought of the server revenue and 
decided not to look that hard.


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread M. C. Srivas
We ran the following on a 10+1 machine cluster (2-quad core, 24G DRAM,
12x2TB drives, 2 NICs each) running the 0.20.2  release

- 3.5TB terasort took ~4.5 hrs
- 10TB terasort  took ~12.5 hrs
- 20TB terasort  took > 24hrs

So yeah, Hadoop can handle it. If you want faster times, you'll have to try
- adding more machines
- using some other distro
- or both

On Wed, Jul 6, 2011 at 3:43 AM, Karthik Kumar wrote:

> Hi,
>
> Has anyone here used hadoop to process more than 3TB of data? If so we
> would like to know how many machines you used in your cluster and
> about the hardware configuration. The objective is to know how to
> handle huge data in Hadoop cluster.
>
> --
> With Regards,
> Karthik
>


Re: One file per mapper

2011-07-06 Thread Edward Capriolo
On Tue, Jul 5, 2011 at 5:28 PM, Jim Falgout wrote:

> I've done this before by placing the name of each file to process into a
> single file (newline separated) and using the NLineInputFormat class as the
> input format. Run your job with the single file with all of the file names
> to process as the input. Each mapper will then be handed one line (this is
> tunable) from the single input file. The line will contain the name of the
> file to process.
>
> You can also write your own InputFormat class that creates a split for each
> file.
>
> Both of these options have scalability issues which begs the question: why
> one file per mapper?
>
> -Original Message-
> From: Govind Kothari [mailto:govindkoth...@gmail.com]
> Sent: Tuesday, July 05, 2011 3:04 PM
> To: common-user@hadoop.apache.org
> Subject: One file per mapper
>
> Hi,
>
> I am new to hadoop. I have a set of files and I want to assign each file to
> a mapper. Also in mapper there should be a way to know the complete path of
> the file. Can you please tell me how to do that ?
>
> Thanks,
> Govind
>
> --
> Govind Kothari
> Graduate Student
> Dept. of Computer Science
> University of Maryland College Park
>
> <---Seek Excellence, Success will Follow --->
>
>
You can also do this with the MultipleInputs and MultipleOutputs classes. Each
source file can have a different mapper.
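
For what it's worth, here is a rough, untested sketch of Jim's
NLineInputFormat suggestion with the old (mapred) API -- the listing file
name, output path and class name below are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneFilePerMapperSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OneFilePerMapperSketch.class);
    // the job input is a listing file with one HDFS path per line
    conf.setInputFormat(NLineInputFormat.class);
    // hand each mapper exactly one line of that listing
    conf.setInt("mapred.line.input.format.linespermap", 1);
    FileInputFormat.setInputPaths(conf, new Path("file-list.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("out"));
    // conf.setMapperClass(...): inside the mapper, value.toString() is the
    // full path of the file that mapper should open and process
    // JobClient.runJob(conf);
  }
}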


Re: Job Priority Hadoop 0.20.203

2011-07-06 Thread Harsh J
Nitin,

The workaround is to set "mapred.job.priority" to the JobPriority enum's
string value (#toString is sufficient) in your Job's Configuration instance.
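
A minimal, untested sketch of that (the class and job names are made up;
the old-API JobPriority enum is used purely for its string value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobPriority;
import org.apache.hadoop.mapreduce.Job;

public class PriorityJobSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JobPriority.HIGH.toString() is just the string "HIGH";
    // VERY_HIGH, NORMAL, LOW and VERY_LOW work the same way
    conf.set("mapred.job.priority", JobPriority.HIGH.toString());
    // build the Job from that Configuration, then submit as usual
    Job job = new Job(conf, "priority-sketch");
    // ... set jar, mapper, reducer and input/output paths ...
    // job.waitForCompletion(true);
  }
}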

On Wed, Jul 6, 2011 at 5:52 PM, Nitin Khandelwal
 wrote:
> Hi,
>
> I am using Hadoop 0.20.203 with the new API (mapreduce package). I want to
> use JobPriority, but unfortunately there is no option to set it in Job
> (the option is there in 0.21.0). Can somebody please tell me if there is a
> workaround to set job priority?
>
> Thanks,
>
> --
>
> Nitin Khandelwal
>



-- 
Harsh J


Job Priority Hadoop 0.20.203

2011-07-06 Thread Nitin Khandelwal
Hi,

I am using Hadoop 0.20.203 with the new API (mapreduce package). I want to
use JobPriority, but unfortunately there is no option to set it in Job
(the option is there in 0.21.0). Can somebody please tell me if there is a
workaround to set job priority?

Thanks,

-- 

Nitin Khandelwal


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Michel Segel
Wasn't the answer 42?  ;-P

Looking at your calc...
You forgot to factor in the number of slots per node.
So the number is only a fraction. Assume 10 slots per node. (10 because it 
makes the math easier.)

Then you need only 300 machines. You could then name your cluster lambda. 
(another literary reference...)

300 machines is a manageable cluster.

I agree that the initial question is vague and the only true answer is 'it 
depends...'
But if they want to build out a cluster of 300 machines... I've got a guy... :-)



Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 6, 2011, at 6:32 AM, Steve Loughran  wrote:

> On 06/07/11 11:43, Karthik Kumar wrote:
>> Hi,
>> 
>> Has anyone here used hadoop to process more than 3TB of data? If so we
>> would like to know how many machines you used in your cluster and
>> about the hardware configuration. The objective is to know how to
>> handle huge data in Hadoop cluster.
>> 
> 
> Actually, I've just thought of a simpler answer. 40. It's completely random,
> but if said with confidence it's as valid as any other answer to your current
> question.
> 


Re: parallel cat

2011-07-06 Thread Steve Loughran

On 06/07/11 11:08, Rita wrote:

I have many large files ranging from 2GB to 800GB and I use hadoop fs -cat a
lot to pipe to various programs.

I was wondering if it's possible to prefetch the data for clients with more
bandwidth. Most of my clients have a 10Gb interface and the datanodes are 1Gb.

I was thinking: prefetch x blocks (even though it will cost extra memory)
while reading block y. After block y is read, read the prefetched block
and then throw it away.

It should be used like this:


export PREFETCH_BLOCKS=2   # default would be 1
hadoop fs -pcat hdfs://namenode/verylargefile | program

Any thoughts?



Look at Russ Perry's work on doing very fast fetches from an HDFS filestore
http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf

Here the DFS client got some extra data on where every copy of every 
block was, and the client decided which machine to fetch it from. This 
made the best use of the entire cluster, by keeping each datanode busy.
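
That isn't a stock HDFS feature, but the public FileSystem API already hands
out the raw information such a client needs. A quick untested sketch that
just prints where every replica of every block of a file lives:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path(args[0]);  // e.g. hdfs://namenode/some/large/file
    FileSystem fs = p.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(p);
    // ask the namenode for the hosts holding every replica of every block
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b.getOffset() + "+" + b.getLength()
          + " -> " + Arrays.toString(b.getHosts()));
    }
  }
}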



-steve


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 11:43, Karthik Kumar wrote:

Hi,

Has anyone here used hadoop to process more than 3TB of data? If so we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to know how to
handle huge data in Hadoop cluster.



Actually, I've just thought of a simpler answer. 40. It's completely 
random, but if said with confidence it's as valid as any other answer to 
your current question.


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Steve Loughran

On 06/07/11 11:43, Karthik Kumar wrote:

Hi,

Has anyone here used hadoop to process more than 3TB of data? If so we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to know how to
handle huge data in Hadoop cluster.



This is too vague a question. What do you mean by "process"? Scan through 
some logs looking for values? You could do that on a single machine if 
you weren't in a rush and had enough disks; you'd just be very IO 
bound, and to be honest HDFS needs a minimum number of machines to 
become fault tolerant. Do complex matrix operations that use lots of RAM 
and CPU? You'll need more machines.


If your cluster has a block size of 512MB, then a 3TB file fits into 
(3*1024*1024)/512 = 6144 blocks, so you can't have more than 6144 
machines anyway - that's your theoretical maximum, even if your name is 
Facebook or Yahoo!


What you are looking for is something in between 10 and 6144, the exact 
number driven by
 -how much compute you need to do, and how fast you want it done 
(controls #of CPUs, RAM)

 -how much total HDD storage you anticipate needing
 -whether you want to do leading-edge GPU work (good performance on 
some tasks, but limited work per machine)


You can use benchmarking tools like gridmix3 to get some more data on 
the characteristics of your workload, which you can then take to your 
server supplier to say "this is what we need, what can you offer?" 
Otherwise everyone is just guessing.


Remember also that you can add more racks later, but you will need to 
plan ahead on datacentre space, power and -very importantly- how you are 
going to expand the networking. Life is simplest if everything fits into 
one rack, but if you plan to expand you need to have a roadmap of how to 
connect that rack to some new ones, which means adding fast interconnect 
between different top of rack switches. You also need to worry about how 
to get data in and out fast.



-Steve


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Harsh J
Karthik,

That's a highly process-dependent question, I think -- what you would
do with the data determines the time it takes. No two
applications are the same, in my belief.

On Wed, Jul 6, 2011 at 4:35 PM, Karthik Kumar  wrote:
> Hi,
>
> I wanted to know the time required to process huge datasets and number
> of machines used for them.
>
> On 7/6/11, Harsh J  wrote:
>> Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
>> contains information relevant to your question, if not a detailed
>> answer.
>>
>> On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar 
>> wrote:
>>> Hi,
>>>
>>> Has anyone here used hadoop to process more than 3TB of data? If so we
>>> would like to know how many machines you used in your cluster and
>>> about the hardware configuration. The objective is to know how to
>>> handle huge data in Hadoop cluster.
>>>
>>> --
>>> With Regards,
>>> Karthik
>>>
>>
>>
>>
>> --
>> Harsh J
>>
>
>
> --
> With Regards,
> Karthik
>



-- 
Harsh J


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Karthik Kumar
Hi,

I wanted to know the time required to process huge datasets and number
of machines used for them.

On 7/6/11, Harsh J  wrote:
> Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
> contains information relevant to your question, if not a detailed
> answer.
>
> On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar 
> wrote:
>> Hi,
>>
>> Has anyone here used hadoop to process more than 3TB of data? If so we
>> would like to know how many machines you used in your cluster and
>> about the hardware configuration. The objective is to know how to
>> handle huge data in Hadoop cluster.
>>
>> --
>> With Regards,
>> Karthik
>>
>
>
>
> --
> Harsh J
>


-- 
With Regards,
Karthik


Re: Hadoop cluster hardware details for big data

2011-07-06 Thread Harsh J
Have you taken a look at http://wiki.apache.org/hadoop/PoweredBy? It
contains information relevant to your question, if not a detailed
answer.

On Wed, Jul 6, 2011 at 4:13 PM, Karthik Kumar  wrote:
> Hi,
>
> Has anyone here used hadoop to process more than 3TB of data? If so we
> would like to know how many machines you used in your cluster and
> about the hardware configuration. The objective is to know how to
> handle huge data in Hadoop cluster.
>
> --
> With Regards,
> Karthik
>



-- 
Harsh J


Hadoop cluster hardware details for big data

2011-07-06 Thread Karthik Kumar
Hi,

Has anyone here used hadoop to process more than 3TB of data? If so we
would like to know how many machines you used in your cluster and
about the hardware configuration. The objective is to know how to
handle huge data in Hadoop cluster.

-- 
With Regards,
Karthik


parallel cat

2011-07-06 Thread Rita
I have many large files ranging from 2GB to 800GB and I use hadoop fs -cat a
lot to pipe to various programs.

I was wondering if it's possible to prefetch the data for clients with more
bandwidth. Most of my clients have a 10Gb interface and the datanodes are 1Gb.

I was thinking: prefetch x blocks (even though it will cost extra memory)
while reading block y. After block y is read, read the prefetched block
and then throw it away.

It should be used like this:


export PREFETCH_BLOCKS=2   # default would be 1
hadoop fs -pcat hdfs://namenode/verylargefile | program

Any thoughts?

-- 
--- Get your facts first, then you can distort them as you please.--