Re: PIG and Hive

2009-05-07 Thread Amr Awadallah

Yiping,

 (1) Any ETA for when that will become available?

 (2) Where can we read more about the SQL functionality it will support?

 (3) Where is the JIRA for this?

Thanks,

-- amr

Luc Hunt wrote:

Ricky,

One thing to mention is, SQL support is on the Pig roadmap this year.


--Yiping

On Wed, May 6, 2009 at 9:11 PM, Ricky Ho r...@adobe.com wrote:

  

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an
algorithm design / prototyping tool) to express my parallel algorithms in a
concise way.  The deeper I look into these, the stronger my feeling that
PIG and HIVE are competitors rather than complementing each other.  I think
a large set of problems can be done either way, without much difference
in terms of skillset requirements.

At this moment, I am focused on the richness of the language model rather
than the implementation optimization.  Supporting collections as well as
the flatten operation in the language model seems to make PIG more powerful.
 Yes, you can achieve the same thing in Hive but then it starts to look odd.
 Am I missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many
fewer map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and
in the future they won't differ much.  But for now, Pig optimizes much
better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5
map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to
make that one pass, but those sort of performance optimizations aren't
there
yet.  That is expected, it is a younger project.

It would be useful if more of these higher level tools shared work on the
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?)
could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, Olga Natkovich ol...@yahoo-inc.com wrote:



Hi Ricky,

This is how the code will look in Pig.

A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';

Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
explains how the example above works.

Let me know if you have further questions.

Olga


  

-Original Message-
From: Ricky Ho [mailto:r...@adobe.com]
Sent: Wednesday, May 06, 2009 3:56 PM
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks Amr,

Without knowing the details of Hive, one constraint of the SQL
model is that you can never generate more than one record from a
single record.  I don't know how this is done in Hive.
Another question is whether the Hive script can take in
user-defined functions?

Using the following word count as an example, can you show
me how the Pig script and Hive script would look?

Map:
  Input: a line (a collection of words)
  Output: multiple [word, 1]

Reduce:
  Input: [word, [1, 1, 1, ...]]
  Output: [word, count]

Rgds,
Ricky

-Original Message-
From: Amr Awadallah [mailto:a...@cloudera.com]
Sent: Wednesday, May 06, 2009 3:14 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive



 The difference between PIG and Hive seems to be pretty insignificant.

The difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require underlying structure to the data;
Hive does imply structure via a metastore. This has its pros
and cons. It allows Pig to be more suitable for ETL-kind
tasks where the input data is still a mish-mash and you want
to convert it to be structured. On the other hand, Hive's
metastore provides a dictionary that lets you easily see what
columns exist in which tables, which can be very handy.

(2) Pig is a new language, easy to learn if you know
languages similar to Perl. Hive is a subset of SQL with very
simple variations to enable map-reduce-like computation. So,
if you come from a SQL background you will find Hive QL
extremely easy to pick up (many of your SQL queries will run
as is), while if you come from a procedural programming
background (w/o SQL knowledge) then Pig will be much more
suitable for you. Furthermore, Hive is a bit easier to
integrate with other systems and tools since it speaks the
language they already speak (i.e. SQL).

You're right that HBase is a completely different game. HBase
is not about being a high-level language that compiles to
map-reduce; HBase is about allowing Hadoop to support

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Piotr Praczyk
If you want to use many small files, they probably have the same
purpose and structure.
Why not use HBase instead of raw HDFS? Many small files would be packed
together and the problem would disappear.

cheers
Piotr

2009/5/7 Jonathan Cao jonath...@rockyou.com

 There are at least two design choices in Hadoop that have implications for
 your scenario.
 1. All the HDFS metadata is stored in name node memory -- the memory size
 is one limitation on how many small files you can have.

 2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer
 job has enough work to offset the overhead of spawning the job.  It relies
 on each task reading a contiguous chunk of data (typically 64MB); your small
 file situation will change those efficient sequential reads into a larger
 number of inefficient random reads.

 Of course, "small" is a relative term.

 Jonathan

 2009/5/6 陈桂芬 chenguifen...@163.com

  Hi:

  In my application, there are many small files. But Hadoop is designed
  to deal with many large files.

  I want to know why Hadoop doesn't support small files very well and where
  the bottleneck is. And what can I do to improve Hadoop's capability of
  dealing with small files.

  Thanks.
 
 



Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Jeff Hammerbacher
Hey,

You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

Regards,
Jeff

2009/5/7 Piotr Praczyk piotr.prac...@gmail.com

 If You want to use many small files, they are probably having the same
 purpose and struc?
 Why not use HBase instead of a raw HDFS ? Many small files would be packed
 together and the problem would disappear.

 cheers
 Piotr

 2009/5/7 Jonathan Cao jonath...@rockyou.com

  There are at least two design choices in Hadoop that have implications
 for
  your scenario.
  1. All the HDFS meta data is stored in name node memory -- the memory
 size
  is one limitation on how many small files you can have
 
  2. The efficiency of map/reduce paradigm dictates that each
 mapper/reducer
  job has enough work to offset the overhead of spawning the job.  It
 relies
  on each task reading contiguous chuck of data (typically 64MB), your
 small
  file situation will change those efficient sequential reads to larger
  number
  of inefficient random reads.
 
  Of course, small is a relative term?
 
  Jonathan
 
  2009/5/6 陈桂芬 chenguifen...@163.com
 
   Hi:
  
   In my application, there are many small files. But the hadoop is
 designed
   to deal with many large files.
  
   I want to know why hadoop doesn’t support small files very well and
 where
   is the bottleneck. And what can I do to improve the Hadoop’s capability
  of
   dealing with small files.
  
   Thanks.
  
  
 



Re: Hadoop internal details

2009-05-07 Thread Nitay
This is better directed at the Hadoop mailing lists. I've added the Hadoop
core-user mailing list to your query.

Cheers,
-n

On Thu, May 7, 2009 at 1:11 AM, monty123 mayurchou...@yahoo.com wrote:


 My query is how Hadoop manages map files, files, etc. What
 internal data structure does it use to manage things?
 Is it a graph or something?
 Please help.

 --
 View this message in context:
 http://www.nabble.com/Hadoop-internal-details-tp23422156p23422156.html
 Sent from the HBase User mailing list archive at Nabble.com.




Re: About Hadoop optimizations

2009-05-07 Thread Tom White
On Thu, May 7, 2009 at 6:05 AM, Foss User foss...@gmail.com wrote:
 Thanks for your response again. I could not understand a few things in
 your reply. So, I want to clarify them. Please find my questions
 inline.

 On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote:
 On Wed, May 6, 2009 at 1:46 PM, Foss User foss...@gmail.com wrote:
 2. Is the meta data for file blocks on data node kept in the
 underlying OS's file system on namenode or is it kept in RAM of the
 name node?


 The block locations are kept in the RAM of the name node, and are updated
 whenever a Datanode does a block report. This is why the namenode is in
 safe mode at startup until it has received block locations for some
 configurable percentage of blocks from the datanodes.


 What is safe mode in namenode? This concept is new to me. Could you
 please explain this?

Safe mode is described here:
http://hadoop.apache.org/core/docs/r0.20.0/hdfs_design.html#Safemode




 3. If no more mapper functions can be run on the node that
 contains the data on which the mapper has to act, is Hadoop
 intelligent enough to run the new mappers on some machines within the
 same rack?


 Yes, assuming you have configured a network topology script. Otherwise,
 Hadoop has no magical knowledge of your network infrastructure, and it
 treats the whole cluster as a single rack called /default-rack


 Is it a network topology script or is it a Java plugin code? AFAIK, we
 need to write an implementation of
 org.apache.hadoop.net.DNSToSwitchMapping interface. Can we write it as
 a script or configuration file and avoid Java coding to achieve this?
 If so, how?


To tell Hadoop about your network topology you can either write a Java
implementation of org.apache.hadoop.net.DNSToSwitchMapping or you can
write a script in another language. There are more details at
http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html#Hadoop+Rack+Awareness
and a sample script at
http://www.nabble.com/Hadoop-topology.script.file.name-Form-td17683521.html
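
For reference, a minimal, untested sketch of the Java route; the IP prefixes and
rack names below are made up for illustration, and it would (if memory serves) be
wired in via the topology.node.switch.mapping.impl property rather than
topology.script.file.name:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.net.DNSToSwitchMapping;

// Maps node names/IPs to rack ids. The prefix-to-rack table below is purely
// illustrative; replace it with your real network layout.
public class SimpleRackMapping implements DNSToSwitchMapping {

  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      if (name.startsWith("10.31.1.")) {
        racks.add("/room1/rack1");
      } else if (name.startsWith("10.31.2.")) {
        racks.add("/room1/rack2");
      } else {
        racks.add("/default-rack");  // fall back to the single default rack
      }
    }
    return racks;
  }
}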


RE: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread JQ Hadoop
I believe the JRockit JVM has a slightly higher startup time than the Sun JVM,
but that should not make a lot of difference, especially if JVMs are reused in
0.19.

Which Hadoop version are you using?  What Hadoop job are you running? And
what performance do you get?

Thanks,
JQ

-Original Message-
From: Grace
Sent: Wednesday, May 06, 2009 1:07 PM
To: core-user@hadoop.apache.org
Subject: Is there any performance issue with Jrockit JVM for Hadoop

Hi all,
This is Grace.
I am replacing the Sun JVM with the JRockit JVM for Hadoop. I keep all the same
Java options and configuration as with the Sun JVM.  However, it is very strange
that the performance using the JRockit JVM is poorer than with Sun; for example,
the map stage became slower.
Has anyone encountered a similar problem? Could you please give some
advice about it? Thanks a lot.

Regards,
Grace


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Grace
I am running the test on 0.18.1 and 0.19.1. Both versions have the same
issue with the JRockit JVM.  It is the example sort job, sorting 20G of data on
1+2 nodes.

Following is the result (version 0.18.1). The sort job running with the JRockit
JVM took 260 secs more than with the Sun JVM.
---
|| JVM  || Completion Time ||
---
|| JRockit  ||  786,315 msec||
|| Sun   ||  526,602 msec ||
---

Furthermore, under version 0.19.1, I have set the JVM reuse parameter to
-1. It shows no improvement for the JRockit JVM.

On Thu, May 7, 2009 at 4:32 PM, JQ Hadoop jq.had...@gmail.com wrote:

 I believe Jrockit JVM have slightly higer startup time than the SUN JVM;
 but
 that should not make a lot of difference, especially if JVMs are reused in
 0.19.

 Which Hadoop version are you using?  What Hadoop job are you running? And
 what performance do you get?

 Thanks,
 JQ

 -Original Message-
 From: Grace
 Sent: Wednesday, May 06, 2009 1:07 PM
 To: core-user@hadoop.apache.org
 Subject: Is there any performance issue with Jrockit JVM for Hadoop

 Hi all,
 This is Grace.
 I am replacing Sun JVM with Jrockit JVM for Hadoop. Also I keep all the
 same
 Java options and configuration as Sun JVM.  However it is very strange that
 the performance using Jrockit JVM is poorer than the one using Sun, such as
 the map stage became slower.
 Has anyone encountered the similar problem? Could you please give some
 advise about it? Thanks a lot.

 Regards,
 Grace



Hadoop internal details

2009-05-07 Thread monty123

My query is how Hadoop manages map files, files, etc. What
internal data structure does it use to manage things?
Is it a graph or something?
Please help.
-- 
View this message in context: 
http://www.nabble.com/Hadoop-internal-details-tp23423618p23423618.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



setGroupingComparatorClass() or setOutputValueGroupingComparator() does not work for Combiner

2009-05-07 Thread zsongbo
Hi all,
I have an application that wants sorting and grouping to use
different comparators.

I have tested this on 0.19.1 and 0.20.0, but in both versions it does not work
for the combiner.

In 0.19.1, I use job.setOutputValueGroupingComparator(), and
in 0.20.0, I use job.setGroupingComparatorClass().

This works for the reduce phase: the reduce phase groups the keys with the
above comparator and sorts with the default comparator of the key class.

But I want the combiner to be able to use a separate comparator for grouping,
different from the one used for sorting. Is that possible?

Schubert
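
Not an answer to the combiner part, but for reference, a minimal untested sketch
of the kind of grouping comparator being discussed, assuming a hypothetical
composite Text key of the form primary#secondary:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite Text keys of the (hypothetical) form "primary#secondary"
// by the primary part only; the sort comparator still orders the full key.
public class PrimaryGroupingComparator extends WritableComparator {

  protected PrimaryGroupingComparator() {
    super(Text.class, true); // true => instantiate keys for compare()
  }

  public int compare(WritableComparable a, WritableComparable b) {
    String left = a.toString().split("#", 2)[0];
    String right = b.toString().split("#", 2)[0];
    return left.compareTo(right);
  }
}

// Wiring:
//   0.20 (new API): job.setGroupingComparatorClass(PrimaryGroupingComparator.class);
//   0.19 (old API): jobConf.setOutputValueGroupingComparator(PrimaryGroupingComparator.class);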


All keys went to single reducer in WordCount program

2009-05-07 Thread Foss User
I have two reducers running on two different machines. I ran the
example word count program with some of my own System.out.println()
statements to see what is going on.

There were 2 slaves each running datanode as well as tasktracker.
There was one namenode and one jobtracker. I know there is a very
elaborate setup for such a small cluster but I did it only to learn.

I gave two input files, a.txt and b.txt, with a few lines of English
text. Now, here are my questions.

(1) I found that three mapper tasks ran, all in the first slave. The
first task processed the first file. The second task processed the
second file. The third task didn't process anything. Why is it that
the third task did not process anything? Why was this task created in
the first place?

(2) I found only one reducer task, on the second slave. It processed
all the values for keys. keys were words in this case of Text type. I
tried printing out the key.hashCode() for each key and some of them
were even and some of them were odd. I was expecting the keys with
even hashcodes to go to one slave and the others to go to another
slave. Why didn't this happen?


Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Miles Osborne
with such a small data set who knows what will happen:  you are
probably hitting minimal limits of some kind

repeat this with more data

Miles

2009/5/7 Foss User foss...@gmail.com:
 I have two reducers running on two different machines. I ran the
 example word count program with some of my own System.out.println()
 statements to see what is going on.

 There were 2 slaves each running datanode as well as tasktracker.
 There was one namenode and one jobtracker. I know there is a very
 elaborate setup for such a small cluster but I did it only to learn.

 I gave two input files, a.txt and b.txt with a few lines of english
 text. Now, here are my questions.

 (1) I found that three mapper tasks ran, all in the first slave. The
 first task processed the first file. The second task processed the
 second file. The third task didn't process anything. Why is it that
 the third task did not process anything? Why was this task created in
 the first place?

 (2) I found only one reducer task, on the second slave. It processed
 all the values for keys. keys were words in this case of Text type. I
 tried printing out the key.hashCode() for each key and some of them
 were even and some of them were odd. I was expecting the keys with
 even hashcodes to go to one slave and the others to go to another
 slave. Why didn't this happen?




-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: move tasks to another machine on the fly

2009-05-07 Thread Sharad Agarwal


 Just one more question: does Hadoop handle reassigning failed tasks
 to different machines in some way?
Yes. If a task fails, it is retried, preferably on a different machine.


 I saw that sometimes, usually at the end, when there are more
 processing units available than map() tasks to process, the same
 map() task might be processed twice, and then one is killed when the
 other finishes first.
This is called speculative execution. The JobTracker monitors the progress
of tasks, and if progress for an individual task is slow, it launches
another attempt. Whichever finishes first is used and the other
one is killed.

- Sharad
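
If the duplicate attempts are unwanted, speculative execution can also be
switched off per job; a small sketch against the 0.19-era JobConf API (the class
name is just a placeholder):

import org.apache.hadoop.mapred.JobConf;

public class DisableSpeculation {
  public static void main(String[] args) {
    // Disable speculative execution for both map and reduce tasks of a job.
    // Equivalent to setting mapred.map.tasks.speculative.execution and
    // mapred.reduce.tasks.speculative.execution to false.
    JobConf conf = new JobConf(DisableSpeculation.class);
    conf.setMapSpeculativeExecution(false);
    conf.setReduceSpeculativeExecution(false);
    System.out.println("map speculation: " + conf.getMapSpeculativeExecution());
  }
}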


Re: PIG and Hive

2009-05-07 Thread Alan Gates

SQL has been on Pig's roadmap for some time, see 
http://wiki.apache.org/pig/ProposedRoadMap

We would like to add SQL support to Pig sometime this year.  We don't  
have an ETA or a JIRA for it yet.


Alan.

On May 6, 2009, at 11:20 PM, Amr Awadallah wrote:


Yiping,

(1) Any ETA for when that will become available?
(2) Where can we read more about the SQL functionality it will  
support?


(3) Where is the JIRA for this?

Thanks,

-- amr

Luc Hunt wrote:

Ricky,

One thing to mention is, SQL support is on the Pig roadmap this year.


--Yiping

On Wed, May 6, 2009 at 9:11 PM, Ricky Ho r...@adobe.com wrote:



Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an
algorithm design / prototyping tool) to express my parallel algorithms in a
concise way.  The deeper I look into these, the stronger my feeling that
PIG and HIVE are competitors rather than complementing each other.  I think
a large set of problems can be done either way, without much difference
in terms of skillset requirements.

At this moment, I am focused on the richness of the language model rather
than the implementation optimization.  Supporting collections as well as
the flatten operation in the language model seems to make PIG more powerful.
Yes, you can achieve the same thing in Hive but then it starts to look odd.

Am I missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below)  
into many

fewer map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations  
converge and
in the future they won't differ much.  But for now, Pig optimizes  
much

better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and  
a.y = b.y

group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did  
it in 5

map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations  
necessary to
make that one pass, but those sort of performance optimizations  
aren't

there
yet.  That is expected, it is a younger project.

It would be useful if more of these higher level tools shared work  
on the
various optimizations.  Pig and Hive (and perhaps CloudBase and  
Cascading?)

could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, Olga Natkovich ol...@yahoo-inc.com wrote:



Hi Ricky,

This is how the code will look in Pig.

A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';

Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
explains how the example above works.

Let me know if you have further questions.

Olga




-Original Message-
From: Ricky Ho [mailto:r...@adobe.com]
Sent: Wednesday, May 06, 2009 3:56 PM
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks Amr,

Without knowing the details of Hive, one constraint of SQL
model is you can never generate more than one records from a
single record.  I don't know how this is done in Hive.
Another question is whether the Hive script can take in
user-defined functions ?

Using the following word count as an example.  Can you show
me how the Pig script and Hive script looks like ?

Map:
 Input: a line (a collection of words)
 Output: multiple [word, 1]

Reduce:
 Input: [word, [1, 1, 1, ...]]
 Output: [word, count]

Rgds,
Ricky

-Original Message-
From: Amr Awadallah [mailto:a...@cloudera.com]
Sent: Wednesday, May 06, 2009 3:14 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive



The difference between PIG and Hive seems to be pretty insignificant.

The difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require underlying structure to the data,
Hive does imply structure via a metastore. This has it pros
and cons. It allows Pig to be more suitable for ETL kind
tasks where the input data is still a mish-mash and you want
to convert it to be structured. On the other hand, Hive's
metastore provides a dictionary that lets you easily see what
columns exist in which tables which can be very handy.

(2) Pig is a new language, easy to learn if you know
languages similar to Perl. Hive is a sub-set of SQL with very
simple variations to enable map-reduce like computation. So,
if you come from a SQL background you will find Hive QL
extremely easy to pickup (many of your SQL queries will run
as is), while if you come from a procedural programming
background (w/o SQL knowledge) then Pig will be much more
suitable for you. Furthermore, Hive is a bit easier to
integrate 

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
 Hey,

 You can read more about why small files are difficult for HDFS at
 http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.

 Regards,
 Jeff

 2009/5/7 Piotr Praczyk piotr.prac...@gmail.com

 If You want to use many small files, they are probably having the same
 purpose and struc?
 Why not use HBase instead of a raw HDFS ? Many small files would be packed
 together and the problem would disappear.

 cheers
 Piotr

 2009/5/7 Jonathan Cao jonath...@rockyou.com

  There are at least two design choices in Hadoop that have implications
 for
  your scenario.
  1. All the HDFS meta data is stored in name node memory -- the memory
 size
  is one limitation on how many small files you can have
 
  2. The efficiency of map/reduce paradigm dictates that each
 mapper/reducer
  job has enough work to offset the overhead of spawning the job.  It
 relies
  on each task reading contiguous chuck of data (typically 64MB), your
 small
  file situation will change those efficient sequential reads to larger
  number
  of inefficient random reads.
 
  Of course, small is a relative term?
 
  Jonathan
 
  2009/5/6 陈桂芬 chenguifen...@163.com
 
   Hi:
  
   In my application, there are many small files. But the hadoop is
 designed
   to deal with many large files.
  
   I want to know why hadoop doesn't support small files very well and
 where
   is the bottleneck. And what can I do to improve the Hadoop's capability
  of
   dealing with small files.
  
   Thanks.
  
  
 


When the small file problem comes up most of the talk centers around
the inode table being in memory. The cloudera blog points out
something:

Furthermore, HDFS is not geared up to efficiently accessing small
files: it is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots of
hopping from datanode to datanode to retrieve each small file, all of
which is an inefficient data access pattern.

My application attempted to load 9000 6KB files using a single-threaded
application and FSOutputStream objects to write directly
to Hadoop files. My plan was to have Hadoop merge these files in the
next step. I had to abandon this plan because the process was taking
hours. I knew HDFS had a small file problem, but I never realized
that I could not approach this problem the 'old fashioned way'. I merged the
files locally, and uploading the few resulting files gave great throughput.
Small files are not just a permanent-storage issue; they are a serious
optimization concern.
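
One commonly suggested workaround, not taken from this thread but sketched here
under the assumption that the small files only need to be read back as a batch,
is to pack them into a single SequenceFile with the file name as key and the raw
bytes as value:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of small files into one SequenceFile on HDFS:
// key = original file name, value = raw file contents.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    File localDir = new File(args[0]);   // e.g. a local directory of small files
    Path target = new Path(args[1]);     // e.g. an HDFS path for the packed file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, target, Text.class, BytesWritable.class);
    try {
      File[] files = localDir.listFiles();
      if (files == null) {
        throw new IOException(localDir + " is not a readable directory");
      }
      for (File f : files) {
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          // Read the whole small file into memory.
          int off = 0;
          while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break;
            off += n;
          }
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}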


Re: All keys went to single reducer in WordCount program

2009-05-07 Thread jason hadoop
Most likely the 3rd mapper ran as a speculative execution, and it is
possible that all of your keys hashed to a single partition. Also, if you
don't specify otherwise, the default is to run a single reduce task.

From JobConf:

  /**
   * Get configured the number of reduce tasks for this job. Defaults to
   * <code>1</code>.
   *
   * @return the number of reduce tasks for this job.
   */
  public int getNumReduceTasks() { return getInt("mapred.reduce.tasks", 1); }
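
For reference, the stock HashPartitioner boils down to roughly the following (a
sketch mirroring, not replacing, the shipped class). With the default of one
reduce task every key lands in partition 0, so even/odd hash codes make no
difference until more reducers are requested:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Roughly what org.apache.hadoop.mapred.lib.HashPartitioner does:
// mask off the sign bit, then take the hash modulo the reducer count.
public class HashLikePartitioner<K, V> implements Partitioner<K, V> {
  public void configure(JobConf job) { }

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}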


On Thu, May 7, 2009 at 3:54 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

 with such a small data set who knows what will happen:  you are
 probably hitting minimal limits of some kind

 repeat this with more data

 Miles

 2009/5/7 Foss User foss...@gmail.com:
  I have two reducers running on two different machines. I ran the
  example word count program with some of my own System.out.println()
  statements to see what is going on.
 
  There were 2 slaves each running datanode as well as tasktracker.
  There was one namenode and one jobtracker. I know there is a very
  elaborate setup for such a small cluster but I did it only to learn.
 
  I gave two input files, a.txt and b.txt with a few lines of english
  text. Now, here are my questions.
 
  (1) I found that three mapper tasks ran, all in the first slave. The
  first task processed the first file. The second task processed the
  second file. The third task didn't process anything. Why is it that
  the third task did not process anything? Why was this task created in
  the first place?
 
  (2) I found only one reducer task, on the second slave. It processed
  all the values for keys. keys were words in this case of Text type. I
  tried printing out the key.hashCode() for each key and some of them
  were even and some of them were odd. I was expecting the keys with
  even hashcodes to go to one slave and the others to go to another
  slave. Why didn't this happen?
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread jason hadoop
The way I typically address that is to write a zip file using the zip
utilities, commonly for output.
HDFS is not optimized for low latency, but for high throughput on bulk
operations.

2009/5/7 Edward Capriolo edlinuxg...@gmail.com

 2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
  Hey,
 
  You can read more about why small files are difficult for HDFS at
  http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
 
  Regards,
  Jeff
 
  2009/5/7 Piotr Praczyk piotr.prac...@gmail.com
 
  If You want to use many small files, they are probably having the same
  purpose and struc?
  Why not use HBase instead of a raw HDFS ? Many small files would be
 packed
  together and the problem would disappear.
 
  cheers
  Piotr
 
  2009/5/7 Jonathan Cao jonath...@rockyou.com
 
   There are at least two design choices in Hadoop that have implications
  for
   your scenario.
   1. All the HDFS meta data is stored in name node memory -- the memory
  size
   is one limitation on how many small files you can have
  
   2. The efficiency of map/reduce paradigm dictates that each
  mapper/reducer
   job has enough work to offset the overhead of spawning the job.  It
  relies
   on each task reading contiguous chuck of data (typically 64MB), your
  small
   file situation will change those efficient sequential reads to larger
   number
   of inefficient random reads.
  
   Of course, small is a relative term?
  
   Jonathan
  
   2009/5/6 陈桂芬 chenguifen...@163.com
  
Hi:
   
In my application, there are many small files. But the hadoop is
  designed
to deal with many large files.
   
I want to know why hadoop doesn't support small files very well and
  where
is the bottleneck. And what can I do to improve the Hadoop's
 capability
   of
dealing with small files.
   
Thanks.
   
   
  
 
 
 When the small file problem comes up most of the talk centers around
 the inode table being in memory. The cloudera blog points out
 something:

 Furthermore, HDFS is not geared up to efficiently accessing small
 files: it is primarily designed for streaming access of large files.
 Reading through small files normally causes lots of seeks and lots of
 hopping from datanode to datanode to retrieve each small file, all of
 which is an inefficient data access pattern.

 My application attempted to load 9000 6Kb files using a single
 threaded application and the FSOutpustStream objects to write directly
 to hadoop files. My plan was to have hadoop merge these files in the
 next step. I had to abandon this plan because this process was taking
 hours. I knew HDFS had a small file problem but I never realized
 that I could not do this problem the 'old fashioned way'. I merged the
 files locally and uploading a few small files gave great throughput.
 Small files is not just a permanent storage issue it is a serious
 optimization.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Chris Collins
A couple of years back we did a lot of experimentation between Sun's
VM and JRockit.  We had initially assumed that JRockit was going to
scream since that's what the press were saying.  In short, what we
discovered was that certain JDK library usage was a little bit faster
with JRockit, but for core VM performance such as synchronization and
primitive operations the Sun VM outperformed it.  We were not taking
account of startup time, just raw code execution.  As I said, this was
a couple of years back so things may have changed.


C
On May 7, 2009, at 2:17 AM, Grace wrote:

I am running  the test on 0.18.1 and 0.19.1. Both versions have the  
same
issue with JRockit JVM.  It is for the example sort job, to sort 20G  
data on

1+2 nodes.

Following is the result(version 0.18.1). The sort job running with  
JRockit

JVM took 260 secs more than that with Sun JVM.
---
|| JVM  || Completion Time ||
---
|| JRockit  ||  786,315 msec||
|| Sun   ||  526,602 msec ||
---

Furthermore, under 0.19.1 version, I have set the reusing JVM  
parameter as

-1. It seems no improvement for JRockit JVM.

On Thu, May 7, 2009 at 4:32 PM, JQ Hadoop jq.had...@gmail.com wrote:

I believe Jrockit JVM have slightly higer startup time than the SUN  
JVM;

but
that should not make a lot of difference, especially if JVMs are  
reused in

0.19.

Which Hadoop version are you using?  What Hadoop job are you  
running? And

what performance do you get?

Thanks,
JQ

-Original Message-
From: Grace
Sent: Wednesday, May 06, 2009 1:07 PM
To: core-user@hadoop.apache.org
Subject: Is there any performance issue with Jrockit JVM for Hadoop

Hi all,
This is Grace.
I am replacing Sun JVM with Jrockit JVM for Hadoop. Also I keep all  
the

same
Java options and configuration as Sun JVM.  However it is very  
strange that
the performance using Jrockit JVM is poorer than the one using Sun,  
such as

the map stage became slower.
Has anyone encountered the similar problem? Could you please give  
some

advise about it? Thanks a lot.

Regards,
Grace





Re: Large number of map output keys and performance issues.

2009-05-07 Thread jason hadoop
It may simply be that your JVMs are spending their time doing garbage
collection instead of running your tasks.
My book, in chapter 6, has a section on how to tune your jobs and how to
determine what to tune. That chapter is available now as an alpha.
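
As one starting point for that kind of tuning (a sketch, not a recommendation;
the heap size and class name are placeholders), the child JVM options can be set
per job so GC activity becomes visible:

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ChildHeapTuning.class);
    // Give each map/reduce child JVM more heap and log GC activity, so you can
    // see whether collection time is the real bottleneck.
    conf.set("mapred.child.java.opts", "-Xmx512m -verbose:gc");
    System.out.println(conf.get("mapred.child.java.opts"));
  }
}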

On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Tiago,

 Here are a couple thoughts:

 1) How much data are you outputting? Obviously there is a certain amount of
 IO involved in actually outputting data versus not ;-)

 2) Are you using a reduce phase in this job? If so, since you're cutting
 off
 the data at map output time, you're also avoiding a whole sort computation
 which involves significant network IO, etc.

 3) What version of Hadoop are you running?

 Thanks
 -Todd

 On Wed, May 6, 2009 at 12:23 PM, Tiago Macambira macamb...@gmail.com
 wrote:

  I am developing a MR application w/ hadoop that is generating during it's
  map phase a really large number of output keys and it is having an
 abysmal
  performance.
 
  While just reading the said data takes 20 minutes and processing it but
 not
  outputting anything from the map takes around 30 min, running the full
  application takes around 4 hours. Is this a known or expected issue?
 
  Cheers.
  Tiago Alves Macambira
  --
  I may be drunk, but in the morning I will be sober, while you will
  still be stupid and ugly. -Winston Churchill
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Using multiple FileSystems in hadoop input

2009-05-07 Thread jason hadoop
I have used multiple file systems in jobs, but not used Har as one of them.
It worked for me in 0.18.
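
For what it's worth, mixing filesystems is just a matter of adding fully
qualified Paths; a hedged sketch below, where the hostnames, ports, and the
exact har:// URI form are illustrative and should be checked against the
archive documentation:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MixedInputs {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MixedInputs.class);
    // A plain HDFS directory...
    FileInputFormat.addInputPath(conf,
        new Path("hdfs://namenode:9000/user/ivan/plain-input"));
    // ...and a path inside a Hadoop archive (illustrative URI; verify the
    // exact har:// form for your namenode and archive layout).
    FileInputFormat.addInputPath(conf,
        new Path("har://hdfs-namenode:9000/user/ivan/archive.har/input"));
  }
}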

On Wed, May 6, 2009 at 4:07 AM, Tom White t...@cloudera.com wrote:

 Hi Ivan,

 I haven't tried this combination, but I think it should work. If it
 doesn't it should be treated as a bug.

 Tom

 On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net
 wrote:
  Greetings to all,
 
  Could anyone suggest if Paths from different FileSystems can be used as
  input of Hadoop job?
 
  Particularly I'd like to find out whether Paths from HarFileSystem can be
  mixed with ones from DistributedFileSystem.
 
  Thanks,
 
 
  --
  Kind regards,
  Ivan
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Steve Loughran

Chris Collins wrote:
a couple of years back we did a lot of experimentation between sun's vm 
and jrocket.  We had initially assumed that jrocket was going to scream 
since thats what the press were saying.  In short, what we discovered 
was that certain jdk library usage was a little bit faster with jrocket, 
but for core vm performance such as synchronization, primitive 
operations the sun vm out performed.  We were not taking account of 
startup time, just raw code execution.  As I said, this was a couple of 
years back so things may of changed.


C


I run JRockit as it's what some of our key customers use, and we need to
test things. One lovely feature is that tests time out before the stack runs
out on a recursive operation; clearly different stack management is at
work. Another: no PermGen heap space to fiddle with.


* I have to turn debug logging off in Hadoop test runs, or there are
problems.


* It uses short pointers (32 bits long) for near memory on a 64 bit JVM. 
So your memory footprint on sub-4GB VM images is better. Java7 promises 
this, and with the merger, who knows what we will see. This is 
unimportant  on 32-bit boxes


* Debug single stepping doesn't work. That's OK, I use functional tests
instead :)


I haven't looked at outright performance.



Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Foss User
On Thu, May 7, 2009 at 8:51 PM, jason hadoop jason.had...@gmail.com wrote:
 Most likely the 3rd mapper ran as a speculative execution, and it is
 possible that all of your keys hashed to a single partition. Also, if you
 don't specify the default is to run a single reduce task.

As I mentioned in my first mail, I tried printing out the hashCode()
for the keys myself in a manner like this:

System.out.println(key.hashCode());

key was of type Text. Some of the hash codes were even and some were
odd. So, I was expecting the odd ones to go to one slave and the even
ones to go to the other. Is my expectation correct? Could you now throw
some light on what else might have happened?


 From JobConf:

   /**
    * Get configured the number of reduce tasks for this job. Defaults to
    * <code>1</code>.
    *
    * @return the number of reduce tasks for this job.
    */
   public int getNumReduceTasks() { return getInt("mapred.reduce.tasks", 1); }


I configured mapred.reduce.tasks as 2. I did this configuration in
hadoop-site.xml of the job-tracker. Is this fine?
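
One hedged suggestion, not from the thread: setting the value on the job
configuration itself removes any doubt about which hadoop-site.xml wins, e.g.:

import org.apache.hadoop.mapred.JobConf;

public class TwoReducers {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TwoReducers.class);
    // Request two reduce tasks for this particular job, overriding whatever
    // mapred.reduce.tasks is set to in the cluster configuration files.
    conf.setNumReduceTasks(2);
    System.out.println(conf.getNumReduceTasks());
  }
}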


Is /room1 in the rack name /room1/rack1 significant during replication?

2009-05-07 Thread Foss User
I have written a rack awareness script which maps the IP addresses to
rack names in this way.

10.31.1.* - /room1/rack1
10.31.2.* - /room1/rack2
10.31.3.* - /room1/rack3
10.31.100.* - /room2/rack1
10.31.200.* - /room2/rack2
10.31.200.* - /room2/rack3

I understand that DFS will try to have replication of data in such a
manner that even if /room1/rack1 goes down, the data is still
available in other racks. I want to understand whether the hierarchy
of racks (like rack1 is in room1 here) is given any importance.

What I mean is, in addition to taking care that the data is unaffected
if /room1/rack1 goes down, will it also try to take care that almost
all data is replicated in the racks within /room2, so that if /room1
goes down as a whole (say there is a power cut in room1), we still
have all the data in racks of /room2?


Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Philip Zeyliger
It's over TCP/IP, in a custom protocol.  See DataXceiver.java.  My sense is
that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
for large messages.

-- Philip

On Thu, May 7, 2009 at 9:11 AM, Foss User foss...@gmail.com wrote:

 I understand that the blocks are transferred between various nodes
 using HDFS protocol. I believe, even the job classes are distributed
 as files using the same HDFS protocol.

 Is this protocol written over TCP/IP from scratch or this is a
 protocol that works on top of some other protocol like HTTP, etc.?



Re: PIG and Hive

2009-05-07 Thread Scott Carey
The work was done 3 months ago, and the exact query I used may not have been 
the below - it was functionally the same - two sources,  arithmetic aggregation 
on each inner-joined by a small set of values.  We wrote a hand-coded map 
reduce, a Pig script, and Hive against the same data and performance tested.

At that time, even SELECT count(a.z) FROM a group by a.z took 3 phases (not 
sure how many were fetch versus M/R).  Since then, we abandoned Hive for 
reassessment at a later date.  All releases of Hive since then 
http://hadoop.apache.org/hive/docs/r0.3.0/changes.html don't have anything 
under optimizations and few of the enhancements listed suggest that there has 
been much change on the performance front (yet).

Can Hive not yet detect an implicit inner join in a WHERE clause?

Our use case would have less optimization-savvy people querying data ad-hoc, so 
being able to detect implicit joins and collapse subselects, etc is a 
requirement.  I'm not going to go sitting over the shoulder of everyone who 
wants to do some ad-hoc data analysis and tell them how to re-write their 
queries to perform better.
That is a big weakness of SQL that affects everything that uses it - there are 
so many equivalent or near-equivalent forms of expression that often lead to 
implementation specific performance preferences.

I'm sure Hive will get over that hump but it takes time.  I'm certainly 
interested in it and will have a deeper look again in the second half of this 
year.

On 5/7/09 10:12 AM, Namit Jain nj...@facebook.com wrote:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

If you do an explain on the above query, you will see that you are performing a 
Cartesian product followed by the filter.

It would be better to rewrite the query as:


SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON( a.x = b.x and a.y = 
b.y)
group by a.x, a.y;

The explain should have 2 map-reduce jobs and a fetch task (which is not a 
map-reduce job).
Can you send me the exact Hive query that you are trying along with the schema 
of tables 'a' and 'b'.

In order to see the plan, you can do:

Explain
<QUERY>



Thanks,
-namit



-- Forwarded Message
From: Ricky Ho r...@adobe.com
Reply-To: core-user@hadoop.apache.org
Date: Wed, 6 May 2009 21:11:43 -0700
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an algorithm 
design / prototyping tool) to express my parallel algorithms in a concise way.  
The deeper I look into these, the stronger my feeling that PIG and HIVE are 
competitors rather than complementing each other.  I think a large set of 
problems can be done either way, without much difference in terms of 
skillset requirements.

At this moment, I am focused on the richness of the language model rather than 
the implementation optimization.  Supporting collections as well as the 
flatten operation in the language model seems to make PIG more powerful.  Yes, 
you can achieve the same thing in Hive but then it starts to look odd.  Am I 
missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many
fewer map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and
in the future they won't differ much.  But for now, Pig optimizes much
better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5
map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to
make that one pass, but those sort of performance optimizations aren't there
yet.  That is expected, it is a younger project.

It would be useful if more of these higher level tools shared work on the
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?)
could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, Olga Natkovich ol...@yahoo-inc.com wrote:

 Hi Ricky,

 This is how the code will look in Pig.

 A = load 'textdoc' using TextLoader() as (sentence: chararray);
 B = foreach A generate flatten(TOKENIZE(sentence)) as word;
 C = group B by word;
 D = foreach C generate group, COUNT(B);
 store D into 'wordcount';

 Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
 explains how the example above works.

 Let me know if you have further questions.

 Olga


 -Original Message-
 From: Ricky Ho [mailto:r...@adobe.com]
 Sent: Wednesday, May 06, 2009 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: 

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
Scott,

Namit is actually correct. If you do an explain on the query that he sent out, 
you actually get only 2 map/reduce jobs with Hive, not 5. We have verified 
that, and it is consistent with what we should expect in this case. We would 
be very interested to know the exact query that you used, as 5 map/reduce jobs 
is somewhat of a surprise to us.

Ricky,

Without SQL - and at least PIG does not have that now - it is really not usable 
for people like data analysts at this time - people who have been brought up on 
SQL and do not necessarily have the skill set for learning another imperative 
programming language. PIG appeals more to engineering users - our approach 
has been different, though, even in this respect. We have followed a philosophy 
of allowing even engineering users to write their custom code in an imperative 
programming language of their choice and be able to plug that customized 
logic into different parts of the data flow. Again, this idea may appeal to some 
and may not appeal to others, and it is really a subjective call when it comes 
to engineering users when you think from the language perspective.

Regarding collect and flatten, these have been on the Hive roadmap for quite 
some time (just as SQL has been on the Pig roadmap :)) and we will put them 
into the language in some future release.

Ashish


-Original Message-
From: Namit Jain [mailto:nj...@facebook.com] 
Sent: Thursday, May 07, 2009 10:12 AM
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y 
group by x, y.

If you do an explain on the above query, you will see that you are performing a 
Cartesian product followed by the filter.

It would be better to rewrite the query as:


SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON( a.x = b.x and a.y = 
b.y) group by a.x, a.y;

The explain should have 2 map-reduce jobs and a fetch task (which is not a 
map-reduce job).
Can you send me the exact Hive query that you are trying along with the schema 
of tables 'a' and 'b'.

In order to see the plan, you can do:

Explain
<QUERY>



Thanks,
-namit



-- Forwarded Message
From: Ricky Ho r...@adobe.com
Reply-To: core-user@hadoop.apache.org
Date: Wed, 6 May 2009 21:11:43 -0700
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an algorithm 
design / prototyping tool) to express my parallel algorithms in a concise way.  
The deeper I look into these, the stronger my feeling that PIG and HIVE are 
competitors rather than complementing each other.  I think a large set of 
problems can be done either way, without much difference in terms of 
skillset requirements.

At this moment, I am focused on the richness of the language model rather than 
the implementation optimization.  Supporting collections as well as the 
flatten operation in the language model seems to make PIG more powerful.  Yes, 
you can achieve the same thing in Hive but then it starts to look odd.  Am I 
missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many fewer 
map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and in 
the future they won't differ much.  But for now, Pig optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y 
group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5 map 
reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to make 
that one pass, but those sort of performance optimizations aren't there yet.  
That is expected, it is a younger project.

It would be useful if more of these higher level tools shared work on the 
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?) 
could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, Olga Natkovich ol...@yahoo-inc.com wrote:

 Hi Ricky,

 This is how the code will look in Pig.

 A = load 'textdoc' using TextLoader() as (sentence: chararray); B = 
 foreach A generate flatten(TOKENIZE(sentence)) as word; C = group B by 
 word; D = foreach C generate group, COUNT(B); store D into 
 'wordcount';

 Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
 explains how the example above works.

 Let me know if you have further questions.

 Olga


 -Original Message-
 From: Ricky Ho [mailto:r...@adobe.com]
 Sent: Wednesday, May 06, 2009 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: RE: PIG and Hive

 Thanks Amr,

RE: PIG and Hive

2009-05-07 Thread Ashish Thusoo
OK, that explains a lot. When we started off Hive, our immediate use case 
was to do group-bys on data with a lot of skew on the grouping keys. In that 
scenario it is better to do this in 2 map/reduce jobs, using the first one to 
randomly distribute data and generate the partial sums, followed by another 
one that does the complete sums. This was originally the default plan in Hive. 
Since then we have moved the default to just using a single map/reduce job and 
using

hive.exec.skeweddata = true as a parameter to trigger the older behavior.

We already collapse subselects. We already do predicate pushdown and column 
pruning. We don't yet do subexpression elimination but that will happen soon. 
Implicit detection of an inner join is possible though we never had a JIRA 
asking for it. Will open one soon...

I am sure you will not be disappointed by the capabilities of the system when 
you try it again.. Feel free to mail hive-us...@hadoop.apache.org for any 
clarifications/help/optimization questions.

Cheers,
Ashish

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com] 
Sent: Thursday, May 07, 2009 11:08 AM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

The work was done 3 months ago, and the exact query I used may not have been 
the below - it was functionally the same - two sources,  arithmetic aggregation 
on each inner-joined by a small set of values.  We wrote a hand-coded map 
reduce, a Pig script, and Hive against the same data and performance tested.

At that time, even SELECT count(a.z) FROM a group by a.z took 3 phases (not 
sure how many were fetch versus M/R).  Since then, we abandoned Hive for 
reassessment at a later date.  All releases of Hive since then 
http://hadoop.apache.org/hive/docs/r0.3.0/changes.html don't have anything 
under optimizations and few of the enhancements listed suggest that there has 
been much change on the performance front (yet).

Can Hive not yet detect an implicit inner join in a WHERE clause?

Our use case would have less optimization-savvy people querying data ad-hoc, so 
being able to detect implicit joins and collapse subselects, etc is a 
requirement.  I'm not going to go sitting over the shoulder of everyone who 
wants to do some ad-hoc data analysis and tell them how to re-write their 
queries to perform better.
That is a big weakness of SQL that affects everything that uses it - there are 
so many equivalent or near-equivalent forms of expression that often lead to 
implementation specific performance preferences.

I'm sure Hive will get over that hump but it takes time.  I'm certainly 
interested in it and will have a deeper look again in the second half of this 
year.

On 5/7/09 10:12 AM, Namit Jain nj...@facebook.com wrote:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y 
group by x, y.

If you do a explain on the above query, you will see that you are performing a 
Cartesian product followed by the filter.

It would be better to rewrite the query as:


SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON( a.x = b.x and a.y = 
b.y) group by a.x, a.y;

The explain should have 2 map-reduce jobs and a fetch task (which is not a 
map-reduce job).
Can you send me the exact Hive query that you are trying along with the schema 
of tables 'a' and 'b'.

In order to see the plan, you can do:

Explain
QUERY



Thanks,
-namit



-- Forwarded Message
From: Ricky Ho r...@adobe.com
Reply-To: core-user@hadoop.apache.org
Date: Wed, 6 May 2009 21:11:43 -0700
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an algorithm 
design / prototyping tool) to express my parallel algorithms in a concise way.  
The deeper I look into these, the stronger my feeling that PIG and HIVE are 
competitors rather than complementing each other.  I think a large set of 
problems can be done either way, without much difference in terms of 
skillset requirements.

At this moment, I am focused on the richness of the language model rather than 
the implementation optimization.  Supporting collections as well as the 
flatten operation in the language model seems to make PIG more powerful.  Yes, 
you can achieve the same thing in Hive but then it starts to look odd.  Am I 
missing something, Hive folks?

Rgds,
Ricky

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many fewer 
map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and in 
the future they won't differ much.  But for now, Pig optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and 

.gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
Problem:

I am comparing two jobs.  They both have the same input content; however,
in one job the input file has been gzipped, and in the other it has not.
I get far fewer output rows in the gzipped result than I do in the
uncompressed version:

 

Lines in output:

Gzipped: 86851

Uncompressed: 6569303

 

The gzipped input file is 875MB in size, and the entire job runs in
about 30 seconds.  The uncompressed file takes around 5 minutes to run.

 

Hadoop version:

0.18.1, r694836

 

Here is the output of the map task of the compressed input:

2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=

2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 12

2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask:
io.sort.mb = 100

2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data
buffer = 79691776/99614720

2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record
buffer = 262144/327680

2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library

2009-05-07 14:54:54,005 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded &
initialized native-zlib library

2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting
flush of map output

2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart
= 0; bufend = 45410962; bufvoid = 99614720

2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart =
0; kvend = 87923; length = 327680

2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index:
(0, 3786199, 3786199)

2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index:
(3786199, 3789579, 3789579)

2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index:
(7575778, 3859183, 3859183)

2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index:
(11434961, 3792449, 3792449)

2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index:
(15227410, 3818963, 3818963)

2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index:
(19046373, 3780875, 3780875)

2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index:
(22827248, 3814950, 3814950)

2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index:
(26642198, 3871426, 3871426)

2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index:
(30513624, 3799971, 3799971)

2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index:
(34313595, 3813327, 3813327)

2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index:
(38126922, 3835208, 3835208)

2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index:
(41962130, 3747048, 3747048)

2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished
spill 0

2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner:
attempt_200905071451_0001_m_00_0: No outputs to promote from
hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/
_temporary/_attempt_200905071451_0001_m_00_0

2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_200905071451_0001_m_00_0' done.

 

 

Am I doing something wrong?  Is there anything else I can do to debug
this?  Is it a known bug?

 

Let me know if you need anything else, thanks.



Re: .gz input files having less output than uncompressed version

2009-05-07 Thread tim robertson
Hi,

What input format are you using for the GZipped file?

I don't believe there is a GZip input format although some people have
 discussed whether it is feasible...

Cheers

Tim

On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka
mmata...@millennialmedia.com wrote:
 Problem:

 I am comparing two jobs.  They both have the same input content, but
 in one job the input file has been gzipped, and in the other it has not.
 I get far fewer output rows in the gzipped result than in the
 uncompressed version:



 Lines in output:

 Gzipped: 86851

 Uncompressed: 6569303



 The gzipped input file is 875MB in size, and the entire job runs in
 about 30 seconds.  The uncompressed file takes around 5 minutes to run.



 Hadoop version:

 0.18.1, r694836



 Here is the output of the map task of the compressed input:

 2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=MAP, sessionId=

 2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask:
 numReduceTasks: 12

 2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask:
 io.sort.mb = 100

 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data
 buffer = 79691776/99614720

 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record
 buffer = 262144/327680

 2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader:
 Loaded the native-hadoop library

 2009-05-07 14:54:54,005 INFO
 org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

 2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting
 flush of map output

 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart
 = 0; bufend = 45410962; bufvoid = 99614720

 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart =
 0; kvend = 87923; length = 327680

 2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index:
 (0, 3786199, 3786199)

 2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index:
 (3786199, 3789579, 3789579)

 2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index:
 (7575778, 3859183, 3859183)

 2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index:
 (11434961, 3792449, 3792449)

 2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index:
 (15227410, 3818963, 3818963)

 2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index:
 (19046373, 3780875, 3780875)

 2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index:
 (22827248, 3814950, 3814950)

 2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index:
 (26642198, 3871426, 3871426)

 2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index:
 (30513624, 3799971, 3799971)

 2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index:
 (34313595, 3813327, 3813327)

 2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index:
 (38126922, 3835208, 3835208)

 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index:
 (41962130, 3747048, 3747048)

 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished
 spill 0

 2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner:
 attempt_200905071451_0001_m_00_0: No outputs to promote from
 hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/
 _temporary/_attempt_200905071451_0001_m_00_0

 2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task
 'attempt_200905071451_0001_m_00_0' done.





 Am I doing something wrong?  Is there anything else I can do to debug
 this?  Is it a known bug?



 Let me know if you need anything else, thanks.




Re: Is it possible to sort intermediate values and final values?

2009-05-07 Thread Foss User
On Thu, May 7, 2009 at 3:10 AM, Owen O'Malley omal...@apache.org wrote:

 On May 6, 2009, at 12:15 PM, Foss User wrote:

 Is it possible to sort the intermediate values for each key before
 the (key, list of values) pair reaches the reducer?

 Look at the example SecondarySort.

Where can I find this example? I was not able to find it in the
src/examples directory.


RE: .gz input files having less output than uncompressed version

2009-05-07 Thread Malcolm Matalka
This is the result of running gzip on the input files.  There appears to be 
some gzip support, for two reasons:

1) I do get some output in my results.  There are 86851 lines in my output 
file, and they are valid results.

2) In the job task output I pasted, it states 
"org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & 
initialized native-zlib library", suggesting it has determined which compression 
codec to use.
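
For what it's worth, one way to confirm which codec (if any) the input format
will pick for a given file is to ask CompressionCodecFactory directly. A minimal
sketch against the 0.18-era API (the path below is only a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path; substitute the actual gzipped input file.
        Path input = new Path("/user/hadoop/input/data.gz");
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
        // The line record reader does the same lookup; a null codec means the file
        // would be read as plain (uncompressed) text.
        System.out.println(input + " -> "
            + (codec == null ? "no codec" : codec.getClass().getName()));
    }
}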


-Original Message-
From: tim robertson [mailto:timrobertson...@gmail.com] 
Sent: Thursday, May 07, 2009 15:29
To: core-user@hadoop.apache.org
Subject: Re: .gz input files having less output than uncompressed version

Hi,

What input format are you using for the GZipped file?

I don't believe there is a GZip input format although some people have
 discussed whether it is feasible...

Cheers

Tim

On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka
mmata...@millennialmedia.com wrote:
 Problem:

 I am comparing two jobs.  They both have the same input content, but
 in one job the input file has been gzipped, and in the other it has not.
 I get far fewer output rows in the gzipped result than in the
 uncompressed version:



 Lines in output:

 Gzipped: 86851

 Uncompressed: 6569303



 The gzipped input file is 875MB in size, and the entire job runs in
 about 30 seconds.  The uncompressed file takes around 5 minutes to run.



 Hadoop version:

 0.18.1, r694836



 Here is the output of the map task of the compressed input:

 2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=MAP, sessionId=

 2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask:
 numReduceTasks: 12

 2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask:
 io.sort.mb = 100

 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data
 buffer = 79691776/99614720

 2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record
 buffer = 262144/327680

 2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader:
 Loaded the native-hadoop library

 2009-05-07 14:54:54,005 INFO
 org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

 2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting
 flush of map output

 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart
 = 0; bufend = 45410962; bufvoid = 99614720

 2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart =
 0; kvend = 87923; length = 327680

 2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index:
 (0, 3786199, 3786199)

 2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index:
 (3786199, 3789579, 3789579)

 2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index:
 (7575778, 3859183, 3859183)

 2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index:
 (11434961, 3792449, 3792449)

 2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index:
 (15227410, 3818963, 3818963)

 2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index:
 (19046373, 3780875, 3780875)

 2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index:
 (22827248, 3814950, 3814950)

 2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index:
 (26642198, 3871426, 3871426)

 2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index:
 (30513624, 3799971, 3799971)

 2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index:
 (34313595, 3813327, 3813327)

 2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index:
 (38126922, 3835208, 3835208)

 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index:
 (41962130, 3747048, 3747048)

 2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished
 spill 0

 2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner:
 attempt_200905071451_0001_m_00_0: No outputs to promote from
 hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/
 _temporary/_attempt_200905071451_0001_m_00_0

 2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task
 'attempt_200905071451_0001_m_00_0' done.





 Am I doing something wrong?  Is there anything else I can do to debug
 this?  Is it a known bug?



 Let me know if you need anything else, thanks.




Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Raghu Angadi



Philip Zeyliger wrote:

It's over TCP/IP, in a custom protocol.  See DataXceiver.java.  My sense is
that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
for large messages.


yes, and job classes are not distributed using this. It is a very simple 
protocol used to read and write raw data to DataNodes.



-- Philip

On Thu, May 7, 2009 at 9:11 AM, Foss User foss...@gmail.com wrote:


I understand that the blocks are transferred between various nodes
using HDFS protocol. I believe, even the job classes are distributed
as files using the same HDFS protocol.

Is this protocol written over TCP/IP from scratch or this is a
protocol that works on top of some other protocol like HTTP, etc.?







Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Foss User
On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi rang...@yahoo-inc.com wrote:


 Philip Zeyliger wrote:

 It's over TCP/IP, in a custom protocol.  See DataXceiver.java.  My sense
 is
 that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
 for large messages.

 yes, and job classes are not distributed using this. It is a very simple
 protocol used to read and write raw data to DataNodes.

How are the job class files or jar files distributed then?


NullPointerException while trying to copy file

2009-05-07 Thread Foss User
I was trying to write a Java code to copy a file from local system to
a file system (which is also local file system). This is my code.

package in.fossist.examples;

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.FileUtil;

public class FileOps
{
public static void main(String[] args) throws IOException
{
FileUtil.copy(new File("a.txt"),
              new LocalFileSystem(),
              new Path("b.txt"),
              false,
              new Configuration());
}
}

This is the error:

ubuntu:/opt/hadoop-0.19.1# bin/hadoop jar fileops-0.1.jar
in.fossist.examples.FileOps
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:286)
at in.fossist.examples.FileOps.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Please help me to fix this error.


Re: NullPointerException while trying to copy file

2009-05-07 Thread Todd Lipcon
On Thu, May 7, 2009 at 1:26 PM, Foss User foss...@gmail.com wrote:

 I was trying to write a Java code to copy a file from local system to
 a file system (which is also local file system). This is my code.

 package in.fossist.examples;

 import java.io.File;
 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.fs.LocalFileSystem;
 import org.apache.hadoop.fs.FileUtil;

 public class FileOps
 {
public static void main(String[] args) throws IOException
{
 FileUtil.copy(new File("a.txt"),
  new LocalFileSystem(),


You can't create a FileSystem like this. You should do something like:

Path src = new Path("a.txt");
Path dst = new Path("b.txt");
Configuration conf = new Configuration();

FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst,
false, conf);


  new Path("b.txt"),
  false,
  new Configuration());
}
 }

 This is the error:

 ubuntu:/opt/hadoop-0.19.1# bin/hadoop jar fileops-0.1.jar
 in.fossist.examples.FileOps
 java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:367)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:286)
at in.fossist.examples.FileOps.main(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 Please help me to fix this error.



Re: Is it possible to sort intermediate values and final values?

2009-05-07 Thread Owen O'Malley


On May 7, 2009, at 12:38 PM, Foss User wrote:


Where can I find this example. I was not able to find it in the
src/examples directory.


It is in 0.20.

http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java
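
For reference while you track the example down, here is a minimal sketch of the
same idea on the old JobConf API. It is an illustration, not the shipped example:
the map output key is assumed to be a composite Text of the form
naturalKey + '\t' + secondaryValue, sorted on the whole string but partitioned
and grouped on the natural key only.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SecondarySortSketch {

    // Partition on the natural key only, so all secondary values of a key reach one reducer.
    public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {}
        public int getPartition(Text key, Text value, int numPartitions) {
            String natural = key.toString().split("\t", 2)[0];
            return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reduce input on the natural key only; the full composite key still decides sort order.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            String na = a.toString().split("\t", 2)[0];
            String nb = b.toString().split("\t", 2)[0];
            return na.compareTo(nb);
        }
    }

    public static void configure(JobConf conf) {
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setPartitionerClass(NaturalKeyPartitioner.class);
        conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
        // Text's default raw comparator already sorts the whole composite string,
        // which orders the secondary part within each natural key.
    }
}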

-- Owen


Re: NullPointerException while trying to copy file

2009-05-07 Thread Foss User
On Fri, May 8, 2009 at 1:59 AM, Todd Lipcon t...@cloudera.com wrote:
 On Thu, May 7, 2009 at 1:26 PM, Foss User foss...@gmail.com wrote:

 I was trying to write a Java code to copy a file from local system to
 a file system (which is also local file system). This is my code.

 package in.fossist.examples;

 import java.io.File;
 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.fs.LocalFileSystem;
 import org.apache.hadoop.fs.FileUtil;

 public class FileOps
 {
    public static void main(String[] args) throws IOException
    {
        FileUtil.copy(new File("a.txt"),
                      new LocalFileSystem(),


 You can't create a FileSystem like this. You should do something like:

 Path src = new Path("a.txt");
 Path dst = new Path("b.txt");
 Configuration conf = new Configuration();

 FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst,
 false, conf);

This does not work for me, as you are reading "a.txt" from the DFS
while I want to read "a.txt" from the local file system. Also, I
do not want to copy the file to the distributed file system. Instead I
want to copy it to the LocalFileSystem.


Re: NullPointerException while trying to copy file

2009-05-07 Thread Todd Lipcon
On Thu, May 7, 2009 at 1:47 PM, Foss User foss...@gmail.com wrote:


 This does not work for me as you are reading the a.txt from the DFS
 while I want to read the a.txt from the local file system. Also, I
 do not want to copy the file to the distributed file system. Instead I
 want to copy it to LocalFileSystem.


Only if your fs.default.name is HDFS. If you want the local file system,
use file:// URIs (e.g. file:///path/to/a.txt) for your Paths.
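
For example, a minimal sketch with placeholder paths; both sides resolve to the
local file system because of the file:// scheme:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class LocalCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // file:// URIs force the local file system regardless of fs.default.name.
        Path src = new Path("file:///tmp/a.txt");   // placeholder paths
        Path dst = new Path("file:///tmp/b.txt");
        FileUtil.copy(src.getFileSystem(conf), src,
                      dst.getFileSystem(conf), dst,
                      false /* deleteSource */, conf);
    }
}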

-Todd


Re: java.io.EOFException: while trying to read 65557 bytes

2009-05-07 Thread Raghu Angadi

Albert Sunwoo wrote:

Thanks for the info!

I was hoping to get some more specific information though. 


In short: we need more info.

There are typically 4 machines/processes involved in a write : the 
client and 3 datanodes writing the replicas. To see what really 
happened, you need to provide error message(s) for this block on these 
other parts (at least on 3 datanodes should be useful).


This particular error just implies this datanode is the 2nd of the 3 
datanodes (assuming replication of 3) in the write pipeline and its 
connection from the 1st datanode was closed. To deduce more we need more 
info... starting with what happened to that block on the first datanode.


Also, the 3rd datanode is 10.102.0.106, and the block you should grep for in 
the other logs is blk_-7056150840276493498, etc.


You should try to see what could be useful information for others to 
diagnose the problem... more than likely you will find the cause 
yourself in the process.


Raghu.


We are seeing these occur during every run, and as such it's not leaving some 
folks in our organization with a good feeling about the reliability of HDFS.
Do these occur as a result of resources being unavailable?  Perhaps the nodes 
are too busy and can no longer service reads from other nodes?  Or if the jobs 
are causing too much network traffic?  At first glance the machines do not 
seem to be pinned; however, I am wondering if sudden bursts of jobs can be 
causing these as well.  If so does anyone have configuration recommendations to 
minimize or remove these errors under any of these circumstances, or perhaps 
there is another explanation?

Thanks,
Albert

On 5/5/09 11:34 AM, Raghu Angadi rang...@yahoo-inc.com wrote:



This can happen for example when a client is killed when it has some
files open for write. In that case it is an expected error (the log
should really be at WARN or INFO level).

Raghu.

Albert Sunwoo wrote:

Hello Everyone,

I know there's been some chatter about this before, but I am seeing the errors 
below on just about every one of our nodes.  Is there a definitive reason why 
these are occurring? Is there something that we can do to prevent these?

2009-05-04 21:35:11,764 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(10.102.0.105:50010, 
storageID=DS-991582569-127.0.0.1-50010-1240886381606, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:619)

Followed by:
2009-05-04 21:35:20,891 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder blk_-7056150840276493498_10885 1 Exception 
java.io.InterruptedIOException: Interruped while waiting for IO on channel 
java.nio.channels.Socke
tChannel[connected local=/10.102.0.105:37293 remote=/10.102.0.106:50010]. 59756 
millis timeout left.
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:277)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
at java.lang.Thread.run(Thread.java:619)

Thanks,
Albert









Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread JQ Hadoop
There are a lot of tuning knobs for the JRockit JVM when it comes to
performance; those tuning can make a huge difference. I'm very interested if
there are some tuning tips for Hadoop.

Grace, what are the parameters that you used in your testing?

Thanks,
JQ

On Thu, May 7, 2009 at 11:35 PM, Steve Loughran ste...@apache.org wrote:

 Chris Collins wrote:

 a couple of years back we did a lot of experimentation between Sun's VM
 and JRockit.  We had initially assumed that JRockit was going to scream
 since that's what the press were saying.  In short, what we discovered was
 that certain JDK library usage was a little bit faster with JRockit, but for
 core VM performance such as synchronization and primitive operations the Sun VM
 outperformed it.  We were not taking account of startup time, just raw code
 execution.  As I said, this was a couple of years back so things may have
 changed.

 C


 I run JRockit as it's what some of our key customers use, and we need to
 test things. One lovely feature is tests time out before the stack runs out
 on a recursive operation; clearly different stack management at work.
 Another: no PermGenHeapSpace to fiddle with.

 * I have to turn debug logging off in hadoop test runs, or there are
 problems.

 * It uses short pointers (32 bits long) for near memory on a 64 bit JVM. So
 your memory footprint on sub-4GB VM images is better. Java7 promises this,
 and with the merger, who knows what we will see. This is unimportant  on
 32-bit boxes

 * debug single stepping doesn't work. That's ok, I use functional tests
 instead :)

 I haven't looked at outright performance.

 /



HDFS to S3 copy problems

2009-05-07 Thread Ken Krugler

Hi all,

I have a few large files (4 that are 1.8GB+) I'm trying to copy from 
HDFS to S3. My micro EC2 cluster is running Hadoop 0.19.1, and has 
one master/two slaves.


I first tried using the hadoop fs -cp command, as in:

hadoop fs -cp output/dir/ s3n://bucket/dir/

This seemed to be working, as I could watch the network traffic spike, 
and temp files were being created in S3 (as seen with CyberDuck).


But then it seemed to hang. Nothing happened for 30 minutes, so I 
killed the command.


Then I tried using the hadoop distcp command, as in:

hadoop distcp hdfs://host:50001/path/dir/ s3://<public key>:<private key>@bucket/dir2/


This failed, because my secret key has a '/' in it 
(http://issues.apache.org/jira/browse/HADOOP-3733)


Then I tried using hadoop distcp with the s3n URI syntax:

hadoop distcp hdfs://host:50001/path/dir/ s3n://bucket/dir2/

Similar to my first attempt, it seemed to work. Lots of network 
activity, temp files being created, and in the terminal I got:


09/05/07 18:36:11 INFO mapred.JobClient: Running job: job_200905071339_0004
09/05/07 18:36:12 INFO mapred.JobClient:  map 0% reduce 0%
09/05/07 18:36:30 INFO mapred.JobClient:  map 9% reduce 0%
09/05/07 18:36:35 INFO mapred.JobClient:  map 14% reduce 0%
09/05/07 18:36:38 INFO mapred.JobClient:  map 20% reduce 0%

But again it hung. No network traffic, and eventually it dumped out:

09/05/07 18:52:34 INFO mapred.JobClient: Task Id : 
attempt_200905071339_0004_m_01_0, Status : FAILED
Task attempt_200905071339_0004_m_01_0 failed to report status for 
601 seconds. Killing!
09/05/07 18:53:02 INFO mapred.JobClient: Task Id : 
attempt_200905071339_0004_m_04_0, Status : FAILED
Task attempt_200905071339_0004_m_04_0 failed to report status for 
602 seconds. Killing!
09/05/07 18:53:06 INFO mapred.JobClient: Task Id : 
attempt_200905071339_0004_m_02_0, Status : FAILED
Task attempt_200905071339_0004_m_02_0 failed to report status for 
602 seconds. Killing!
09/05/07 18:53:09 INFO mapred.JobClient: Task Id : 
attempt_200905071339_0004_m_03_0, Status : FAILED
Task attempt_200905071339_0004_m_03_0 failed to report status for 
601 seconds. Killing!


In the task GUI, I can see the same tasks failing, and being 
restarted. But the restarted tasks seem to be just hanging w/o doing 
anything.


Eventually one of the tasks made a bit more progress, but then it 
finally died with:


Copy failed: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:647)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)

So - any thoughts on what's going wrong?

Thanks,

-- Ken
--
Ken Krugler
+1 530-210-6378


RE: On usig Eclipse IDE

2009-05-07 Thread georgep

Hi Asseem, 

Thank you, but after fs.trash.interval I see something else.  Maybe my
version is not correct.  What is your Eclipse Europa version?

George



Puri, Aseem wrote:
 
 George,
   In my Eclipse Europa it is showing the attribute
 hadoop.job.ugi. It is after the fs.trash.interval.
 
  Thanks & Regards
 Aseem Puri
 
 
 
 -Original Message-
 From: George Pang [mailto:p09...@gmail.com] 
 Sent: Wednesday, May 06, 2009 1:07 PM
 To: core-user@hadoop.apache.org; gene...@hadoop.apache.org
 Subject: On usig Eclipse IDE
 
 Dear Users,
 
 I configure Eclipse Europa according to Yahoo tutorial on hadoop:
 http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html
 
 and in the instruction it goes about creating new DFS Location:
 
  Next, click on the Advanced tab. There are two settings here which
  must be changed.
 
 Scroll down to hadoop.job.ugi. It contains your current Windows login
  credentials. Highlight the first comma-separated value in this list
  (your username) and replace it with hadoop-user.
 
  I can't find this attribute (hadoop.job.ugi) in the Advanced list from
  Define Hadoop location in Eclipse. Do you have an idea?
 
 

-- 
View this message in context: 
http://www.nabble.com/On-usig-Eclipse-IDE-tp23401529p23437613.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: HDFS to S3 copy problems

2009-05-07 Thread Andrew Hitchcock
Hi Ken,

S3N doesn't work that well with large files. When uploading a file to
S3, S3N saves it to local disk during write() and then uploads to S3
during the close(). Close can take a long time for large files and it
doesn't report progress, so the call can time out.

As a work around, I'd recommend either increasing the timeout or
uploading the files by hand. Since you only have a few large files,
you might want to copy the files to local disk and then use something
like s3cmd to upload them to S3.
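
If you go the timeout route, the knob is the per-task liveness timeout. A
minimal sketch, assuming the pre-0.20 property name mapred.task.timeout (DistCp
runs through ToolRunner, so the same value can also be supplied with -D on the
command line):

import org.apache.hadoop.mapred.JobConf;

public class TaskTimeout {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Milliseconds before a task that reports no progress is killed; the default of
        // 600000 matches the "failed to report status for 60x seconds" messages above.
        // A value of 0 disables the timeout entirely.
        conf.setLong("mapred.task.timeout", 3 * 60 * 60 * 1000L);
        System.out.println("mapred.task.timeout = " + conf.get("mapred.task.timeout"));
    }
}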

Regards,
Andrew

On Thu, May 7, 2009 at 4:42 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
 Hi all,

 I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to
 S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two
 slaves.

 I first tried using the hadoop fs -cp command, as in:

 hadoop fs -cp output/dir/ s3n://bucket/dir/

 This seemed to be working, as I could watch the network traffic spike, and
 temp files were being created in S3 (as seen with CyberDuck).

 But then it seemed to hang. Nothing happened for 30 minutes, so I killed the
 command.

 Then I tried using the hadoop distcp command, as in:

 hadoop distcp hdfs://host:50001/path/dir/ s3://<public key>:<private key>@bucket/dir2/

 This failed, because my secret key has a '/' in it
 (http://issues.apache.org/jira/browse/HADOOP-3733)

 Then I tried using hadoop distcp with the s3n URI syntax:

 hadoop distcp hdfs://host:50001/path/dir/ s3n://bucket/dir2/

 Similar to my first attempt, it seemed to work. Lots of network activity,
 temp files being created, and in the terminal I got:

 09/05/07 18:36:11 INFO mapred.JobClient: Running job: job_200905071339_0004
 09/05/07 18:36:12 INFO mapred.JobClient:  map 0% reduce 0%
 09/05/07 18:36:30 INFO mapred.JobClient:  map 9% reduce 0%
 09/05/07 18:36:35 INFO mapred.JobClient:  map 14% reduce 0%
 09/05/07 18:36:38 INFO mapred.JobClient:  map 20% reduce 0%

 But again it hung. No network traffic, and eventually it dumped out:

 09/05/07 18:52:34 INFO mapred.JobClient: Task Id :
 attempt_200905071339_0004_m_01_0, Status : FAILED
 Task attempt_200905071339_0004_m_01_0 failed to report status for 601
 seconds. Killing!
 09/05/07 18:53:02 INFO mapred.JobClient: Task Id :
 attempt_200905071339_0004_m_04_0, Status : FAILED
 Task attempt_200905071339_0004_m_04_0 failed to report status for 602
 seconds. Killing!
 09/05/07 18:53:06 INFO mapred.JobClient: Task Id :
 attempt_200905071339_0004_m_02_0, Status : FAILED
 Task attempt_200905071339_0004_m_02_0 failed to report status for 602
 seconds. Killing!
 09/05/07 18:53:09 INFO mapred.JobClient: Task Id :
 attempt_200905071339_0004_m_03_0, Status : FAILED
 Task attempt_200905071339_0004_m_03_0 failed to report status for 601
 seconds. Killing!

 In the task GUI, I can see the same tasks failing, and being restarted. But
 the restarted tasks seem to be just hanging w/o doing anything.

 Eventually one of the tasks made a bit more progress, but then it finally
 died with:

 Copy failed: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.hadoop.tools.DistCp.copy(DistCp.java:647)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)

 So - any thoughts on what's going wrong?

 Thanks,

 -- Ken
 --
 Ken Krugler
 +1 530-210-6378



Re: Is HDFS protocol written from scratch?

2009-05-07 Thread Philip Zeyliger
On Thu, May 7, 2009 at 1:04 PM, Foss User foss...@gmail.com wrote:

 On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi rang...@yahoo-inc.com
 wrote:
 
 
  Philip Zeyliger wrote:
 
  It's over TCP/IP, in a custom protocol.  See DataXceiver.java.  My sense
  is
  that it's a custom protocol because Hadoop's IPC mechanism isn't
 optimized
  for large messages.
 
  yes, and job classes are not distributed using this. It is a very simple
  protocol used to read and write raw data to DataNodes.

 How are the job class files or jar files distributed then?


I believe that the JobClient does write the job's files to HDFS (with
mapred.submit.replication replication factor) as part of job submission, and
writing to HDFS does use this interface.  (It also triggers other uses of
this interface: the data nodes stream copies of blocks to each other, I
believe)  What may be confusing is that the job configuration and such is
passed via IPC to the JobTracker separately.
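
As a side note, that staging replication is an ordinary configuration property.
A tiny sketch, assuming the pre-0.20 key name:

import org.apache.hadoop.mapred.JobConf;

public class SubmitReplication {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Replication factor for the job jar/files the JobClient stages in HDFS at
        // submit time; a higher value lets many TaskTrackers fetch the jar quickly.
        conf.setInt("mapred.submit.replication", 10);
        System.out.println(conf.get("mapred.submit.replication"));
    }
}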

-- Philip


Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-07 Thread Grace
Thanks all for your replying.

I have run several times with different Java options for Map/Reduce
tasks. However, there is not much difference.

Following is the example of my test setting:
Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages
-XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch
Test B: -Xmx1024m
Test C: -Xmx1024m -XXaggressive
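
For anyone reproducing this: such flags normally reach the map/reduce child JVMs
through mapred.child.java.opts. A minimal illustration, assuming the pre-0.20
property name (how the tests above were actually wired is not stated here):

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmOpts {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Everything in this property is appended to the command line of each child
        // JVM, so per-task JRockit flags would be listed here.
        conf.set("mapred.child.java.opts", "-Xmx1024m -XXaggressive");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}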

Is there any tricky or special setting for Jrockit vm on Hadoop?

In the Hadoop Quick Start guide, it says "JavaTM 1.6.x, preferably
from Sun". Is there any concern about a JRockit performance issue?

I'd highly appreciate for your time and consideration.


On Fri, May 8, 2009 at 7:36 AM, JQ Hadoop jq.had...@gmail.com wrote:

 There are a lot of tuning knobs for the JRockit JVM when it comes to
 performance; those tuning can make a huge difference. I'm very interested
 if
 there are some tuning tips for Hadoop.

 Grace, what are the parameters that you used in your testing?

 Thanks,
 JQ

 On Thu, May 7, 2009 at 11:35 PM, Steve Loughran ste...@apache.org wrote:

  Chris Collins wrote:
 
   a couple of years back we did a lot of experimentation between Sun's VM
   and JRockit.  We had initially assumed that JRockit was going to scream
   since that's what the press were saying.  In short, what we discovered was
   that certain JDK library usage was a little bit faster with JRockit, but for
   core VM performance such as synchronization and primitive operations the Sun VM
   outperformed it.  We were not taking account of startup time, just raw code
   execution.  As I said, this was a couple of years back so things may have
   changed.
 
  C
 
 
   I run JRockit as it's what some of our key customers use, and we need to
   test things. One lovely feature is tests time out before the stack runs out
   on a recursive operation; clearly different stack management at work.
   Another: no PermGenHeapSpace to fiddle with.

   * I have to turn debug logging off in hadoop test runs, or there are
   problems.

   * It uses short pointers (32 bits long) for near memory on a 64 bit JVM. So
   your memory footprint on sub-4GB VM images is better. Java7 promises this,
   and with the merger, who knows what we will see. This is unimportant on
   32-bit boxes.

   * debug single stepping doesn't work. That's ok, I use functional tests
   instead :)

   I haven't looked at outright performance.
 
  /
 



Using Hadoop API through python

2009-05-07 Thread Aditya Desai
Hi All,
Is there any way that I can access the Hadoop API through Python? I am aware
that Hadoop Streaming can be used to create a mapper and reducer in a
different language, but I have not come across any module that helps me apply
functions to manipulate data or control the job, as is an option in Java. First of
all, is it possible to do this? If yes, can you please tell me how?

Thanks,
Aditya.

-- 

George Burns http://www.brainyquote.com/quotes/authors/g/george_burns.html
- Happiness is having a large, loving, caring, close-knit family in
another city.


Re: Using Hadoop API through python

2009-05-07 Thread Amit Saha
On Fri, May 8, 2009 at 9:37 AM, Aditya Desai aditya3...@gmail.com wrote:
 Hi All,
 Is there any way that I can access the hadoop API through python. I am aware
 that hadoop streaming can be used to create a mapper and reducer in a
 different language but have not come accross any module that helps me apply
 functions to manipulate data or control as is an option in java. First of
 all is it possible to do this. If yes can you please tell me how.

You might want to try your hand at using Jython.

Just my 2-cents.

-Amit
-- 
http://amitksaha.blogspot.com
http://amitsaha.in.googlepages.com/
cornucopic on #scheme, #lisp, #math, #linux

*Bangalore Open Java Users Group*: http://www.bojug.in

Recursion is the basic iteration mechanism in Scheme
--- http://c2.com/cgi/wiki?TailRecursion


Re: Using Hadoop API through python

2009-05-07 Thread Zak Stone
You should consider using Dumbo to run Python jobs with Hadoop Streaming:

http://wiki.github.com/klbostee/dumbo

Dumbo is already very useful, and it is improving all the time.

Zak


On Fri, May 8, 2009 at 12:07 AM, Aditya Desai aditya3...@gmail.com wrote:
 Hi All,
 Is there any way that I can access the hadoop API through python. I am aware
 that hadoop streaming can be used to create a mapper and reducer in a
 different language but have not come accross any module that helps me apply
 functions to manipulate data or control as is an option in java. First of
 all is it possible to do this. If yes can you please tell me how.

 Thanks,
 Aditya.

 --

 George Burns http://www.brainyquote.com/quotes/authors/g/george_burns.html
 - Happiness is having a large, loving, caring, close-knit family in
 another city.



How to add user and group to hadoop?

2009-05-07 Thread Starry SHI
Hi, everyone! I am new to Hadoop. Recently I set up a small
Hadoop cluster and gave several users access to it. However, I notice
that no matter which user logs in to HDFS and does some operations, the
files always belong to the user DrWho in group Supergroup. HDFS
seems to provide no way to create users or groups inside it (only 'chown'
and 'chgrp'). Could anybody tell me how to add different user accounts
and groups to HDFS?

Thank you for your help!

Best,
Starry

/* Tomorrow is another day. So is today. */


Are the API changes in mapred-mapreduce in 0.20.0 usable?

2009-05-07 Thread Brian Ferris
I upgraded to 0.20.0 last week and I noticed most everything in  
org.apache.hadoop.mapred.* has been deprecated.  However, I've not  
been having any luck getting the new Map-Reduce classes to work.   
Hadoop Streaming still seems to expect the old API and it doesn't seem  
that JobClient has been rewritten to use the new APIs either.  Is  
there an example of using the new APIs someplace?  Should I be using  
them at all?


Thanks,
Brian


Re: How to add user and group to hadoop?

2009-05-07 Thread Wang Zhong
read this doc:

http://hadoop.apache.org/core/docs/r0.20.0/hdfs_permissions_guide.html
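
In short, HDFS takes the user and group from the client process; it keeps no
account database of its own. Once accounts exist on the client machines, the
superuser can assign ownership, for example programmatically. A hedged sketch
with placeholder names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChownSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // "alice" and "analysts" are placeholder names; run this as the HDFS superuser.
        Path home = new Path("/user/alice");
        fs.mkdirs(home);
        fs.setOwner(home, "alice", "analysts");
    }
}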


On Fri, May 8, 2009 at 12:56 PM, Starry SHI starr...@gmail.com wrote:
 Hi, everyone! I am new to hadoop and recently I have set up a small
 hadoop cluster and have several users access to it. However, I notice
 that no matter which user login to HDFS and do some operations, the
 files are always belong to the user DrWho in group Supergroup. HDFS
 seems provide no access to create user or group inside it(only 'chown'
 and 'chgrp'). Could anybody tell me how to add different user accounts
 and groups to HDFS?

 Thank you for your help!

 Best,
 Starry

 /* Tomorrow is another day. So is today. */




-- 
Wang Zhong


Re: Are the API changes in mapred-mapreduce in 0.20.0 usable?

2009-05-07 Thread Jothi Padmanabhan
examples/wordcount has been modified to use the new API. Also, there is a
test case in the mapreduce directory that uses the new API.
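
For anyone who wants the shape of it without digging through the source, here is
a rough word-count-style skeleton on the new org.apache.hadoop.mapreduce API. It
is a sketch, not the shipped example:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (token, 1) for every whitespace-separated token in the line.
            for (String token : value.toString().split("\\s+")) {
                if (token.length() > 0) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "new-api sketch");   // org.apache.hadoop.mapreduce.Job
        job.setJarByClass(NewApiSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}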

Jothi 


On 5/8/09 10:59 AM, Brian Ferris bdfer...@cs.washington.edu wrote:

 I upgraded to 0.20.0 last week and I noticed most everything in
 org.apache.hadoop.mapred.* has been deprecated.  However, I've not
 been having any luck getting the new Map-Reduce classes to work.
 Hadoop Streaming still seems to expect the old API and it doesn't seem
 that JobClient has been rewritten to use the new APIs either.  Is
 there an example of using the new APIs someplace?  Should I be using
 them at all?
 
 Thanks,
 Brian



Re: Are the API changes in mapred-mapreduce in 0.20.0 usable?

2009-05-07 Thread Brian Ferris

Thanks so much.  That did the trick.


On May 7, 2009, at 10:34 PM, Jothi Padmanabhan wrote:

examples/wordcount has been modified to use the new API. Also, there is a
test case in the mapreduce directory that uses the new API.

Jothi


On 5/8/09 10:59 AM, Brian Ferris bdfer...@cs.washington.edu wrote:


I upgraded to 0.20.0 last week and I noticed most everything in
org.apache.hadoop.mapred.* has been deprecated.  However, I've not
been having any luck getting the new Map-Reduce classes to work.
Hadoop Streaming still seems to expect the old API and it doesn't seem
that JobClient has been rewritten to use the new APIs either.  Is
there an example of using the new APIs someplace?  Should I be using
them at all?

Thanks,
Brian