How to write simple programs using Hadoop?

2008-05-07 Thread Hadoop

Is there any chance to see some simple programs for Hadoop (such as Hello
World, counting the numbers 1-10, reading two numbers and printing the larger
one, and other number, string, and file processing examples) written in
Java/C++?

It seems that the only publicly available code on the Internet is the
WordCount program.
I learn programming more easily and faster from examples, and I would
appreciate it if anyone could share some simple programs written in Java/C++
for Hadoop.

If there are any manuals, examples, or links about writing programs for
Hadoop, please share them.




Re: How to write simple programs using Hadoop?

2008-05-07 Thread Arun C Murthy


On May 7, 2008, at 12:33 AM, Hadoop wrote:



Is there any chance to see some simple programs for Hadoop (such as Hello
World, counting the numbers 1-10, reading two numbers and printing the larger
one, and other number, string, and file processing examples) written in
Java/C++?

It seems that the only publicly available code on the Internet is the
WordCount program.
I learn programming more easily and faster from examples, and I would
appreciate it if anyone could share some simple programs written in Java/C++
for Hadoop.

If there are any manuals, examples, or links about writing programs for
Hadoop, please share them.



Take a look at the src/examples directory in your hadoop distribution:
http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/

and
http://svn.apache.org/viewvc/hadoop/core/trunk/src/examples/pipes/impl/

Map-Reduce tutorial:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html

Hadoop Streaming:
http://hadoop.apache.org/core/docs/current/streaming.html
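
For a feel of the overall shape of a job, here is a minimal sketch. It is not
one of the bundled examples; it assumes the 0.16-era org.apache.hadoop.mapred
API, and the class names (SimpleJob, LineMapper, LineReducer) are made up. It
counts how many input lines have each byte length, i.e. the basic
text-processing pattern:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SimpleJob {

    // Mapper: for every input line, emit (line length in bytes, 1).
    public static class LineMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, IntWritable> out,
                        Reporter reporter) throws IOException {
            out.collect(new IntWritable(line.getLength()), ONE);
        }
    }

    // Reducer: sum the counts for each line length.
    public static class LineReducer extends MapReduceBase
            implements Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        public void reduce(IntWritable length, Iterator<IntWritable> counts,
                           OutputCollector<IntWritable, IntWritable> out,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            out.collect(length, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SimpleJob.class);
        conf.setJobName("line-length-count");
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(LineMapper.class);
        conf.setReducerClass(LineReducer.class);
        conf.setInputPath(new Path(args[0]));     // directory of text files
        conf.setOutputPath(new Path(args[1]));    // must not exist yet
        JobClient.runJob(conf);
    }
}

Package it in a jar and run it with something like
bin/hadoop jar simplejob.jar SimpleJob <input dir> <output dir>. (In later
releases the setInputPath/setOutputPath calls move to
FileInputFormat.setInputPaths and FileOutputFormat.setOutputPath.)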

Arun







Re: Collecting output not to file

2008-05-07 Thread Amar Kamat

Derek Shaw wrote:

Hey,

From the examples that I have seen thus far, all of the results from the reduce 
function are being written to a file. Instead of writing results to a file, I 
want to store them

What do you mean by store and inspect?

 and inspect them after the job is completed. (I think that I need to implement 
my own OutputCollector, but I don't know how to tell hadoop to use it.) How can 
I do this?

-Derek

  




Re: single node Hbase

2008-05-07 Thread Yuri Kudryavcev
Try this one
http://hadoop.apache.org/hbase/docs/r0.1.1/api/overview-summary.html#overview_description
- Yuri.

On Wed, May 7, 2008 at 4:40 PM, Ahmed Shiraz Memon 
[EMAIL PROTECTED] wrote:

 the link is not working...
 Shiraz

 On Mon, Mar 17, 2008 at 9:34 PM, stack [EMAIL PROTECTED] wrote:

  Try our 'getting started':
  http://hadoop.apache.org/hbase/docs/current/api/index.html.
  St.Ack
 
 
 
  Peter W. wrote:
 
   Hello,
  
   Are there any Hadoop documentation resources showing
   how to run the current version of Hbase on a single node?
  
   Thanks,
  
   Peter W.
  
 
 



Not allow file split

2008-05-07 Thread Roberto Zandonati
Hi all, I'm a newbie and I have the following problem.

I need to implement an InputFormat whose isSplitable always
returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question
no. 10). And here is the problem.

I also have to implement the RecordReader interface to return the
whole content of the input file, but I don't know how. I have only found
examples that use the LineRecordReader.

Can someone help me?

Thanks

-- 
Roberto Zandonati


Re: Not allow file split

2008-05-07 Thread Rahul Sood
You can implement a custom input format and a record reader. Assuming
your record data type is a class RecType, the input format should subclass
FileInputFormat<LongWritable, RecType> and the record reader should
implement RecordReader<LongWritable, RecType>.

In this case the key could be the offset into the file, although it is
not very useful since you treat the entire file as one record.

The isSplitable() method in the input format should return false.
The RecordReader.next(LongWritable pos, RecType val) method should
read the entire file and set val to the file contents. This will ensure
that the entire file goes to one map task as a single record.
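
For illustration, here is a minimal sketch along those lines using
BytesWritable for the value type and the 0.16-era org.apache.hadoop.mapred
API; the class names are made up, and the whole file is buffered in memory,
so treat it only as a starting point:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {

    // Never split a file: each file becomes exactly one map task.
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    // Produces a single record per file: key = 0, value = the file's raw bytes.
    static class WholeFileRecordReader
            implements RecordReader<LongWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean done = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
            if (done) {
                return false;                      // the single record was already returned
            }
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(job);
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = fs.open(path);
            try {
                in.readFully(0, contents);         // read the entire file into memory
            } finally {
                in.close();
            }
            key.set(0);
            value.set(contents, 0, contents.length);
            done = true;
            return true;
        }

        public long getPos() { return done ? split.getLength() : 0; }
        public float getProgress() { return done ? 1.0f : 0.0f; }
        public void close() throws IOException { }
    }
}

The job is then pointed at it with conf.setInputFormat(WholeFileInputFormat.class).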

-Rahul Sood
[EMAIL PROTECTED]

 Hi all, I'm a newbie and I have the following problem.
 
 I need to implement an InputFormat whose isSplitable always
 returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question
 no. 10). And here is the problem.
 
 I also have to implement the RecordReader interface to return the
 whole content of the input file, but I don't know how. I have only found
 examples that use the LineRecordReader.
 
 Can someone help me?
 
 Thanks
 



Where is the files?

2008-05-07 Thread hong

Hi All,

I started Hadoop in standalone mode and put some files onto HDFS. I
strictly followed the instructions in the Hadoop Quick Start.


HDFS is mapped to a local directory in my local file system, right?
And where is it?


Thank you in advance!




Re: Where is the files?

2008-05-07 Thread vikas
It will be mapped under /tmp (hadoop.tmp.dir defaults to
/tmp/hadoop-${user.name}); on Windows it is the equivalent of HADOOP_ROOT/tmp
on the drive.

Regards,
-Vikas.

On Wed, May 7, 2008 at 8:06 PM, hong [EMAIL PROTECTED] wrote:

 Hi All,

 I started Hadoop in standalone mode and put some files onto HDFS. I
 strictly followed the instructions in the Hadoop Quick Start.

 HDFS is mapped to a local directory in my local file system, right? And
 where is it?

 Thank you in advance!





Re: Not allow file split

2008-05-07 Thread Arun C Murthy


On May 7, 2008, at 6:30 AM, Roberto Zandonati wrote:


Hi all, I'm a newbie and I have the following problem.

I need to implement an InputFormat whose isSplitable always
returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question
no. 10). And here is the problem.

I also have to implement the RecordReader interface to return the
whole content of the input file, but I don't know how. I have only found
examples that use the LineRecordReader.



A couple of things:

1. Take a look at SequenceFileRecordReader:
http://svn.apache.org/viewvc/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/SequenceFileRecordReader.java?view=log

2. If you just want to process a text file as a whole or a sequence
file as a whole (or any existing format), you do not need to implement a
RecordReader at all. Just sub-class the InputFormat, override
isSplitable, and the existing RecordReader will work correctly. Take a look at
SortValidator (http://svn.apache.org/viewvc/hadoop/core/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java)
and how it sub-classes SequenceFileInputFormat to implement a
NonSplittableSequenceFileInputFormat.
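
That sub-class boils down to something like the following sketch (the
inherited record reader is reused unchanged; only the splitting is disabled):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Each sequence file becomes exactly one split, hence one map task.
public class NonSplittableSequenceFileInputFormat<K, V>
        extends SequenceFileInputFormat<K, V> {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}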


Arun



Re: Where is the files?

2008-05-07 Thread Hairong Kuang
DFS files are split into blocks, and the blocks are stored under
dfs.data.dir/current on each datanode.
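
If you want to see where that is on a particular machine, one illustrative way
(assuming the Hadoop jars and your conf directory are on the classpath) is to
print the relevant configuration properties:

import org.apache.hadoop.conf.Configuration;

public class ShowDfsDirs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // reads hadoop-default.xml / hadoop-site.xml
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
        System.out.println("dfs.data.dir    = " + conf.get("dfs.data.dir"));
    }
}

By default dfs.data.dir lives under hadoop.tmp.dir, i.e. somewhere like
/tmp/hadoop-<username>/dfs/data.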

Hairong


On 5/7/08 7:36 AM, hong [EMAIL PROTECTED] wrote:

 Hi All,
 
 I started Hadoop in standalone mode and put some files onto HDFS. I
 strictly followed the instructions in the Hadoop Quick Start.
 
 HDFS is mapped to a local directory in my local file system, right?
 And where is it?
 
 Thank you in advance!
 
 



Read timed out, Abandoning block blk_-5476242061384228962

2008-05-07 Thread James Moore
What is this bit of the log trying to tell me, and what sorts of
things should I be looking at to make sure it doesn't happen?

I don't think the network has any basic configuration issues - I can
telnet from the machine creating this log to the destination - telnet
10.252.222.239 50010 works fine when I ssh in to the box with this
error.

2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Abandoning block blk_-5476242061384228962
2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting
to find target node: 10.252.222.239:50010

I'm seeing a fair number of these.  My reduces finally complete, but
there are usually a couple at the end that take longer than I think
they should, and they frequently have these sorts of errors.

I'm running 20 machines on ec2 right now, with hadoop version 0.16.4.
-- 
James Moore | [EMAIL PROTECTED]
blog.restphone.com


Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-07 Thread James Moore
I noticed that there was a hard-coded timeout value of 6000 (ms) in
src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I
took that way down and now I'm not noticing the problem.  (Doesn't
mean it's not there, I just don't feel the pain...)

This feels like a terrible solution^H^H^H^H^H^hack though,
particularly since I haven't yet taken the time to actually understand
the code.

-- 
James Moore | [EMAIL PROTECTED]
blog.restphone.com


Hadoop Permission Problem

2008-05-07 Thread Natarajan, Senthil
Hi,
My datanode and jobtracker are started by user hadoop,
and user Test needs to submit jobs. So when user Test copies a file to
HDFS, there is a permission error:
/usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/Test/somefile.txt myapps
copyFromLocal: org.apache.hadoop.fs.permission.AccessControlException:
Permission denied: user=Test, access=WRITE,
inode=user:hadoop:supergroup:rwxr-xr-x
Could you please let me know how users other than hadoop can access
HDFS and submit MapReduce jobs? What needs to be configured, or which
default configuration needs to be changed?

Thanks,
Senthil



Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-07 Thread Hairong Kuang
Taking the timeout out is very dangerous. It may cause your application to
hang. You could change the timeout parameter to a larger number. HADOOP-2188
fixed the problem. Check https://issues.apache.org/jira/browse/HADOOP-2188.

Hairong

On 5/7/08 2:36 PM, James Moore [EMAIL PROTECTED] wrote:

 I noticed that there was a hard-coded timeout value of 6000 (ms) in
 src/java/org/apache/hadoop/dfs/DFSClient.java - as an experiment, I
 took that way down and now I'm not noticing the problem.  (Doesn't
 mean it's not there, I just don't feel the pain...)
 
 This feels like a terrible solution^H^H^H^H^H^hack though,
 particularly since I haven't yet taken the time to actually understand
 the code.



Re: Read timed out, Abandoning block blk_-5476242061384228962

2008-05-07 Thread Chris K Wensel

Hi James

Were you able to start all the nodes in the same 'availability zone'?
Are you using the new AMI kernels?


If you are using the contrib/ec2 scripts, you might upgrade (just the  
scripts) to

http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.17/src/contrib/ec2/

These support the new kernels and availability zones. My transient  
errors went away when upgrading.


The functional changes are documented here:
http://wiki.apache.org/hadoop/AmazonEC2

FYI, you will need to build your own images (via the create-image command)
with whatever version of Hadoop you are comfortable with. This will also get
you a Ganglia install...


ckw

On May 7, 2008, at 1:29 PM, James Moore wrote:


What is this bit of the log trying to tell me, and what sorts of
things should I be looking at to make sure it doesn't happen?

I don't think the network has any basic configuration issues - I can
telnet from the machine creating this log to the destination - telnet
10.252.222.239 50010 works fine when I ssh in to the box with this
error.

2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Exception in createBlockOutputStream java.net.SocketTimeoutException:
Read timed out
2008-05-07 13:20:31,194 INFO org.apache.hadoop.dfs.DFSClient:
Abandoning block blk_-5476242061384228962
2008-05-07 13:20:31,196 INFO org.apache.hadoop.dfs.DFSClient: Waiting
to find target node: 10.252.222.239:50010

I'm seeing a fair number of these.  My reduces finally complete, but
there are usually a couple at the end that take longer than I think
they should, and they frequently have these sorts of errors.

I'm running 20 machines on ec2 right now, with hadoop version 0.16.4.
--
James Moore | [EMAIL PROTECTED]
blog.restphone.com


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Re: Hadoop Permission Problem

2008-05-07 Thread s29752-hadoopuser
Hi Senthil,

Since the path myapps is relative, copyFromLocal will copy the file to the
home directory, i.e. /user/Test/myapps in your case.  If /user/Test doesn't
exist, it will first try to create it.  You got AccessControlException because
the permission of /user is 755.
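
One way to fix it (a sketch, not from this thread): as the hadoop superuser,
create /user/Test up front and make Test its owner, either with the
hadoop dfs -mkdir and -chown shell commands or programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Run this as the "hadoop" superuser so the mkdir/chown under /user succeed.
public class CreateUserHome {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up hadoop-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path home = new Path("/user/Test");
        fs.mkdirs(home);                            // create the home directory
        fs.setOwner(home, "Test", "supergroup");    // owner Test, keep the default group
    }
}

After that, user Test can write under /user/Test and submit jobs as usual.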

Hope this helps.

Nicholas



- Original Message 
From: Natarajan, Senthil [EMAIL PROTECTED]
To: [EMAIL PROTECTED] [EMAIL PROTECTED]
Sent: Wednesday, May 7, 2008 2:36:22 PM
Subject: Hadoop Permission Problem

Hi,
My datanode and jobtracker are started by user hadoop,
and user Test needs to submit jobs. So when user Test copies a file to
HDFS, there is a permission error:
/usr/local/hadoop/bin/hadoop dfs -copyFromLocal /home/Test/somefile.txt myapps
copyFromLocal: org.apache.hadoop.fs.permission.AccessControlException:
Permission denied: user=Test, access=WRITE,
inode=user:hadoop:supergroup:rwxr-xr-x
Could you please let me know how users other than hadoop can access
HDFS and submit MapReduce jobs? What needs to be configured, or which
default configuration needs to be changed?

Thanks,
Senthil

Fwd: Collecting output not to file

2008-05-07 Thread Derek Shaw
To clarify:
 
 static class TestOutputFormat implements OutputFormat<Text, Text>
 {
     static class TestRecordWriter implements RecordWriter<Text, Text>
     {
         TestOutputFormat output;

         public TestRecordWriter (TestOutputFormat output,
                 org.apache.hadoop.fs.FileSystem ignored, JobConf job,
                 String name, Progressable progress)
         {
             this.output = output;
         }

         public void close (Reporter reporter)
         {}

         public void write (Text key, Text value)
         {
             output.addResults (value.toString ());
         }
     }

     protected String results = "";

     public void checkOutputSpecs (org.apache.hadoop.fs.FileSystem ignored, JobConf job)
         throws IOException
     {}

     public RecordWriter<Text, Text> getRecordWriter (
             org.apache.hadoop.fs.FileSystem ignored, JobConf job,
             String name, Progressable progress)
     {
         return new TestRecordWriter (this, ignored, job, name, progress);
     }

     public void addResults (String r)
     {
         results += r + ",";
     }

     public String getResults ()
     {
         return results;
     }
 }

 And then running the task:
 public int run (String[] args)
     throws Exception
 {

     JobClient.runJob (job);

     // getOutputFormat creates a new instance of the output format. I want to
     // get the instance of the output format that the reduce function wrote to.
     // The RecordWriter that reduce wrote to would be just as good.
     TestOutputFormat results = (TestOutputFormat) job.getOutputFormat ();

     // Always prints the empty string, not the populated results
     System.out.println ("results: " + results.getResults ());

     return 0;
 }

Derek Shaw [EMAIL PROTECTED] wrote:
Date: Tue, 6 May 2008 23:26:30 -0400 (EDT)
From: Derek Shaw [EMAIL PROTECTED]
Subject: Collecting output not to file
To: core-user@hadoop.apache.org

 Hey,

From the examples that I have seen thus far, all of the results from the 
reduce function are being written to a file. Instead of writing results to a 
file, I want to store them and inspect them after the job is completed. (I 
think that I need to implement my own OutputCollector, but I don't know how to 
tell hadoop to use it.) How can I do this?

-Derek