Re: how to connect to remote hadoop dfs by eclipse plugin?
Why don't you use it with localhost? Does that have a disadvantage? As far as I know there were several host-to-IP resolution problems in hadoop, but that was a while ago and I think they should have been solved by now. It can also be about the order of entries in the hosts file.

2009/5/14 andy2005cst andy2005...@gmail.com: When the IP is set to localhost it works well, but if I change localhost to the IP address it does not work at all. So my hadoop is ok, just the connection fails.

Rasit OZDAS wrote: Your hadoop isn't working at all, or isn't working at the specified port. - Try the stop-all.sh command on the namenode; if it says "no namenode to stop", take a look at the namenode logs and paste them here if anything seems strange. - If the namenode logs are ok (filled with INFO messages), then take a look at all the logs. - In the eclipse plugin, the left side is for the map/reduce port and the right side is for the namenode port; make sure both match your configuration in the xml files.

2009/5/12 andy2005cst andy2005...@gmail.com: When I use the eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to connect to a remote hadoop dfs, I get an IOException. Running a map/reduce program outputs: 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 0 time(s). 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 1 time(s). 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 2 time(s). Exception in thread "main" java.io.IOException: Call to /**.**.**.**:9100 failed on local exception: java.net.SocketException: Connection refused: connect Looking forward to your help, thanks a lot.

-- M. Raşit ÖZDAŞ
Re: how to connect to remote hadoop dfs by eclipse plugin?
Your hadoop isn't working at all, or isn't working at the specified port. - Try the stop-all.sh command on the namenode; if it says "no namenode to stop", then take a look at the namenode logs and paste them here if anything seems strange. - If the namenode logs are ok (filled with INFO messages), then take a look at all the logs. - In the eclipse plugin, the left side is for the map/reduce port and the right side is for the namenode port; make sure both match your configuration in the xml files.

2009/5/12 andy2005cst andy2005...@gmail.com: When I use the eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to connect to a remote hadoop dfs, I get an IOException. Running a map/reduce program outputs: 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 0 time(s). 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 1 time(s). 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server: /**.**.**.**:9100. Already tried 2 time(s). Exception in thread "main" java.io.IOException: Call to /**.**.**.**:9100 failed on local exception: java.net.SocketException: Connection refused: connect Looking forward to your help, thanks a lot.

-- M. Raşit ÖZDAŞ
Re: how to improve the Hadoop's capability of dealing with small files
I have a similar situation with very small files. I never tried HBase (I want to), but you can also group them and write, say, 20-30 of them into one file, so that every original file becomes a key in that big file. There are methods in the API with which you can write an object as a file into HDFS and read it back to get the original object; having a list of items in the object can solve this problem.
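For example, a rough sketch of the packing idea using a SequenceFile keyed by the original file name (the target path and class name are made up, and error handling is omitted):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One big SequenceFile on HDFS; key = original file name, value = its bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/user/me/packed.seq"), Text.class, BytesWritable.class);
    for (String name : args) {              // each argument is a small local file to pack
      File f = new File(name);
      byte[] content = new byte[(int) f.length()];
      DataInputStream in = new DataInputStream(new FileInputStream(f));
      in.readFully(content);
      in.close();
      writer.append(new Text(f.getName()), new BytesWritable(content));
    }
    writer.close();
  }
}

Reading it back with SequenceFile.Reader then gives you the original name/content pairs one by one.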
Re: Distributed Agent
Take a look at this topic: http://dsonline.computer.org/portal/site/dsonline/menuitem.244c5fa74f801883f1a516106bbe36ec/index.jsp?pName=dso_level1_about&path=dsonline/topics/agents&file=about.xml&xsl=generic.xsl 2009/4/14 Burak ISIKLI burak.isi...@yahoo.com: Hello everyone; I want to write a distributed agent program. But i can't understand one thing that what's difference between client-server program and agent program? Pls help me... Burak ISIKLI Dumlupinar University Electric Electronic - Computer Engineering http://burakisikli.wordpress.com http://burakisikli.blogspot.com -- M. Raşit ÖZDAŞ
Re: Ynt: Re: Cannot access Jobtracker and namenode
It's normal that they are all empty. Look at the files with the .log extension.

On Sunday, April 12, 2009 at 23:30, halilibrahimcakir halilibrahimca...@mynet.com wrote: I followed these steps: $ bin/stop-all.sh $ rm -ri /tmp/hadoop-root $ bin/hadoop namenode -format $ bin/start-all.sh and looked at localhost:50070 and localhost:50030 in my browser; the result was not different, again Error 404. I looked at these files: $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out1 $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out2 $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out3 $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out4 The 4th file is the last one related to the namenode logs in the logs directory. All of them are empty. I don't understand what is wrong.

----- Original Message ----- From: core-user@hadoop.apache.org To: core-user@hadoop.apache.org Sent: 12/04/2009 22:56 Subject: Re: Cannot access Jobtracker and namenode

Try looking at the namenode logs (under the logs directory). There should be an exception. Paste it here if you don't understand what it means.

On Sunday, April 12, 2009 at 22:22, halilibrahimcakir halilibrahimca...@mynet.com wrote:
> I typed:
> $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> Deleted this directory:
> $ rm -ri /tmp/hadoop-root
> Formatted the namenode again:
> $ /bin/hadoop namenode -format
> Stopped:
> $ /bin/stop-all.sh
> then typed:
> $ ssh localhost
> and it didn't ask me for a password. I started:
> $ /bin/start-all.sh
> But nothing changed :(
>
> ----- Original Message ----- From: core-user@hadoop.apache.org To: core-user@hadoop.apache.org Sent: 12/04/2009 21:33 Subject: Re: Ynt: Re: Cannot access Jobtracker and namenode
>
> There are two commands in the hadoop quick start, used for passwordless ssh. Try those.
> $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> http://hadoop.apache.org/core/docs/current/quickstart.html
> -- M. Raşit ÖZDAŞ
>
> Halil İbrahim ÇAKIR Dumlupınar Üniversitesi Bilgisayar Mühendisliği http://cakirhal.blogspot.com

Halil İbrahim ÇAKIR Dumlupınar Üniversitesi Bilgisayar Mühendisliği http://cakirhal.blogspot.com

-- M. Raşit ÖZDAŞ
Re: Cannot access Jobtracker and namenode
Does your system request a password when you ssh to localhost outside hadoop?

On Sunday, April 12, 2009 at 20:51, halilibrahimcakir halilibrahimca...@mynet.com wrote: Hi, I am new to hadoop. I downloaded Hadoop-0.19.0 and followed the instructions in the quick start manual (http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html). When I came to the Pseudo-Distributed Operation section there was no problem, but localhost:50070 and localhost:50030 couldn't be opened; it says localhost refused the connection. I tried this on another machine, but it says something like Http Error 404: /dfshealth.jsp. How can I see these pages and continue using hadoop? Thanks. Additional information: OS: Debian 5.0 (latest version); JDK: Sun Java 1.6 (latest version); rsync and ssh installed; edited hadoop-site.xml properly. Halil İbrahim ÇAKIR Dumlupınar Üniversitesi Bilgisayar Mühendisliği http://cakirhal.blogspot.com

-- M. Raşit ÖZDAŞ
Re: Ynt: Re: Cannot access Jobtracker and namenode
There are two commands in hadoop quick start, used for passwordless ssh. Try those. $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys http://hadoop.apache.org/core/docs/current/quickstart.html -- M. Raşit ÖZDAŞ
Re: Web ui
@Nick, I use ajax very often and have previously done projects with ZK and jQuery; I can easily say that GWT was the easiest of them. Javascript is only needed where the core features aren't enough, so I can safely assume that we won't need any inline javascript. @Philip, thanks for the pointer. That is a better solution than I imagined, actually, and I won't have to wait since it's a resolved issue. -- M. Raşit ÖZDAŞ
Web ui
Hi, I started to write my own web ui with GWT. With GWT I can manage everything within one page and set a refresh interval for each part of the page, and also get a better look and feel with the help of GWT styling. But I can't get references to the NameNode and JobTracker instances; I found out that they're passed to the web ui as application parameters when hadoop initializes. I'll gladly try to contribute the gui part of my project to the hadoop source, but I need static references to the namenode and jobtracker for this. I think it will be useful for everyone like me. M. Rasit OZDAS
Re: mapreduce problem
MultipleOutputFormat would be what you want; it supports writing multiple files as output. I can paste some code here if you want.

2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in: Hi, I am new to the map-reduce programming model. I am writing an MR job that processes a log file and writes the results to different files on hdfs based on some values in the log file. The program works fine even though I haven't done any processing in the reducer, but I don't see how to use the reducer to solve my problem efficiently. Can anybody please help me on this?

-- M. Raşit ÖZDAŞ
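For reference, a rough sketch of the MultipleTextOutputFormat approach; the class name and the idea of routing on the key are my own assumptions, not the original poster's job:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to an output file named after its key (e.g. a field of the log line).
public class LogTypeOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // "name" is the default leaf name such as part-00000,
    // so a key of "ERROR" ends up in ERROR/part-00000
    return key.toString() + "/" + name;
  }
}

// In the driver:
// JobConf conf = new JobConf(MyLogJob.class);
// conf.setOutputFormat(LogTypeOutputFormat.class);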
Re: HELP: I wanna store the output value into a list not write to the disk
Hi, hadoop is normally designed to write to disk. There is a special file format which writes output to RAM instead of disk, but I have no idea whether it's what you're looking for. For what you describe to exist, there would have to be a mechanism which sends output as objects rather than file content across machines, and as far as I know there is no such feature yet. Good luck.

2009/4/2 andy2005cst andy2005...@gmail.com: I need to use the output of the reduce, but I don't know how. Using the wordcount program as an example, if I want to collect the word counts into a hashtable for further use, how can I do it? The example just shows how to put the result onto disk. My email is andy2005...@gmail.com. Looking forward to your help, thanks a lot.

-- M. Raşit ÖZDAŞ
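If the goal is just to use the counts in memory after the job, one workable (if unglamorous) sketch is to read the job's output files back from HDFS once it finishes; the output path below is only an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CollectCounts {
  public static Map<String, Integer> read(String part) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path(part))));
    String line;
    while ((line = in.readLine()) != null) {
      String[] kv = line.split("\t");      // TextOutputFormat writes key<TAB>value
      counts.put(kv[0], Integer.valueOf(kv[1]));
    }
    in.close();
    return counts;
  }
  // e.g. read("/user/andy/wordcount-out/part-00000") after JobClient.runJob(conf) returns
}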
Re: reducer in M-R
Since every file name is different, you have a unique key for each map output. That means every iterator has only one element, so you won't need to search for a given name. But it's possible that I misunderstood you.

2009/4/2 Vishal Ghawate vishal_ghaw...@persistent.co.in: Hi, I just wanted to know about the values parameter passed to the reducer, which is always an iterator used to iterate through the values for a particular key. Now I want to use the file name as the key and the file content as its value, so how can I set the parameters in the reducer? Can anybody please help me on this?

-- M. Raşit ÖZDAŞ
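If the file name really is needed as the key, one sketch (mine, not the original poster's code) is to take it from the input split inside the mapper, so the reducer then receives the file name as the key and that file's lines as the values:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // The input split tells us which file this record came from.
    String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    output.collect(new Text(fileName), line);
  }
}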
Re: Cannot resolve Datanode address in slave file
Hi, Sim, I have two suggestions, if you haven't tried them yet: 1. Check whether your other hosts can ssh to the master. 2. Take a look at the logs on the other hosts.

2009/4/2 Puri, Aseem aseem.p...@honeywell.com: Hi, I have a small Hadoop cluster with 3 machines. One is my NameNode/JobTracker + DataNode/TaskTracker and the other 2 are DataNode/TaskTracker, so I have made all 3 slaves. In the slaves file I have put the names of all three machines as: master slave slave1 When I start the Hadoop cluster it always starts the DataNode/TaskTracker on the last slave in the list and does not start the DataNode/TaskTracker on the other two machines. I also get the message: slave1: : no address associated with name : no address associated with name slave1: starting datanode, logging to /home/HadoopAdmin/hadoop/bin/../logs/hadoop-HadoopAdmin-datanode-ie11dtxpficbfise.out If I change the order in the slaves file like this: slave slave1 master then the DataNode/TaskTracker on the master machine starts and not on the other two. Please tell me how I should solve this problem. Sim

-- M. Raşit ÖZDAŞ
Re: Strange Reduce Behavior
Yes, we've built a local version of one of our hadoop processes. We needed 500 input files in hadoop to match the speed of the local process; the total time was 82 seconds on a cluster of 6 machines. I think that's good performance compared with other distributed processing systems.

2009/4/2 jason hadoop jason.had...@gmail.com: 3) The framework is designed for working on large clusters of machines, where there needs to be a little delay between operations to avoid massive network loading spikes, and the initial setup of the map task execution environment on a machine and the initial setup of the reduce task execution environment take a bit of time. In production jobs these delays and setup times are lost in the overall task run time. In the small test job case the delays and setup times will be the bulk of the time spent executing the test.
Re: Reducer side output
I think the problem is that you don't have permission to write to the path you define. Did you try it with a path under your user directory? You can change permissions from the console.

2009/4/1 Nagaraj K nagar...@yahoo-inc.com: Hi, I am trying to do a side-effect output along with the usual output from the reducer. But for the side-effect output attempt, I get the following error. org.apache.hadoop.fs.permission.AccessControlException: org.apache.hadoop.fs.permission.AccessControlException: Permission denied: user=nagarajk, access=WRITE, inode=:hdfs:hdfs:rwxr-xr-x at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2311) at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477) at org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383) at org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310) at org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)

My reducer code:

conf.set("group_stat", some_path); // Set during the configuration of the jobconf object

public static class ReducerClass extends MapReduceBase
    implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  FSDataOutputStream part = null;
  JobConf conf;

  public void reduce(Text key, Iterator<DoubleWritable> values,
                     OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
    double i_sum = 0.0;
    while (values.hasNext()) {
      i_sum += ((Double) values.next()).valueOf();
    }
    String[] fields = key.toString().split(SEP);
    if (fields.length == 1) {
      if (part == null) {
        FileSystem fs = FileSystem.get(conf);
        String jobpart = conf.get("mapred.task.partition");
        part = fs.create(new Path(conf.get("group_stat"), "/part-000" + jobpart)); // Failing here
      }
      part.writeBytes(fields[0] + "\t" + i_sum + "\n");
    } else
      output.collect(key, new DoubleWritable(i_sum));
  }
}

Can you guys let me know what I am doing wrong here? Thanks, Nagaraj K

-- M. Raşit ÖZDAŞ
Re: Running MapReduce without setJar
Yes, and as additional info, you can use this code to just start the job without waiting for it to finish: JobClient client = new JobClient(conf); client.submitJob(conf); (runJob, by contrast, blocks until the job completes.)

2009/4/1 javateck javateck javat...@gmail.com: you can run it from a java program: JobConf conf = new JobConf(MapReduceWork.class); // setting your params JobClient.runJob(conf);
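A small, hedged sketch of the difference; MapReduceWork is the class name from the quoted mail, and the polling loop is optional:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncSubmit {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapReduceWork.class);  // MapReduceWork: the job class from the quoted mail
    // ... set input/output paths, mapper and reducer here ...
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);   // returns immediately; the job keeps running
    // JobClient.runJob(conf) would instead block and print progress until the job finishes.
    while (!job.isComplete()) {                // poll only if you later decide to wait after all
      Thread.sleep(5000);
    }
  }
}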
Re: what change to be done in OutputCollector to print custom writable object
There is also a good alternative: we use ObjectInputFormat and ObjectRecordReader classes. With them you can easily do file-to-object translations. I can send a code sample to your mail if you want.
Re: a doubt regarding an appropriate file system
If performance is important to you, look at this quote from a previous thread: HDFS is a file system for distributed storage, typically for a distributed computing scenario over hadoop. For office purposes you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for Amazon S3, if the data is really authentic. For an opensource solution related to SAN, you can go with any of the linux server distributions (e.g. RHEL, SuSE) or Solaris (ZFS + zones), or perhaps the best plug-n-play solution (non-open-source) would be a Mac Server + XSan. --nitesh Besides, I wouldn't use HDFS for this purpose. Rasit
Re: a doubt regarding an appropriate file system
I'm not sure I understood you correctly, but if I did, there is a previous thread that helps to better understand what hadoop is intended to be and what disadvantages it has: http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html

2009/4/2 Rasit OZDAS rasitoz...@gmail.com: If performance is important to you, look at this quote from a previous thread: HDFS is a file system for distributed storage, typically for a distributed computing scenario over hadoop. For office purposes you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for Amazon S3, if the data is really authentic. For an opensource solution related to SAN, you can go with any of the linux server distributions (e.g. RHEL, SuSE) or Solaris (ZFS + zones), or perhaps the best plug-n-play solution (non-open-source) would be a Mac Server + XSan. --nitesh Besides, I wouldn't use HDFS for this purpose. Rasit

-- M. Raşit ÖZDAŞ
Re: hdfs-doubt
It seems that either the NameNode or the DataNode is not started. You can take a look at the log files and paste the related lines here.

2009/3/29 deepya m_dee...@yahoo.co.in: Thanks. I have another doubt. I just want to run the examples and see how it works. I am trying to copy a file from the local file system to hdfs using the command bin/hadoop fs -put conf input and it gives the following error: 09/03/29 05:50:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.NoRouteToHostException: No route to host 09/03/29 05:50:54 INFO hdfs.DFSClient: Abandoning block blk_-5733385806393158149_1053 I have only one datanode in my cluster and my replication factor is also 1 (as configured in hadoop-site.xml). Can you please provide a solution for this? Thanks in advance, SreeDeepya

sree deepya wrote: Hi sir/madam, I am SreeDeepya, doing an MTech at IIIT. I am working on a project named cost effective and scalable storage server. The main goal of the project is to be able to store images on a server, and the data can be up to petabytes. For that we are using HDFS. I am new to hadoop and am just learning about it. Can you please clarify some of the doubts I have. At present we configured one datanode and one namenode. The jobtracker is running on the namenode and the tasktracker on the datanode. The namenode also acts as the client; we are writing programs on the namenode to store or retrieve images. My doubts are: 1. Can we put the client and namenode on two separate systems? 2. Can we access the images on the datanode of the hadoop cluster from a machine which doesn't have hdfs? 3. At present we may not have petabytes of data; it will be gigabytes. Is hadoop still efficient for storing megabytes and gigabytes of data? Thanking you, yours sincerely, SreeDeepya

-- M. Raşit ÖZDAŞ
Re: Identify the input file for a failed mapper/reducer
Two quotes for this problem: Streaming map tasks should have a map_input_file environment variable like the following: map_input_file=hdfs://HOST/path/to/file the value for map.input.file gives you the exact information you need. (didn't try) Rasit 2009/3/26 Jason Fennell jdfenn...@gmail.com: Is there a way to identify the input file a mapper was running on when it failed? When a large job fails because of bad input lines I have to resort to rerunning the entire job to isolate a single bad line (since the log doesn't contain information on the file that that mapper was running on). Basically, I would like to be able to do one of the following: 1. Find the file that a mapper was running on when it failed 2. Find the block that a mapper was running on when it failed (and be able to find file names from block ids) I haven't been able to find any documentation on facilities to accomplish either (1) or (2), so I'm hoping someone on this list will have a suggestion. I am using the Hadoop streaming API on hadoop 0.18.2. -Jason -- M. Raşit ÖZDAŞ
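For a Java (non-streaming) job, a hedged sketch of the second quote's idea: read map.input.file in configure() so a failed task's exception at least names the offending file. The class name and the error handling are made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InputAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private String inputFile;

  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");   // the file this map task is reading
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    try {
      // ... parse the line and emit key/value pairs ...
    } catch (RuntimeException e) {
      // the task failure message now points at the bad file and offset
      throw new IOException("Bad record in " + inputFile + " at offset " + offset);
    }
  }
}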
Re: Reduce doesn't start until map finishes
Just to inform, we installed v.0.21.0-dev and there is no such issue now. 2009/3/6 Rasit OZDAS rasitoz...@gmail.com So, is there currently no solution to my problem? Should I live with it? Or do we have to have a JIRA for this? What do you think? 2009/3/4 Nick Cen cenyo...@gmail.com Thanks, about the Secondary Sort, can you provide some example. What does the intermediate keys stands for? Assume I have two mapper, m1 and m2. The output of m1 is (k1,v1),(k2,v2) and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belongs to the same partition and k1 k2, so i think the order inside reducer maybe: (k1,v1) (k1,v3) (k2,v2) (k2,v4) can the Secondary Sort change this order? 2009/3/4 Chris Douglas chri...@yahoo-inc.com The output of each map is sorted by partition and by key within that partition. The reduce merges sorted map output assigned to its partition into the reduce. The following may be helpful: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html If your job requires total order, consider o.a.h.mapred.lib.TotalOrderPartitioner. -C On Mar 3, 2009, at 7:24 PM, Nick Cen wrote: can you provide more info about sortint? The sort is happend on the whole data set, or just on the specified partion? 2009/3/4 Mikhail Yakshin greycat.na@gmail.com On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote: This is normal behavior. The Reducer is guaranteed to receive all the results for its partition in sorted order. No reduce can start until all the maps are completed, since any running map could emit a result that would violate the order for the results it currently has. -C _Reducers_ usually start almost immediately and start downloading data emitted by mappers as they go. This is their first phase. Their second phase can start only after completion of all mappers. In their second phase, they're sorting received data, and in their third phase they're doing real reduction. -- WBR, Mikhail Yakshin -- http://daily.appspot.com/food/ -- http://daily.appspot.com/food/ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
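On the secondary sort question quoted above: it doesn't reorder values directly; instead you move the field you want ordered into a composite map output key and then wire up the partitioner and grouping comparator. A rough sketch using Text keys of the form "naturalKey\tsecondaryField" (the class names are my own placeholders, not Hadoop classes):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SecondarySortSupport {

  /** Partition on the natural key only, so all secondaries of one key meet in one reducer. */
  public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String natural = key.toString().split("\t", 2)[0];
      return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Group reduce() calls on the natural key; full-key sorting still orders the secondary field. */
  public static class NaturalKeyGrouper extends WritableComparator {
    public NaturalKeyGrouper() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String x = a.toString().split("\t", 2)[0];
      String y = b.toString().split("\t", 2)[0];
      return x.compareTo(y);
    }
  }

  public static void wire(JobConf conf) {
    conf.setPartitionerClass(NaturalKeyPartitioner.class);
    conf.setOutputValueGroupingComparator(NaturalKeyGrouper.class);
    // default Text ordering already sorts by natural key, then secondary field, within each partition
  }
}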
Running Balancer from API
Hi, I'm trying to start the balancer from the API (org.apache.hadoop.hdfs.server.balancer.Balancer.main()), but I get a NullPointerException. 09/03/23 15:17:37 ERROR dfs.Balancer: java.lang.NullPointerException at org.apache.hadoop.dfs.Balancer.run(Balancer.java:1453) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.dfs.Balancer.main(Balancer.java:792) It's this line (Balancer.java:1453): fs.delete(BALANCER_ID_PATH, true); The process doesn't start at all; I assume it tries to delete a path that the balancer hasn't created yet. Is this a known issue? Rasit
Re: Does hadoop-default.xml + hadoop-site.xml matter for whole cluster or each node?
Some parameters are global (I can't give an example now), they are cluster-wide even if they're defined in hadoop-site.xml Rasit 2009/3/9 Nick Cen cenyo...@gmail.com for Q1: i think so , but i think it is a good practice to keep the hadoop-default.xml untouched. for Q2: i use this property for debugging in eclipse. 2009/3/9 pavelkolo...@gmail.com The hadoop-site.xml will take effect only on that specified node. So each node can have its own configuration with hadoop-site.xml. As i understand, parameters in hadoop-site overwrites these ones in hadoop-default. So hadoop-default also individual for each node? Q2: what means local as value of mapred.job.tracker? thanks -- http://daily.appspot.com/food/ -- M. Raşit ÖZDAŞ
MultipleOutputFormat with sorting functionality
Hi, all! I'm using multiple output format to write out 4 different files, each one of the same type. But it seems that the outputs aren't being sorted. Should they be sorted? Or isn't it implemented for multiple output format? Here is some code:

// in the main function
MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class, DoubleWritable.class, Text.class);

// in Reducer.configure()
mos = new MultipleOutputs(conf);

// in Reducer.reduce()
if (keystr.equalsIgnoreCase("BreachFace"))
  mos.getCollector("text", "BreachFace", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("Ejector"))
  mos.getCollector("text", "Ejector", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("FiringPin"))
  mos.getCollector("text", "FiringPin", reporter).collect(new Text(key), dbl);
else if (keystr.equalsIgnoreCase("WeightedSum"))
  mos.getCollector("text", "WeightedSum", reporter).collect(new Text(key), dbl);
else
  mos.getCollector("text", "Diger", reporter).collect(new Text(key), dbl);

-- M. Raşit ÖZDAŞ
Re: MapReduce jobs with expensive initialization
Owen, I tried this and it doesn't work. I doubt the static singleton method will work either, since it's more or less the same. Rasit

2009/3/2 Owen O'Malley omal...@apache.org: On Mar 2, 2009, at 3:03 AM, Tom White wrote: I believe the static singleton approach outlined by Scott will work since the map classes are in a single classloader (but I haven't actually tried this). Even easier, you should just be able to do it with static initialization in the Mapper class. (I haven't tried it either...) -- Owen

-- M. Raşit ÖZDAŞ
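For what it's worth, the variant I would try is lazy initialization in configure(), which at least guarantees one load per task JVM (and shared state across threads when a multithreaded mapper is used); in 0.19 each task normally gets its own JVM, so it may not save much beyond that. The class name, the job property and the use of a Properties file are all placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ExpensiveInitMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared by all map() calls (and all Mapper instances) inside one task JVM.
  private static Properties dictionary;

  public void configure(JobConf job) {
    synchronized (ExpensiveInitMapper.class) {
      if (dictionary == null) {
        dictionary = new Properties();
        try {
          // "my.dictionary.path" is a made-up job property pointing at a local file
          dictionary.load(new FileInputStream(job.get("my.dictionary.path")));
        } catch (IOException e) {
          throw new RuntimeException("dictionary load failed", e);
        }
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String translated = dictionary.getProperty(value.toString(), value.toString());
    output.collect(value, new Text(translated));
  }
}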
Re: Reduce doesn't start until map finishes
So, is there currently no solution to my problem? Should I live with it? Or do we have to have a JIRA for this? What do you think?

2009/3/4 Nick Cen cenyo...@gmail.com: Thanks, about the Secondary Sort, can you provide some example? What do the intermediate keys stand for? Assume I have two mappers, m1 and m2. The output of m1 is (k1,v1),(k2,v2) and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belong to the same partition and k1 < k2, so I think the order inside the reducer may be: (k1,v1) (k1,v3) (k2,v2) (k2,v4) Can the Secondary Sort change this order?

2009/3/4 Chris Douglas chri...@yahoo-inc.com: The output of each map is sorted by partition and by key within that partition. The reduce merges sorted map output assigned to its partition into the reduce. The following may be helpful: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html If your job requires total order, consider o.a.h.mapred.lib.TotalOrderPartitioner. -C

On Mar 3, 2009, at 7:24 PM, Nick Cen wrote: Can you provide more info about sorting? Does the sort happen on the whole data set, or just on the specified partition?

2009/3/4 Mikhail Yakshin greycat.na@gmail.com: On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote: This is normal behavior. The Reducer is guaranteed to receive all the results for its partition in sorted order. No reduce can start until all the maps are completed, since any running map could emit a result that would violate the order for the results it currently has. -C _Reducers_ usually start almost immediately and start downloading data emitted by mappers as they go. This is their first phase. Their second phase can start only after completion of all mappers. In their second phase, they're sorting received data, and in their third phase they're doing real reduction. -- WBR, Mikhail Yakshin

-- http://daily.appspot.com/food/ -- http://daily.appspot.com/food/ -- M. Raşit ÖZDAŞ
Re: When do we use the Key value for a map function?
Amit, it's not used in this example, but it has other uses. With the default TextInputFormat the key is the byte offset of the line within the input file; in my own case, for example, I needed to pass in the name of the input file as the key. Rasit

2009/3/1 Kumar, Amit H. ahku...@odu.edu: A very basic question about the WordCount example below: I don't see why we need the LongWritable key argument in the map function. Can anybody tell me its importance? As I understand it, the worker process reads the designated input split as a series of strings, which the map function operates on to produce the key/value pair, in this case via the 'output' variable. Then why would one need the LongWritable key as an argument of the map function? Thank you, Amit

<snip>
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
</snip>
Re: Reduce doesn't start until map finishes
Strange, that I've last night tried 1 input files (maps), waiting time after maps increases (probably linearly) 2009/3/2 Rasit OZDAS rasitoz...@gmail.com I have 6 reducers, Nick, still no luck.. 2009/3/2 Nick Cen cenyo...@gmail.com how many reducer do you have? You should make this value larger then 1 to make mapper and reducer run concurrently. You can set this value from JobConf.*setNumReduceTasks*(). 2009/3/2 Rasit OZDAS rasitoz...@gmail.com Hi! Whatever code I run on hadoop, reduce starts a few seconds after map finishes. And worse, when I run 10 jobs parallely (using threads and sending one after another) all maps finish sequentially, then after 8-10 seconds reduces start. I use reducer also as combiner, my cluster has 6 machines, namenode and jobtracker run also as slaves. There were 44 maps and 6 reduces in the last example, I never tried a bigger job. What can the problem be? I've read somewhere that this is not the normal behaviour. Replication factor is 3. Thank you in advance for any pointers. Rasit -- http://daily.appspot.com/food/ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: why print this error when using MultipleOutputFormat?
Qiang, I couldn't find now which one, but there is a JIRA issue about MultipleTextOutputFormat (especially when reducers = 0). If you have no reducers, you can try having one or two, then you can see if your problem is related with this one. Cheers, Rasit 2009/2/25 ma qiang maqiang1...@gmail.com Thanks for your reply. If I increase the number of computers, can we solve this problem of running out of file descriptors? On Wed, Feb 25, 2009 at 11:07 AM, jason hadoop jason.had...@gmail.com wrote: My 1st guess is that your application is running out of file descriptors,possibly because your MultipleOutputFormat instance is opening more output files than you expect. Opening lots of files in HDFS is generally a quick route to bad job performance if not job failure. On Tue, Feb 24, 2009 at 6:58 PM, ma qiang maqiang1...@gmail.com wrote: Hi all, I have one class extends MultipleOutputFormat as below, public class MyMultipleTextOutputFormatK, V extends MultipleOutputFormatK, V { private TextOutputFormatK, V theTextOutputFormat = null; @Override protected RecordWriterK, V getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException { if (theTextOutputFormat == null) { theTextOutputFormat = new TextOutputFormatK, V(); } return theTextOutputFormat.getRecordWriter(fs, job, name, arg3); } @Override protected String generateFileNameForKeyValue(K key, V value, String name) { return name + _ + key.toString(); } } also conf.setOutputFormat(MultipleTextOutputFormat2.class) in my job configuration. but when the program run, error print as follow: 09/02/25 10:22:32 INFO mapred.JobClient: Task Id : attempt_200902250959_0002_r_01_0, Status : FAILED java.io.IOException: Could not read from stream at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119) at java.io.DataInputStream.readByte(DataInputStream.java:248) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346) at org.apache.hadoop.io.Text.readString(Text.java:400) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/02/25 10:22:42 INFO mapred.JobClient: map 100% reduce 69% 09/02/25 10:22:55 INFO mapred.JobClient: map 100% reduce 0% 09/02/25 10:22:55 INFO mapred.JobClient: Task Id : attempt_200902250959_0002_r_00_1, Status : FAILED org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/qiang/output/_temporary/_attempt_200902250959_0002_r_00_1/part-0_t0x5y3 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy1.addBlock(Unknown Source) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
Re: Hadoop Streaming -file option
Hadoop uses its own IPC mechanism (not SCP) for file copy operations; datanodes listen on port 50010 for data transfers. I assume it sends the file as a byte stream. Cheers, Rasit

2009/2/23 Bing TANG whutg...@gmail.com: Hi everyone, could someone tell me the principle of -file when using Hadoop Streaming? I want to ship a big file to the slaves, so how does it work? Does Hadoop use SCP to copy? How does Hadoop deal with the -file option?
Re: Problems getting Eclipse Hadoop plugin to work.
Erik, did you place the ports correctly in the properties window? Port 9000 under Map/Reduce Master on the left, 9001 under DFS Master on the right.

2009/2/19 Erik Holstad erikhols...@gmail.com: Thanks guys! I'm running Linux and the remote cluster is also Linux. I have the properties set up like that already on my remote cluster, but I'm not sure where to input this info into Eclipse. And when changing the ports to 9000 and 9001 I get: Error: java.io.IOException: Unknown protocol to job tracker: org.apache.hadoop.dfs.ClientProtocol Regards, Erik

-- M. Raşit ÖZDAŞ
Re: empty log file...
Zander, I've looked at my datanode logs on the slaves, and they are all quite small, although we've run many jobs on them. Running 2 new jobs also didn't add anything to them. (As I understand from the contents of the logs, hadoop especially logs operations about DFS performance tests there.) Cheers, Rasit

2009/2/20 zander1013 zander1...@gmail.com: Hi, I am setting up hadoop for the first time on a multi-node cluster. Right now I have two nodes; the two-node cluster consists of two laptops connected via an ad-hoc wifi network, and they do not have access to the internet. I formatted the datanodes on both machines prior to startup. The output from the commands /usr/local/hadoop/bin/start-all.sh, jps (on both machines), and /usr/local/hadoop/bin/stop-all.sh all appears normal. However the file /usr/local/hadoop/logs/hadoop-hadoop-datanode-node1.log (the slave node) is empty. The same file for the master node shows the startup and shutdown events as normal and without error. Is it okay that the log file on the slave is empty? zander

-- M. Raşit ÖZDAŞ
Re: Map/Recuce Job done locally?
Philipp, I have no problem running jobs locally with eclipse (via hadoop plugin) and observing it from browser. (Please note that jobtracker page doesn't refresh automatically, you need to refresh it manually.) Cheers, Rasit 2009/2/19 Philipp Dobrigkeit pdobrigk...@gmx.de When I start my job from eclipse it gets processed and the output is generated, but it never shows up in my JobTracker, which is opened in my browser. Why is this happening? -- M. Raşit ÖZDAŞ
Re: Problems getting Eclipse Hadoop plugin to work.
Erik, try adding the following properties to hadoop-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://ip_address:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://ip_address:9001</value>
</property>

This way your ports become static. Then use port 9001 for MR and 9000 for HDFS in your properties window. If it still doesn't work, try writing the IP address instead of the host name as the target host. Hope this helps, Rasit

2009/2/18 Erik Holstad erikhols...@gmail.com: I'm using Eclipse 3.3.2 and want to view my remote cluster using the Hadoop plugin. Everything shows up and I can see the map/reduce perspective, but when trying to connect to a location I get: Error: Call failed on local exception I've set the host to for example xx0, where xx0 is a remote machine accessible from the terminal, and the ports to 50020/50040 for M/R master and DFS master respectively. Is there anything I'm missing to set for remote access to the Hadoop cluster? Regards, Erik

-- M. Raşit ÖZDAŞ
Re: Allowing other system users to use Hadoop
Nicholas, like Matei said, there are two possibilities in terms of permissions (every permissions command works just like in linux): 1. Create a directory for each user and make the user the owner of that directory: hadoop dfs -chown ... (assuming hadoop doesn't need write access to any file outside the user's home directory). 2. Change the group ownership of all files in HDFS to a group that every user belongs to (hadoop dfs -chgrp -R groupname /), then give that group write access, again to all files (hadoop dfs -chmod -R g+w /). Here, whenever a user runs jobs, hadoop automatically creates a separate home directory. This way is better for a development environment, I think. Cheers, Rasit

2009/2/18 Matei Zaharia ma...@cloudera.com: Other users should be able to submit jobs using the same commands (bin/hadoop ...). Are there errors you ran into? One thing is that you'll need to grant them permissions over any files in HDFS that you want them to read. You can do it using bin/hadoop fs -chmod, which works like chmod on Linux. You may need to run this as the root user (sudo bin/hadoop fs -chmod). Also, I don't remember exactly, but you may need to create home directories for them in HDFS as well (again create them as root, and then sudo bin/hadoop fs -chown them).

On Tue, Feb 17, 2009 at 10:48 AM, Nicholas Loulloudes loulloude...@cs.ucy.ac.cy wrote: Hi all, I just installed Hadoop (single node) on a Linux Ubuntu distribution as per the instructions found on the following website: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) I followed the instructions on the website to create a hadoop system user and group and I was able to run a Map Reduce job successfully. What I want to do now is to create more system users which will be able to use Hadoop for running Map Reduce jobs. Is there any guide on how to achieve this? Any suggestions will be highly appreciated. Thanks in advance, Nicholas Loulloudes High Performance Computing Systems Laboratory (HPCL) University of Cyprus, Nicosia, Cyprus

-- M. Raşit ÖZDAŞ
Re: GenericOptionsParser warning
Hi, there is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743 Strangely, I searched all the source code and this check exists in only 2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
           "Applications should implement Tool for the same.");
}

It's just an if block for logging, with no extra checks. Am I missing something? If your class implements Tool, then there shouldn't be a warning. Cheers, Rasit

2009/2/18 Steve Loughran ste...@apache.org: Sandhya E wrote: Hi all, I prepare my JobConf object in a java class by calling various set APIs on the JobConf object. When I submit the jobconf object using JobClient.runJob(conf), I'm seeing the warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. From the hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning, but if I set this flag to true, will it have other side effects? Thanks, Sandhya

Seen this message too - and it annoys me; I haven't tracked it down.
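For completeness, the usual way to make the warning go away is to implement Tool and let ToolRunner do the parsing, which should set the flag above on the configuration the job is built from. A minimal skeleton; the class name and the use of args[0]/args[1] as paths are placeholders:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already carries whatever -D / -fs / -jt options ToolRunner parsed
    JobConf conf = new JobConf(getConf(), MyJob.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // ... setMapperClass / setReducerClass / output key and value classes ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyJob(), args));  // ToolRunner applies GenericOptionsParser
  }
}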
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
John, did you try the -D option instead of -jobconf? I had the -D option in my code; when I changed it to -jobconf, this is what I get:

... ...
Options:
  -input <path>                   DFS input file(s) for the Map step
  -output <path>                  DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>     The streaming command to run
  -combiner <JavaClassName>       Combiner has to be a Java class
  -reducer <cmd|JavaClassName>    The streaming command to run
  -file <file>                    File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName      Optional.
  -numReduceTasks <num>           Optional.
  -inputreader <spec>             Optional.
  -cmdenv <n>=<v>                 Optional. Pass env.var to streaming commands
  -mapdebug <path>                Optional. To run this script when a map task fails
  -reducedebug <path>             Optional. To run this script when a reduce task fails
  -verbose

Generic options supported are
  -conf <configuration file>      specify an application configuration file
  -D <property=value>             use value for given property
  -fs <local|namenode:port>       specify a namenode
  -jt <local|jobtracker:port>     specify a job tracker
  -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is bin/hadoop command [genericOptions] [commandOptions]
For more details about these options: Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

I think -jobconf is not used in v.0.19.

2009/2/18 S D sd.codewarr...@gmail.com: I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine, but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per-job basis. Any hints? On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning. Thanks, John

-- M. Raşit ÖZDAŞ
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I see, John. I also use 0.19, just to note, -D option should come first, since it's one of generic options. I use it without any errors. Cheers, Rasit 2009/2/18 S D sd.codewarr...@gmail.com Thanks for your response Rasit. You may have missed a portion of my post. On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning). I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well? John On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote: John, did you try -D option instead of -jobconf, I had -D option in my code, I changed it with -jobconf, this is what I get: ... ... Options: -inputpath DFS input file(s) for the Map step -output path DFS output directory for the Reduce step -mapper cmd|JavaClassName The streaming command to run -combiner JavaClassName Combiner has to be a Java class -reducer cmd|JavaClassName The streaming command to run -file file File/dir to be shipped in the Job jar file -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional. -outputformat TextOutputFormat(default)|JavaClassName Optional. -partitioner JavaClassName Optional. -numReduceTasks num Optional. -inputreader spec Optional. -cmdenv n=vOptional. Pass env.var to streaming commands -mapdebug path Optional. To run this script when a map task fails -reducedebug path Optional. To run this script when a reduce task fails -verbose Generic options supported are -conf configuration file specify an application configuration file -D property=valueuse value for given property -fs local|namenode:port specify a namenode -jt local|jobtracker:portspecify a job tracker -files comma separated list of filesspecify comma separated files to be copied to the map reduce cluster -libjars comma separated list of jarsspecify comma separated jar files to include in the classpath. -archives comma separated list of archivesspecify comma separated archives to be unarchived on the compute machines. The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] For more details about these options: Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info I think -jobconf is not used in v.0.19 . 2009/2/18 S D sd.codewarr...@gmail.com I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per job basis. Any hints? On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning). Thanks, John -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0
Stefan and Thibaut, are you using MultipleOutputFormat, and how many reducers do you have? If you're using MultipleOutputFormat and have no reducer, there is a JIRA ticket about this issue: https://issues.apache.org/jira/browse/HADOOP-5268 There is also a different JIRA issue (not resolved yet, but it gives some underlying info): https://issues.apache.org/jira/browse/HADOOP-4264 Or this issue (not resolved): https://issues.apache.org/jira/browse/HADOOP-1583 Rasit

2009/2/16 Thibaut_ tbr...@blue.lu: I have the same problem. Is there any solution to this? Thibaut

-- M. Raşit ÖZDAŞ
Re: Can never restart HDFS after a day or two
I agree with Amandeep; the data will then remain forever unless you manually delete it. If we're on the right track, changing the hadoop.tmp.dir property to somewhere outside /tmp, or changing dfs.name.dir and dfs.data.dir, should be enough for basic use (I didn't have to change anything else). Cheers, Rasit

2009/2/17 Amandeep Khurana ama...@gmail.com: Where are your namenode and datanode storing the data? By default it goes into the /tmp directory. You might want to move that out of there. Amandeep Khurana, Computer Science Graduate Student, University of California, Santa Cruz

On Mon, Feb 16, 2009 at 8:11 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi all, I consistently have this problem that I can run HDFS and restart it after short breaks of a few hours, but the next day I always have to reformat HDFS before the daemons begin to work. Is that normal? Maybe this is treated as temporary data, and the results need to be copied out of HDFS and not stored for long periods of time? I verified that the files in /tmp related to hadoop are seemingly intact. Thank you, Mark

-- M. Raşit ÖZDAŞ
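A minimal hadoop-site.xml sketch of the change suggested above; the directories are only examples, anything outside /tmp that survives reboots and cleanup jobs will do:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/hadoop/tmp</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/var/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/var/hadoop/dfs/data</value>
</property>

Note that after pointing these at new directories you will most likely need to either move the existing data over or reformat the namenode, since the old image and blocks stay where they were.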
Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0
:D Then I found out that there are 3 similar issues about this problem :D Quite useful information, isn't it? ;)

2009/2/17 Thibaut_ tbr...@blue.lu: Hello Rasit, https://issues.apache.org/jira/browse/HADOOP-5268 is my bug report. Thibaut

-- M. Raşit ÖZDAŞ
Re: HDFS bytes read job counters?
Nathan, if you're using BytesWritable, I've heard that it doesn't return only the valid bytes; it actually returns more than that. The issue is discussed here: http://www.nabble.com/can%27t-read-the-SequenceFile-correctly-td21866960.html Cheers, Rasit

2009/2/18 Nathan Marz nat...@rapleaf.com: Hello, I'm seeing very odd numbers on the HDFS job tracker page. I have a job that operates over approximately 200 GB of data (209715200047 bytes to be exact), and HDFS bytes read is 2,103,170,802,501 (2 TB). Map input bytes is 209,714,811,510, which is a correct number. The job only took 10 minutes to run, so there's no way that much data was actually read. Anyone have any idea what's going on here? Thanks, Nathan Marz

-- M. Raşit ÖZDAŞ
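If that is the cause, a small helper like this (just a sketch) avoids counting the padding: getBytes() returns the backing array, but only the first getLength() bytes of it are meaningful.

import org.apache.hadoop.io.BytesWritable;

public class BytesUtil {
  /** Copies only the valid portion of a BytesWritable; getBytes() alone may include padding. */
  public static byte[] validBytes(BytesWritable bw) {
    byte[] valid = new byte[bw.getLength()];
    System.arraycopy(bw.getBytes(), 0, valid, 0, bw.getLength());
    return valid;
  }
}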
Re: Copying a file to specified nodes
Thanks, Jeff. After considering the JIRA link you've given and making some investigation: it seems that this JIRA ticket didn't draw much attention, so it will take a long time to be considered. After some more investigation I found out that when I copy a file to HDFS from a specific DataNode, the first copy will be written to that DataNode itself. This solution will take long to implement, I think, but we definitely need this feature, so if we have no other choice we'll go through with it. Any further info (or comments on my solution) is appreciated. Cheers, Rasit

2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Rasit, I'm not sure I fully understand your description of the problem, but you might want to check out the JIRA ticket for making the replica placement algorithms in HDFS pluggable (https://issues.apache.org/jira/browse/HADOOP-3799) and add your use case there. Regards, Jeff

On Tue, Feb 10, 2009 at 5:05 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, we have thousands of files, each dedicated to a user. (Each user has access to other users' files, but they use this not very often.) Each user runs map-reduce jobs on the cluster, so we should spread his/her files equally across the cluster, so that every machine can take part in the process (assuming he/she is the only user running jobs). For this we should initially copy files to specified nodes: User A: first file: Node 1, second file: Node 2, etc. User B: first file: Node 1, second file: Node 2, etc. I know hadoop also creates replicas, but in our solution at least one copy will be in the right place (or we're willing to control the other replicas too). Rebalancing is also not a problem, assuming it uses information about how heavily a computer is in use; it even helps for a better organization of files. How can we copy files to specified nodes? Or do you have a better solution for us? I couldn't find a solution to this; probably such an option doesn't exist, but I wanted to get an expert's opinion about it. Thanks in advance, Rasit

-- M. Raşit ÖZDAŞ
Re: datanode not being started
Sandy, as far as I remember, there were some threads about the same problem (I don't know if it's solved). Searching the mailing list for this error: could only be replicated to 0 nodes, instead of 1 may help. Cheers, Rasit 2009/2/16 Sandy snickerdoodl...@gmail.com: just some more information: hadoop fsck produces: Status: HEALTHY Total size: 0 B Total dirs: 9 Total files: 0 (Files currently being written: 1) Total blocks (validated): 0 Minimally replicated blocks: 0 Over-replicated blocks: 0 Under-replicated blocks: 0 Mis-replicated blocks: 0 Default replication factor: 1 Average block replication: 0.0 Corrupt blocks: 0 Missing replicas: 0 Number of data-nodes: 0 Number of racks: 0 The filesystem under path '/' is HEALTHY on the newly formatted hdfs. jps says: 4723 Jps 4527 NameNode 4653 JobTracker I can't copy files onto the dfs since I get NotReplicatedYetExceptions, which I suspect has to do with the fact that there are no datanodes. My cluster is a single MacPro with 8 cores. I haven't had to do anything extra before in order to get datanodes to be generated. 09/02/15 15:56:27 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0] copyFromLocal: Could not get block locations. Aborting... The corresponding error in the logs is: 2009-02-15 15:56:27,123 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9000, call addBlock(/user/hadoop/input/.DS_Store, DFSClient_755366230) from 127.0.0.1:49796: error: java.io.IOException: File /user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of 1 java.io.IOException: File /user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1120) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) On Sun, Feb 15, 2009 at 3:26 PM, Sandy snickerdoodl...@gmail.com wrote: Thanks for your responses. I checked in the namenode and jobtracker logs and both say: INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in safe mode. The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically. org.apache.hadoop.dfs.SafeModeException: Cannot delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in safe mode. The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically. 
at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505) at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477) at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888) I think this is a continuation of my running problem. The nodes stay in safe mode, but won't come out, even after several minutes. I believe this is due to the fact that it keep trying to contact a datanode that does not exist. Any suggestions on what I can do? I have recently tried to reformat the hdfs, using bin/hadoop namenode -format. From the output directed to standard out, I thought this completed correctly: Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name ? (Y or N) Y 09/02/15 15:16:39 INFO fs.FSNamesystem: fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5 09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup 09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true 09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0 seconds. 09/02/15 15:16:39 INFO dfs.Storage: Storage directory /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been successfully formatted. 09/02/15 15:16:39 INFO dfs.NameNode: SHUTDOWN_MSG:
Re: Copying a file to specified nodes
Yes, I've tried the long solution; when I execute ./hadoop dfs -put ... from a datanode, in every case one copy gets written to that datanode. But I think I would have to use SSH for this. Does anybody know a better way? Thanks, Rasit 2009/2/16 Rasit OZDAS rasitoz...@gmail.com: Thanks, Jeff. After considering the JIRA link you've given and doing some investigation: it seems that this JIRA ticket didn't draw much attention, so it will take a long time to be addressed. After some more investigation I found out that when I copy a file to HDFS from a specific DataNode, the first copy will be written to that DataNode itself. This solution will take long to implement, I think. But we definitely need this feature, so if we have no other choice, we'll go through with it. Any further info (or comments on my solution) is appreciated. Cheers, Rasit 2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Rasit, I'm not sure I fully understand your description of the problem, but you might want to check out the JIRA ticket for making the replica placement algorithms in HDFS pluggable (https://issues.apache.org/jira/browse/HADOOP-3799) and add your use case there. Regards, Jeff On Tue, Feb 10, 2009 at 5:05 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, We have thousands of files, each dedicated to a user. (Each user has access to other users' files, but they use this access only rarely.) Each user runs map-reduce jobs on the cluster, so we should spread his/her files equally across the cluster, so that every machine can take part in the process (assuming he/she is the only user running jobs). For this we should initially copy files to specified nodes: User A : first file : Node 1, second file: Node 2, .. etc. User B : first file : Node 1, second file: Node 2, .. etc. I know Hadoop also creates replicas, but in our solution at least one copy will be in the right place (or we're willing to control the other replicas too). Rebalancing is also not a problem, assuming it uses information about how heavily a machine is in use. It even helps toward a better organization of files. How can we copy files to specified nodes? Or do you have a better solution for us? I couldn't find a solution to this; probably such an option doesn't exist. But I wanted to take an expert's opinion about this. Thanks in advance.. Rasit -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
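(A minimal sketch of the workaround discussed above, for reference: a small client run on the chosen DataNode itself, so that with the default block placement policy the first replica of each block lands on that node. The class name and paths are made up for illustration; only the standard FileSystem API is used.)

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Run on the DataNode that should receive the first replica of the file.
public class LocalPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);              // the HDFS named in fs.default.name
    InputStream in = new FileInputStream(args[0]);     // local source file
    FSDataOutputStream out = fs.create(new Path(args[1]), true);  // overwrite if it exists
    IOUtils.copyBytes(in, out, conf, true);            // copies, then closes both streams
  }
}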
Re: datanode not being started
Sandy, I have no idea about your issue :( Zander, Your problem is probably about this JIRA issue: http://issues.apache.org/jira/browse/HADOOP-1212 Here is 2 workarounds explained: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)#java.io.IOException:_Incompatible_namespaceIDs I haven't tried it, hope it helps. Rasit 2009/2/17 zander1013 zander1...@gmail.com: hi, i am not seeing the DataNode run either. but i am seeing an extra process TaskTracker run. here is what hapens when i start the cluster run jps and stop the cluster... had...@node0:/usr/local/hadoop$ bin/start-all.sh starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-node0.out node0.local: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-node0.out node1.local: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-node1.out node0.local: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node0.out starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node0.out node0.local: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node0.out node1.local: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node1.out had...@node0:/usr/local/hadoop$ jps 13353 TaskTracker 13126 SecondaryNameNode 12846 NameNode 13455 Jps 13232 JobTracker had...@node0:/usr/local/hadoop$ bin/stop-all.sh stopping jobtracker node0.local: stopping tasktracker node1.local: stopping tasktracker stopping namenode node0.local: no datanode to stop node1.local: no datanode to stop node0.local: stopping secondarynamenode had...@node0:/usr/local/hadoop$ here is the tail of the log file for the session above... / 2009-02-16 19:35:13,999 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = node1/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-02-16 19:35:18,999 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data: namenode namespaceID = 1050914495; datanode namespaceID = 722953254 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:287) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:205) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284) 2009-02-16 19:35:19,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at node1/127.0.1.1 / i have not seen DataNode run yet. i have only started and stopped the cluster a couple of times. i tried to reformat datanode and namenode with bin/hadoop datanode -format and bin/hadoop namenode -format from /usr/local/hadoop dir. 
please advise zander Mithila Nagendra wrote: Hey Sandy I had a similar problem with Hadoop. All I did was I stopped all the daemons using stop-all.sh. Then formatted the namenode again using hadoop namenode -format. After this I went on to restarting everything by using start-all.sh I hope you dont have much data on the datanode, reformatting it would erase everything out. Hope this helps! Mithila On Sat, Feb 14, 2009 at 2:39 AM, james warren ja...@rockyou.com wrote: Sandy - I suggest you take a look into your NameNode and DataNode logs. From the information posted, these likely would be at /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.log /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.log If the cause isn't obvious from what you see there, could you please post the last few lines from each log? -jw On Fri, Feb 13, 2009 at 3:28 PM, Sandy snickerdoodl...@gmail.com wrote: Hello, I would really appreciate any help I can
Re: Running Map and Reduce Sequentially
Kris, This is already the case when you have only 1 reducer; if that doesn't have any side effects for you, you could simply set the number of reduce tasks to 1. Rasit 2009/2/14 Kris Jirapinyo kjirapi...@biz360.com: Is there a way to tell Hadoop to not run Map and Reduce concurrently? I'm running into a problem where I set the jvm to Xmx768 and it seems like 2 mappers and 2 reducers are running on each machine that only has 1.7GB of ram, so it complains of not being able to allocate memory... (which makes sense since 4x768mb > 1.7GB). So, if it would just finish the Map and then start on Reduce, then there would be 2 jvm's running on one machine at any given time and thus possibly avoid this out of memory error. -- M. Raşit ÖZDAŞ
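(For reference, a sketch of the knobs touched on above, with example values. setNumReduceTasks and mapred.child.java.opts are per-job settings; the two mapred.tasktracker.*.tasks.maximum properties are TaskTracker-side settings that belong in hadoop-site.xml and take effect only when the daemons restart, so they appear only as comments.)

import org.apache.hadoop.mapred.JobConf;

public class SequentialishJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SequentialishJob.class);
    conf.setNumReduceTasks(1);                       // single reducer, as suggested above
    conf.set("mapred.child.java.opts", "-Xmx400m");  // smaller task heap so 4 JVMs fit in 1.7 GB
    // Cluster-side settings (hadoop-site.xml on each TaskTracker, needs a daemon restart):
    //   mapred.tasktracker.map.tasks.maximum    = 1
    //   mapred.tasktracker.reduce.tasks.maximum = 1
    // ... input/output formats, paths, mapper/reducer classes, then JobClient.runJob(conf) ...
  }
}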
Re: Hadoop setup questions
I agree with Amar and James, if you require permissions for your project, then 1. create a group in linux for your user. 2. give group write access to all files in HDFS. (hadoop dfs -chmod -R g+w / - or sth, I'm not totally sure.) 3. change group ownership of all files in HDFS. (hadoop dfs -chgrp -R your_group_name / - I'm not totally sure again..) cheers, Rasit 2009/2/12 james warren ja...@rockyou.com: Like Amar said. Try adding property namedfs.permissions/name valuefalse/value /property to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml), restart your daemons and give it a whirl. cheers, -jw On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat ama...@yahoo-inc.com wrote: bjday wrote: Good morning everyone, I have a question about correct setup for hadoop. I have 14 Dell computers in a lab. Each connected to the internet and each independent of each other. All run CentOS. Logins are handled by NIS. If userA logs into the master and starts the daemons and UserB logs into the master and wants to run a job while the daemons from UserA are still running the following error occurs: copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=UserB, access=WRITE, inode=user:UserA:supergroup:rwxr-xr-x Looks like one of your files (input or output) is of different user. Seems like your DFS has permissions enabled. If you dont require permissions then disable it else make sure that the input/output paths are under your permission (/user/userB is the hone directory for userB). Amar what needs to be changed to allow UserB-UserZ to run their jobs? Does there need to be a local user the everyone logs into as and run from there? Should Hadoop be ran in an actual cluster instead of independent computers? Any ideas what is the correct configuration settings that allow it? I followed Ravi Phulari suggestions and followed: http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29 These allowed me to get Hadoop running on the 14 computers when I login and everything works fine, thank you Ravi. The problem occurs when additional people attempt to run jobs simultaneously. Thank you, Brian -- M. Raşit ÖZDAŞ
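(A rough equivalent of steps 2 and 3 above through the FileSystem API, in case the exact shell flags are in doubt; the group name is an example, setPermission/setOwner are not recursive so the sketch walks the tree itself, and it has to run as the HDFS superuser.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class GroupWritable {
  // Recursively give the group write access and assign a shared group (name is an example).
  static void fixup(FileSystem fs, Path p) throws Exception {
    fs.setPermission(p, new FsPermission((short) 0775));
    fs.setOwner(p, null, "hadoopusers");             // null user = leave the owner unchanged
    FileStatus status = fs.getFileStatus(p);
    if (status.isDir()) {
      FileStatus[] children = fs.listStatus(p);
      if (children != null) {
        for (FileStatus child : children) {
          fixup(fs, child.getPath());
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fixup(fs, new Path("/"));                        // run as the HDFS superuser
  }
}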
Re: Hadoop setup questions
With this configuration, any user having that group name will be able to write to any location.. (I've tried this in local network, though) 2009/2/14 Rasit OZDAS rasitoz...@gmail.com: I agree with Amar and James, if you require permissions for your project, then 1. create a group in linux for your user. 2. give group write access to all files in HDFS. (hadoop dfs -chmod -R g+w / - or sth, I'm not totally sure.) 3. change group ownership of all files in HDFS. (hadoop dfs -chgrp -R your_group_name / - I'm not totally sure again..) cheers, Rasit 2009/2/12 james warren ja...@rockyou.com: Like Amar said. Try adding property namedfs.permissions/name valuefalse/value /property to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml), restart your daemons and give it a whirl. cheers, -jw On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat ama...@yahoo-inc.com wrote: bjday wrote: Good morning everyone, I have a question about correct setup for hadoop. I have 14 Dell computers in a lab. Each connected to the internet and each independent of each other. All run CentOS. Logins are handled by NIS. If userA logs into the master and starts the daemons and UserB logs into the master and wants to run a job while the daemons from UserA are still running the following error occurs: copyFromLocal: org.apache.hadoop.security.AccessControlException: Permission denied: user=UserB, access=WRITE, inode=user:UserA:supergroup:rwxr-xr-x Looks like one of your files (input or output) is of different user. Seems like your DFS has permissions enabled. If you dont require permissions then disable it else make sure that the input/output paths are under your permission (/user/userB is the hone directory for userB). Amar what needs to be changed to allow UserB-UserZ to run their jobs? Does there need to be a local user the everyone logs into as and run from there? Should Hadoop be ran in an actual cluster instead of independent computers? Any ideas what is the correct configuration settings that allow it? I followed Ravi Phulari suggestions and followed: http://hadoop.apache.org/core/docs/current/quickstart.html http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29 These allowed me to get Hadoop running on the 14 computers when I login and everything works fine, thank you Ravi. The problem occurs when additional people attempt to run jobs simultaneously. Thank you, Brian -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: Best practices on spliltting an input line?
Hi, Andy Your problem seems to be a general Java problem rather than a Hadoop one; you may get better help in a Java forum. String.split uses regular expressions, which you definitely don't need. I would write my own split function, without regular expressions. This link may help to better understand the underlying operations: http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split There is also a StringTokenizer constructor that returns the delimiters as well: StringTokenizer(String str, String delim, boolean returnDelims); (I would write my own, though.) Rasit 2009/2/10 Andy Sautins andy.saut...@returnpath.net: I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some pretty reasonable performance differences between different tokenizing methods. For example, roughly it seems that to split a line on tokens ( tab delimited in my case ) Scanner is the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens ( i.e., parsing a,b,c,,d would return a,b,c,d instead of a,b,c,,d). The WordCount example uses StringTokenizer which makes sense to me, except I'm currently getting hung up on not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html ) which seems possible to use, but it sure seems like someone else has solved this problem already better than I have. So, my question is, is there a best practice for splitting an input line, especially when NULL tokens are expected ( i.e., two consecutive delimiter characters )? Any thoughts would be appreciated. Thanks Andy -- M. Raşit ÖZDAŞ
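(A small sketch of the hand-rolled split suggested above: it walks the line with indexOf, keeps empty tokens between consecutive delimiters, and uses no regular expressions. Nothing here is Hadoop-specific.)

import java.util.ArrayList;
import java.util.List;

public class FastSplit {
  // Splits on a single-character delimiter, preserving empty fields.
  static String[] split(String line, char delim) {
    List<String> fields = new ArrayList<String>();
    int start = 0;
    int idx;
    while ((idx = line.indexOf(delim, start)) != -1) {
      fields.add(line.substring(start, idx));
      start = idx + 1;
    }
    fields.add(line.substring(start));   // trailing (possibly empty) field
    return fields.toArray(new String[fields.size()]);
  }

  public static void main(String[] args) {
    // "a\tb\tc\t\td" -> [a] [b] [c] [] [d]  (the empty token is kept)
    for (String f : split("a\tb\tc\t\td", '\t')) {
      System.out.println("[" + f + "]");
    }
  }
}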
Re: stable version
Yes, version 0.18.3 is the most stable one. It includes additional patches without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: Loading native libraries
I also have the same problem. It would be wonderful if someone had some info about this.. Rasit 2009/2/10 Mimi Sun m...@rapleaf.com: I see UnsatisfiedLinkError. Also I'm calling System.getProperty("java.library.path") in the reducer and logging it. The only thing that prints out is ...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32 I'm using Cascading, not sure if that affects anything. - Mimi On Feb 10, 2009, at 11:40 AM, Arun C Murthy wrote: On Feb 10, 2009, at 11:06 AM, Mimi Sun wrote: Hi, I'm new to Hadoop and I'm wondering what the recommended method is for using native libraries in mapred jobs. I've tried the following separately: 1. set LD_LIBRARY_PATH in .bashrc 2. set LD_LIBRARY_PATH and JAVA_LIBRARY_PATH in hadoop-env.sh 3. set -Djava.library.path=... for mapred.child.java.opts For what you are trying (i.e. given that the JNI libs are present on all machines at a constant path) setting -Djava.library.path for the child task via mapred.child.java.opts should work. What are you seeing? Arun 4. change bin/hadoop to include $LD_LIBRARY_PATH in addition to the path it generates: HADOOP_OPTS=$HADOOP_OPTS -Djava.library.path=$LD_LIBRARY_PATH:$JAVA_LIBRARY_PATH 5. drop the .so files I need into hadoop/lib/native/... 1~3 didn't work, 4 and 5 did but seem to be hacks. I also read that I can do this using DistributedCache, but that seems to be extra work for loading libraries that are already present on each machine. (I'm using the JNI libs for Berkeley DB). It seems that there should be a way to configure java.library.path for the mapred jobs. Perhaps bin/hadoop should make use of LD_LIBRARY_PATH? Thanks, - Mimi -- M. Raşit ÖZDAŞ
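(A sketch of option 3 as Arun describes it, assuming the .so files sit at the same absolute path on every node; the path and heap value are examples. Note that mapred.child.java.opts replaces the default child options, so the -Xmx flag should be kept if it is needed.)

import org.apache.hadoop.mapred.JobConf;

public class NativeLibJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(NativeLibJob.class);
    // Every task JVM of this job is started with these options, so JNI libraries
    // under /usr/local/bdb/lib (an example path present on every node) become loadable.
    conf.set("mapred.child.java.opts",
             "-Xmx200m -Djava.library.path=/usr/local/bdb/lib");
    // ... rest of the job setup, then JobClient.runJob(conf) ...
  }
}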
Re: what's going on :( ?
Hi Mark, Try adding an extra property to that file and check whether Hadoop recognizes it. This way you can find out whether Hadoop is actually using your configuration file. 2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Mark, In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020. From my understanding of the code, your fs.default.name setting should have overridden this port to be 9000. It appears your Hadoop installation has not picked up the configuration settings appropriately. You might want to see if you have any Hadoop processes running and terminate them (bin/stop-all.sh should help) and then restart your cluster with the new configuration to see if that helps. Later, Jeff On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote: Mark Kerzner wrote: Hi, why is hadoop suddenly telling me Retrying connect to server: localhost/127.0.0.1:8020 with this configuration: <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> Shouldn't this be <value>hdfs://localhost:9001</value>? Amar </property> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> and both this http://localhost:50070/dfshealth.jsp and this http://localhost:50030/jobtracker.jsp links work fine? Thank you, Mark -- M. Raşit ÖZDAŞ
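(A tiny sketch of the check suggested above: print a known property plus a marker property that you add to hadoop-site.xml only for this test; if the marker comes back null from a client on the same classpath, the configuration file is not being picked up. The marker name is made up.)

import org.apache.hadoop.conf.Configuration;

public class ConfCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // loads hadoop-default.xml + hadoop-site.xml
    System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
    // 'my.test.marker' is a property added to hadoop-site.xml just for this test.
    System.out.println("my.test.marker     = " + conf.get("my.test.marker"));
  }
}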
Copying a file to specified nodes
Hi, We have thousands of files, each dedicated to a user. (Each user has access to other users' files, but they use this access only rarely.) Each user runs map-reduce jobs on the cluster, so we should spread his/her files equally across the cluster, so that every machine can take part in the process (assuming he/she is the only user running jobs). For this we should initially copy files to specified nodes: User A : first file : Node 1, second file: Node 2, .. etc. User B : first file : Node 1, second file: Node 2, .. etc. I know Hadoop also creates replicas, but in our solution at least one copy will be in the right place (or we're willing to control the other replicas too). Rebalancing is also not a problem, assuming it uses information about how heavily a machine is in use. It even helps toward a better organization of files. How can we copy files to specified nodes? Or do you have a better solution for us? I couldn't find a solution to this; probably such an option doesn't exist. But I wanted to take an expert's opinion about this. Thanks in advance.. Rasit
Re: Heap size error
Hi, Amandeep, I've copied the following lines from a site: -- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space This can have two reasons: * Your Java application has a memory leak. There are tools like YourKit Java Profiler that help you to identify such leaks. * Your Java application really needs a lot of memory (more than 128 MB by default!). In this case the Java heap size can be increased using the following runtime parameters: java -Xms<initial heap size> -Xmx<maximum heap size> Defaults are: java -Xms32m -Xmx128m You can set this either in the Java Control Panel or on the command line, depending on the environment in which you run your application. - Hope this helps, Rasit 2009/2/7 Amandeep Khurana ama...@gmail.com: I'm getting the following error while running my hadoop job: 09/02/06 15:33:03 INFO mapred.JobClient: Task Id : attempt_200902061333_0004_r_00_1, Status : FAILED java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Unknown Source) at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source) at java.lang.AbstractStringBuilder.append(Unknown Source) at java.lang.StringBuffer.append(Unknown Source) at TableJoin$Reduce.reduce(TableJoin.java:61) at TableJoin$Reduce.reduce(TableJoin.java:1) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430) at org.apache.hadoop.mapred.Child.main(Child.java:155) Any inputs? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- M. Raşit ÖZDAŞ
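(The trace points at StringBuffer.append inside the reducer, so besides raising the task heap via the mapred.child.java.opts property, e.g. -Xmx512m, it usually helps not to accumulate all values for a key into one string. A hedged sketch of a reducer that emits as it goes; the types are illustrative, not Amandeep's actual job.)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Instead of building one huge StringBuffer per key, write each joined
// record as soon as it is formed, so memory use stays bounded.
public class StreamingJoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());   // one output record per value
    }
  }
}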
Re: Cannot copy from local file system to DFS
Hi, Mithila, File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 I think your datanode isn't working properly. please take a look at log file of your datanode (logs/*datanode*.log). If there is no error in that log file, I've heard that hadoop can sometimes mark a datanode as BAD and refuses to send the block to that node, this can be the cause. (List, please correct me if I'm wrong!) Hope this helps, Rasit 2009/2/6 Mithila Nagendra mnage...@asu.edu: Hey all I was trying to run the word count example on one of the hadoop systems I installed, but when i try to copy the text files from the local file system to the DFS, it throws up the following exception: [mith...@node02 hadoop]$ jps 8711 JobTracker 8805 TaskTracker 8901 Jps 8419 NameNode 8642 SecondaryNameNode [mith...@node02 hadoop]$ cd .. [mith...@node02 mithila]$ ls hadoop hadoop-0.17.2.1.tar hadoop-datastore test [mith...@node02 mithila]$ hadoop/bin/hadoop dfs -copyFromLocal test test 09/02/06 11:26:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) at org.apache.hadoop.ipc.Client.call(Client.java:557) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842) 09/02/06 11:26:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping /user/mithila/test/20417.txt retries left 4 09/02/06 11:26:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mithila/test/20417.txt could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) at org.apache.hadoop.ipc.Client.call(Client.java:557) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
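(The could-only-be-replicated-to-0-nodes message usually means the NameNode currently sees no DataNodes. A hedged sketch of how to list the registered DataNodes from a client, assuming the pre-0.19 package layout used in this thread and that getDataNodeStats() exists in that version; hadoop dfsadmin -report gives the same information from the command line.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.dfs.DatanodeInfo;
import org.apache.hadoop.dfs.DistributedFileSystem;
import org.apache.hadoop.fs.FileSystem;

// Prints the DataNodes the NameNode knows about; zero entries means
// no DataNode has registered, which matches the replication error above.
public class NodeReport {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    DatanodeInfo[] nodes = dfs.getDataNodeStats();
    System.out.println("datanodes reported: " + nodes.length);
    for (DatanodeInfo node : nodes) {
      System.out.println("  " + node.getName());
    }
  }
}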
Re: copying binary files to a SequenceFile
Mark, http://stuartsierra.com/2008/04/24/a-million-little-files/comment-page-1 In this link there is a tool to create sequence files from tar.gz and tar.bz2 files. I don't think this is a real solution, but at least it frees some memory and postpones the problem (a last-resort option). Rasit 2009/2/5 Mark Kerzner markkerz...@gmail.com: Hi all, I am copying regular binary files to a SequenceFile, and I am using BytesWritable, to which I am giving all the byte[] content of the file. However, once it hits a file larger than my computer memory, it may have problems. Is there a better way? Thank you, Mark -- M. Raşit ÖZDAŞ
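(A sketch of one direction Mark could take instead of loading a whole file into a single BytesWritable: append the file to a SequenceFile in fixed-size chunks, so memory use stays bounded by the chunk size. The key scheme and chunk size are illustrative choices, not an established recipe.)

import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ChunkedSeqWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    FileInputStream in = new FileInputStream(args[0]);
    byte[] buf = new byte[4 * 1024 * 1024];          // 4 MB chunks (example size)
    int n, chunk = 0;
    while ((n = in.read(buf)) > 0) {
      // key = "filename:chunkIndex", value = the raw bytes of this chunk
      writer.append(new Text(args[0] + ":" + chunk++),
                    new BytesWritable(copyOf(buf, n)));
    }
    in.close();
    writer.close();
  }

  static byte[] copyOf(byte[] src, int len) {        // trim the buffer to the bytes actually read
    byte[] dst = new byte[len];
    System.arraycopy(src, 0, dst, 0, len);
    return dst;
  }
}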
Re: Bad connection to FS.
I can add a little method to follow namenode failures, I find out such problems by running first start-all.sh , then stop-all.sh if namenode starts without error, stop-all.sh gives the output stopping namenode.. , but in case of an error, it says no namenode to stop.. In case of an error, Hadoop log directory is always the first place to look. It doesn't save the day, but worths noting. Hope this helps, Rasit 2009/2/5 lohit lohit.vijayar...@yahoo.com: As noted by others NameNode is not running. Before formatting anything (which is like deleting your data), try to see why NameNode isnt running. search for value of HADOOP_LOG_DIR in ./conf/hadoop-env.sh if you have not set it explicitly it would default to your hadoop installation/logs/*namenode*.log Lohit - Original Message From: Amandeep Khurana ama...@gmail.com To: core-user@hadoop.apache.org Sent: Wednesday, February 4, 2009 5:26:43 PM Subject: Re: Bad connection to FS. Here's what I had done.. 1. Stop the whole system 2. Delete all the data in the directories where the data and the metadata is being stored. 3. Format the namenode 4. Start the system This solved my problem. I'm not sure if this is a good idea to do for you or not. I was pretty much installing from scratch so didnt mind deleting the files in those directories.. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 3:49 PM, TCK moonwatcher32...@yahoo.com wrote: I believe the debug logs location is still specified in hadoop-env.sh (I just read the 0.19.0 doc). I think you have to shut down all nodes first (stop-all), then format the namenode, and then restart (start-all) and make sure that NameNode comes up too. We are using a very old version, 0.12.3, and are upgrading. -TCK --- On Wed, 2/4/09, Mithila Nagendra mnage...@asu.edu wrote: From: Mithila Nagendra mnage...@asu.edu Subject: Re: Bad connection to FS. To: core-user@hadoop.apache.org, moonwatcher32...@yahoo.com Date: Wednesday, February 4, 2009, 6:30 PM @TCK: Which version of hadoop have you installed? @Amandeep: I did tried reformatting the namenode, but it hasn't helped me out in anyway. Mithila On Wed, Feb 4, 2009 at 4:18 PM, TCK moonwatcher32...@yahoo.com wrote: Mithila, how come there is no NameNode java process listed by your jps command? I would check the hadoop namenode logs to see if there was some startup problem (the location of those logs would be specified in hadoop-env.sh, at least in the version I'm using). -TCK --- On Wed, 2/4/09, Mithila Nagendra mnage...@asu.edu wrote: From: Mithila Nagendra mnage...@asu.edu Subject: Bad connection to FS. To: core-user@hadoop.apache.org core-user@hadoop.apache.org, core-user-subscr...@hadoop.apache.org core-user-subscr...@hadoop.apache.org Date: Wednesday, February 4, 2009, 6:06 PM Hey all When I try to copy a folder from the local file system in to the HDFS using the command hadoop dfs -copyFromLocal, the copy fails and it gives an error which says Bad connection to FS. How do I get past this? The following is the output at the time of execution: had...@renweiyu-desktop:/usr/local/hadoop$ jps 6873 Jps 6299 JobTracker 6029 DataNode 6430 TaskTracker 6189 SecondaryNameNode had...@renweiyu-desktop:/usr/local/hadoop$ ls bin docslib README.txt build.xmlhadoop-0.18.3-ant.jar libhdfs src c++ hadoop-0.18.3-core.jar librecordio webapps CHANGES.txt hadoop-0.18.3-examples.jar LICENSE.txt conf hadoop-0.18.3-test.jar logs contrib hadoop-0.18.3-tools.jar NOTICE.txt had...@renweiyu-desktop:/usr/local/hadoop$ cd .. 
had...@renweiyu-desktop:/usr/local$ ls bin etc games gutenberg hadoop hadoop-0.18.3.tar.gz hadoop-datastore include lib man sbin share src had...@renweiyu-desktop:/usr/local$ hadoop/bin/hadoop dfs -copyFromLocal gutenberg gutenberg 09/02/04 15:58:21 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 0 time(s). 09/02/04 15:58:22 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 1 time(s). 09/02/04 15:58:23 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 2 time(s). 09/02/04 15:58:24 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 3 time(s). 09/02/04 15:58:25 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 4 time(s). 09/02/04 15:58:26 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 5 time(s). 09/02/04 15:58:27 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310. Already tried 6 time(s). 09/02/04 15:58:28 INFO ipc.Client: Retrying connect to server: localhost/ 127.0.0.1:54310.
Re: Problem with Counters
Forgot to say, value 0 means that the requested counter does not exist. 2009/2/5 Rasit OZDAS rasitoz...@gmail.com: Sharath, I think the static enum definition should be out of Reduce class. Hadoop probably tries to find it elsewhere with MyCounter, but it's actually Reduce.MyCounter in your example. Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: I Tried the following...It gets compiled but the value of result seems to be 0 always. RunningJob running = JobClient.runJob(conf); Counters ct = new Counters(); ct = running.getCounters(); long result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 0, *MyCounter*).getCounter(); //even tried MyCounter.Key1 Does anyone know whay that is happening? Thanks, Sharath On Thu, Feb 5, 2009 at 5:59 AM, some speed speed.s...@gmail.com wrote: Hi Tom, I get the error : Cannot find Symbol* **MyCounter.ct_key1 * On Thu, Feb 5, 2009 at 5:51 AM, Tom White t...@cloudera.com wrote: Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote: Hi, Can someone help me with the usage of counters please? I am incrementing a counter in Reduce method but I am unable to collect the counter value after the job is completed. Its something like this: public static class Reduce extends MapReduceBase implements ReducerText, FloatWritable, Text, FloatWritable { static enum MyCounter{ct_key1}; public void reduce(..) throws IOException { reporter.incrCounter(MyCounter.ct_key1, 1); output.collect(..); } } -main method { RunningJob running = null; running=JobClient.runJob(conf); Counters ct = running.getCounters(); /* How do I Collect the ct_key1 value ***/ long res = ct.getCounter(MyCounter.ct_key1); } Thanks, Sharath -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: Problem with Counters
Sharath, I think the static enum definition should be out of Reduce class. Hadoop probably tries to find it elsewhere with MyCounter, but it's actually Reduce.MyCounter in your example. Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: I Tried the following...It gets compiled but the value of result seems to be 0 always. RunningJob running = JobClient.runJob(conf); Counters ct = new Counters(); ct = running.getCounters(); long result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 0, *MyCounter*).getCounter(); //even tried MyCounter.Key1 Does anyone know whay that is happening? Thanks, Sharath On Thu, Feb 5, 2009 at 5:59 AM, some speed speed.s...@gmail.com wrote: Hi Tom, I get the error : Cannot find Symbol* **MyCounter.ct_key1 * On Thu, Feb 5, 2009 at 5:51 AM, Tom White t...@cloudera.com wrote: Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote: Hi, Can someone help me with the usage of counters please? I am incrementing a counter in Reduce method but I am unable to collect the counter value after the job is completed. Its something like this: public static class Reduce extends MapReduceBase implements ReducerText, FloatWritable, Text, FloatWritable { static enum MyCounter{ct_key1}; public void reduce(..) throws IOException { reporter.incrCounter(MyCounter.ct_key1, 1); output.collect(..); } } -main method { RunningJob running = null; running=JobClient.runJob(conf); Counters ct = running.getCounters(); /* How do I Collect the ct_key1 value ***/ long res = ct.getCounter(MyCounter.ct_key1); } Thanks, Sharath -- M. Raşit ÖZDAŞ
Re: Problem with Counters
Sharath, You're using reporter.incrCounter(enumVal, intVal); to increment counter, I think method to get should also be similar. Try to use findCounter(enumVal).getCounter() or getCounter(enumVal). Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: In fact I put the enum in my Reduce method as the following link (from Yahoo) says so: http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html#metrics ---Look at the section under Reporting Custom Metrics. 2009/2/5 some speed speed.s...@gmail.com Thanks Rasit. I did as you said. 1) Put the static enum MyCounter{ct_key1} just above main() 2) Changed result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 1, Reduce.MyCounter).getCounter(); Still is doesnt seem to help. It throws a null pointer exception.Its not able to find the Counter. Thanks, Sharath On Thu, Feb 5, 2009 at 8:04 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Forgot to say, value 0 means that the requested counter does not exist. 2009/2/5 Rasit OZDAS rasitoz...@gmail.com: Sharath, I think the static enum definition should be out of Reduce class. Hadoop probably tries to find it elsewhere with MyCounter, but it's actually Reduce.MyCounter in your example. Hope this helps, Rasit 2009/2/5 some speed speed.s...@gmail.com: I Tried the following...It gets compiled but the value of result seems to be 0 always. RunningJob running = JobClient.runJob(conf); Counters ct = new Counters(); ct = running.getCounters(); long result = ct.findCounter(org.apache.hadoop.mapred.Task$Counter, 0, *MyCounter*).getCounter(); //even tried MyCounter.Key1 Does anyone know whay that is happening? Thanks, Sharath On Thu, Feb 5, 2009 at 5:59 AM, some speed speed.s...@gmail.com wrote: Hi Tom, I get the error : Cannot find Symbol* **MyCounter.ct_key1 * On Thu, Feb 5, 2009 at 5:51 AM, Tom White t...@cloudera.com wrote: Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote: Hi, Can someone help me with the usage of counters please? I am incrementing a counter in Reduce method but I am unable to collect the counter value after the job is completed. Its something like this: public static class Reduce extends MapReduceBase implements ReducerText, FloatWritable, Text, FloatWritable { static enum MyCounter{ct_key1}; public void reduce(..) throws IOException { reporter.incrCounter(MyCounter.ct_key1, 1); output.collect(..); } } -main method { RunningJob running = null; running=JobClient.runJob(conf); Counters ct = running.getCounters(); /* How do I Collect the ct_key1 value ***/ long res = ct.getCounter(MyCounter.ct_key1); } Thanks, Sharath -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
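(Putting the pieces of this thread together, a sketch of the pattern that should work with the old API: a top-level enum, incremented through the Reporter, then read back through the enum itself via Counters rather than by the group-name string used above. Job setup details are elided.)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class CounterExample {
  // Top-level enum, so the reducer and the driver refer to the same constant.
  public static enum MyCounter { CT_KEY1 }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, FloatWritable, Text, FloatWritable> {
    public void reduce(Text key, Iterator<FloatWritable> values,
                       OutputCollector<Text, FloatWritable> output, Reporter reporter)
        throws IOException {
      reporter.incrCounter(MyCounter.CT_KEY1, 1);
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterExample.class);
    conf.setReducerClass(Reduce.class);
    // ... input/output paths, formats, mapper, key/value classes as usual ...
    RunningJob running = JobClient.runJob(conf);
    Counters counters = running.getCounters();
    long value = counters.getCounter(MyCounter.CT_KEY1);   // look up by the enum itself
    System.out.println("CT_KEY1 = " + value);
  }
}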
Re: Not able to copy a file to HDFS after installing
Rajshekar, It seems that your namenode isn't able to load FsImage file. Here is a thread about a similar issue: http://www.nabble.com/Hadoop-0.17.1-%3D%3E-EOFException-reading-FSEdits-file,-what-causes-this---how-to-prevent--td21440922.html Rasit 2009/2/5 Rajshekar rajasheka...@excelindia.com: Name naode is localhost with an ip address.Now I checked when i give /bin/hadoop namenode i am getting error r...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1# bin/hadoop namenode 09/02/05 13:27:43 INFO dfs.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = excel-desktop/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.17.2.1 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008 / 09/02/05 13:27:43 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 09/02/05 13:27:43 INFO dfs.NameNode: Namenode up at: localhost/127.0.0.1:9000 09/02/05 13:27:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 09/02/05 13:27:43 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 09/02/05 13:27:43 INFO fs.FSNamesystem: fsOwner=root,root 09/02/05 13:27:43 INFO fs.FSNamesystem: supergroup=supergroup 09/02/05 13:27:43 INFO fs.FSNamesystem: isPermissionEnabled=true 09/02/05 13:27:44 INFO ipc.Server: Stopping server on 9000 09/02/05 13:27:44 ERROR dfs.NameNode: java.io.EOFException at java.io.RandomAccessFile.readInt(RandomAccessFile.java:776) at org.apache.hadoop.dfs.FSImage.isConversionNeeded(FSImage.java:488) at org.apache.hadoop.dfs.Storage$StorageDirectory.analyzeStorage(Storage.java:283) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:149) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855) 09/02/05 13:27:44 INFO dfs.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at excel-desktop/127.0.1.1 / Rajshekar Sagar Naik-3 wrote: where is the namenode running ? localhost or some other host -Sagar Rajshekar wrote: Hello, I am new to Hadoop and I jus installed on Ubuntu 8.0.4 LTS as per guidance of a web site. I tested it and found working fine. I tried to copy a file but it is giving some error pls help me out had...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1$ bin/hadoop jar hadoop-0.17.2.1-examples.jar wordcount /home/hadoop/Download\ URLs.txt download-output 09/02/02 11:18:59 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 1 time(s). 09/02/02 11:19:00 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 2 time(s). 09/02/02 11:19:01 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 3 time(s). 09/02/02 11:19:02 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 4 time(s). 09/02/02 11:19:04 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 5 time(s). 
09/02/02 11:19:05 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 6 time(s). 09/02/02 11:19:06 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 7 time(s). 09/02/02 11:19:07 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 8 time(s). 09/02/02 11:19:08 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 9 time(s). 09/02/02 11:19:09 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 10 time(s). java.lang.RuntimeException: java.net.ConnectException: Connection refused at org.apache.hadoop.mapred.JobConf.getWorkingDirecto ry(JobConf.java:356) at org.apache.hadoop.mapred.FileInputFormat.setInputP aths(FileInputFormat.java:331) at org.apache.hadoop.mapred.FileInputFormat.setInputP aths(FileInputFormat.java:304) at org.apache.hadoop.examples.WordCount.run(WordCount .java:146) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.j ava:65) at
Re: Regarding Hadoop multi cluster set-up
Ian, here is a list under Setting up Hadoop on a single node Basic Configuration Jobtracker and Namenode settings Maybe it's what you're looking for. Cheers, Rasit 2009/2/4 Ian Soboroff ian.sobor...@nist.gov: I would love to see someplace a complete list of the ports that the various Hadoop daemons expect to have open. Does anyone have that? Ian On Feb 4, 2009, at 1:16 PM, shefali pawar wrote: Hi, I will have to check. I can do that tomorrow in college. But if that is the case what should i do? Should i change the port number and try again? Shefali On Wed, 04 Feb 2009 S D wrote : Shefali, Is your firewall blocking port 54310 on the master? John On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar shefal...@rediffmail.comwrote: Hi, I am trying to set-up a two node cluster using Hadoop0.19.0, with 1 master(which should also work as a slave) and 1 slave node. But while running bin/start-dfs.sh the datanode is not starting on the slave. I had read the previous mails on the list, but nothing seems to be working in this case. I am getting the following error in the hadoop-root-datanode-slave log file while running the command bin/start-dfs.sh = 2009-02-03 13:00:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = slave/172.16.0.32 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 0 time(s). 2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 1 time(s). 2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 2 time(s). 2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 3 time(s). 2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 4 time(s). 2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 5 time(s). 2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 6 time(s). 2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 7 time(s). 2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 8 time(s). 2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/172.16.0.46:54310. Already tried 9 time(s). 
2009-02-03 13:00:37,738 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to master/172.16.0.46:54310 failed on local exception: No route to host at org.apache.hadoop.ipc.Client.call(Client.java:699) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy4.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343) at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:205) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284) Caused by: java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299) at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176) at org.apache.hadoop.ipc.Client.getConnection(Client.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:685) ... 12 more 2009-02-03 13:00:37,739 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: /
Re: Not able to copy a file to HDFS after installing
Rajshekar, I have also threads for this ;) http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200803.mbox/%3cpine.lnx.4.64.0803132200480.5...@localhost.localdomain%3e http://www.mail-archive.com/hadoop-...@lucene.apache.org/msg03226.html Please try the following: - Give local filepath for jar - Give absolute path, not relative to the hadoop/bin - HADOOP_HOME env. variable should be correctly set. Hope this helps, Rasit 2009/2/6 Rajshekar rajasheka...@excelindia.com: Hi Thanks Rasi, From Yest evening I am able to start Namenode. I did few changed in hadoop-site.xml. it working now, but the new problem is I am not able to do map/reduce jobs using .jar files. it is giving following error had...@excel-desktop:/usr/local/hadoop$ bin/hadoop jar hadoop-0.19.0-examples.jar wordcount gutenberg gutenberg-output java.io.IOException: Error opening job jar: hadoop-0.19.0-examples.jar at org.apache.hadoop.util.RunJar.main(RunJar.java:90) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220) Caused by: java.util.zip.ZipException: error in opening zip file at java.util.zip.ZipFile.open(Native Method) at java.util.zip.ZipFile.init(ZipFile.java:131) at java.util.jar.JarFile.init(JarFile.java:150) at java.util.jar.JarFile.init(JarFile.java:87) at org.apache.hadoop.util.RunJar.main(RunJar.java:88) ... 4 more Pls help me out Rasit OZDAS wrote: Rajshekar, It seems that your namenode isn't able to load FsImage file. Here is a thread about a similar issue: http://www.nabble.com/Hadoop-0.17.1-%3D%3E-EOFException-reading-FSEdits-file,-what-causes-this---how-to-prevent--td21440922.html Rasit 2009/2/5 Rajshekar rajasheka...@excelindia.com: Name naode is localhost with an ip address.Now I checked when i give /bin/hadoop namenode i am getting error r...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1# bin/hadoop namenode 09/02/05 13:27:43 INFO dfs.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = excel-desktop/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.17.2.1 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008 / 09/02/05 13:27:43 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 09/02/05 13:27:43 INFO dfs.NameNode: Namenode up at: localhost/127.0.0.1:9000 09/02/05 13:27:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 09/02/05 13:27:43 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 09/02/05 13:27:43 INFO fs.FSNamesystem: fsOwner=root,root 09/02/05 13:27:43 INFO fs.FSNamesystem: supergroup=supergroup 09/02/05 13:27:43 INFO fs.FSNamesystem: isPermissionEnabled=true 09/02/05 13:27:44 INFO ipc.Server: Stopping server on 9000 09/02/05 13:27:44 ERROR dfs.NameNode: java.io.EOFException at java.io.RandomAccessFile.readInt(RandomAccessFile.java:776) at org.apache.hadoop.dfs.FSImage.isConversionNeeded(FSImage.java:488) at org.apache.hadoop.dfs.Storage$StorageDirectory.analyzeStorage(Storage.java:283) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:149) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274) at 
org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855) 09/02/05 13:27:44 INFO dfs.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at excel-desktop/127.0.1.1 / Rajshekar Sagar Naik-3 wrote: where is the namenode running ? localhost or some other host -Sagar Rajshekar wrote: Hello, I am new to Hadoop and I jus installed on Ubuntu 8.0.4 LTS as per guidance of a web site. I tested it and found working fine. I tried to copy a file but it is giving some error pls help me out had...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1$ bin/hadoop jar hadoop-0.17.2.1-examples.jar wordcount /home/hadoop/Download\ URLs.txt download
Re: Value-Only Reduce Output
I tried it myself; it doesn't work. I've also tried the stream.map.output.field.separator and map.output.key.field.separator parameters for this purpose, and they don't work either. When Hadoop sees an empty string, it falls back to the default tab character instead. Rasit 2009/2/4 jason hadoop jason.had...@gmail.com Ooops, you are using streaming, and I am not familiar with it. As a terrible hack, you could set mapred.textoutputformat.separator to the empty string in your configuration. On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop jason.had...@gmail.com wrote: If you are using the standard TextOutputFormat, and the output collector is passed a null for the value, there will not be a trailing tab character added to the output line. output.collect( key, null ); will give you the behavior you are looking for if your configuration is as I expect. On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl j...@yelp.com wrote: Hello, I'm interested in a map-reduce flow where I output only values (no keys) in my reduce step. For example, imagine the canonical word-counting program where I'd like my output to be an unlabeled histogram of counts instead of (word, count) pairs. I'm using HadoopStreaming (specifically, I'm using the dumbo module to run my python scripts). When I simulate the map reduce using pipes and sort in bash, it works fine. However, in Hadoop, if I output a value with no tabs, Hadoop appends a trailing \t, apparently interpreting my output as a (value, ) KV pair. I'd like to avoid outputting this trailing tab if possible. Is there a command line option that could be used to effect this? More generally, is there something wrong with outputting arbitrary strings, instead of key-value pairs, in your reduce step? -- M. Raşit ÖZDAŞ
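(For the non-streaming case, a sketch of what Jason describes: in versions where TextOutputFormat treats null/NullWritable specially, emitting NullWritable as the value writes the key alone with no trailing tab. The word-count-style types are illustrative; for streaming itself, as noted above, there does not seem to be a clean way in this version.)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits only the count, one per line; with a NullWritable value,
// TextOutputFormat writes no separator after the key.
public class ValueOnlyReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, NullWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(new Text(Integer.toString(sum)), NullWritable.get());
  }
}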
Re: Hadoop FS Shell - command overwrite capability
John, I also couldn't find a way to do it from the console. Maybe you already know this and prefer not to use it, but the API solves the problem: FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, Path src, Path dst). If you have to use the console, a longer solution is to create a jar for this and invoke it just as the hadoop script in the bin directory invokes the FsShell class. I think the file system shell also needs some improvement here; I wonder if it's being considered by the core developers. Hope this helps, Rasit 2009/2/4 S D sd.codewarr...@gmail.com: I'm using the Hadoop FS commands to move files from my local machine into the Hadoop dfs. I'd like a way to force a write to the dfs even if a file of the same name exists. Ideally I'd like to use a -force switch or some such; e.g., hadoop dfs -copyFromLocal -force adirectory s3n://wholeinthebucket/ Is there a way to do this or does anyone know if this is in the future Hadoop plans? Thanks John SD -- M. Raşit ÖZDAŞ
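(A small sketch of the API route mentioned above; the argument order follows FileSystem.copyFromLocalFile(delSrc, overwrite, src, dst), and the paths come from the command line. Wrapped in a jar, this is essentially the console workaround Rasit describes.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ForcedCopy {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // delSrc = false: keep the local file; overwrite = true: replace an existing target.
    fs.copyFromLocalFile(false, true, new Path(args[0]), new Path(args[1]));
  }
}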
Re: How to use DBInputFormat?
Amandeep, SQL command not properly ended I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try Rasit 2009/2/4 Amandeep Khurana ama...@gmail.com: The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Kevin I couldnt get it work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243) at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413) at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236) at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221) Here's my code: public class LoadTable1 extends Configured implements Tool { // data destination on hdfs private static final String CONTRACT_OUTPUT_PATH = contract_table; // The JDBC connection URL and driver implementation class private static final String CONNECT_URL = jdbc:oracle:thin:@dbhost :1521:PSEDEV; private static final String DB_USER = user; private static final String DB_PWD = pass; private static final String DATABASE_DRIVER_CLASS = oracle.jdbc.driver.OracleDriver; private static 
final String CONTRACT_INPUT_TABLE = "OSE_EPR_CONTRACT"; private static final String [] CONTRACT_INPUT_TABLE_FIELDS = { "PORTFOLIO_NUMBER", "CONTRACT_NUMBER" }; private static final String ORDER_CONTRACT_BY_COL = "CONTRACT_NUMBER"; static class ose_epr_contract implements Writable, DBWritable { String CONTRACT_NUMBER; public void readFields(DataInput in) throws IOException { this.CONTRACT_NUMBER = Text.readString(in); } public void write(DataOutput out) throws IOException { Text.writeString(out, this.CONTRACT_NUMBER); } public void readFields(ResultSet in_set) throws SQLException { this.CONTRACT_NUMBER = in_set.getString(1); } @Override public void write(PreparedStatement prep_st) throws SQLException { // TODO Auto-generated method stub } } public static class LoadMapper extends MapReduceBase implements Mapper<LongWritable, ose_epr_contract, Text,
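For anyone hitting the same ORA-00933 with DBInputFormat against Oracle, one commonly cited cause is the paging clause (LIMIT ... OFFSET ...) that the stock DBInputFormat appends to its generated SELECT, which Oracle does not accept; that is a hedged guess for this case, not a confirmed diagnosis, but it is worth checking before suspecting the connection settings. Below is a minimal sketch of the old-API job setup. The table, column and class names (ose_epr_contract, LoadMapper, OSE_EPR_CONTRACT, PORTFOLIO_NUMBER, CONTRACT_NUMBER) are taken from the posted code; the connection details and output path are placeholders.

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.db.DBConfiguration;
  import org.apache.hadoop.mapred.lib.db.DBInputFormat;

  public class LoadTable1Sketch {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(LoadTable1Sketch.class);
      conf.setJobName("contract-import");

      // Read records from the database instead of from files.
      conf.setInputFormat(DBInputFormat.class);

      // JDBC driver, connection URL and credentials (placeholders).
      DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
          "jdbc:oracle:thin:@dbhost:1521:PSEDEV", "user", "pass");

      // Table, optional WHERE conditions, ORDER BY column, selected fields.
      // ose_epr_contract is the DBWritable class from the message above.
      DBInputFormat.setInput(conf, ose_epr_contract.class, "OSE_EPR_CONTRACT",
          null /* conditions */, "CONTRACT_NUMBER",
          new String[] { "PORTFOLIO_NUMBER", "CONTRACT_NUMBER" });

      // LoadMapper is also the class from the message above; adjust the
      // output key/value classes to whatever it actually emits.
      conf.setMapperClass(LoadMapper.class);
      conf.setNumReduceTasks(0);                // map-only import
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(NullWritable.class);
      FileOutputFormat.setOutputPath(conf, new Path("contract_table"));

      JobClient.runJob(conf);
    }
  }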
A record version mismatch occured. Expecting v6, found v32
Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ
Re: A record version mismatch occured. Expecting v6, found v32
I tried to use SequenceFile.Writer to convert my binaries into Sequence Files, I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]) , wrote it to a file with SequenceFile.Writer, with parameters NullWritable as key, BytesWritable as value. But the content changes, (I can see that by converting to Base64) Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ... Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ... Thanks for any points.. Rasit 2009/2/2 Rasit OZDAS rasitoz...@gmail.com Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
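A minimal sketch of the conversion described above (local binary file to a SequenceFile with a NullWritable key and a BytesWritable value); the file paths are placeholders and error handling is omitted. One thing worth noting along the way: FileInputStream.read(byte[]) is not guaranteed to fill the whole buffer in a single call, so the read is looped here.

  import java.io.File;
  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.SequenceFile;

  public class BinaryToSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Read the whole local binary file into memory (fine for small files).
      File in = new File("input.bin");                    // placeholder path
      byte[] bytes = new byte[(int) in.length()];
      FileInputStream fis = new FileInputStream(in);
      int off = 0;
      while (off < bytes.length) {
        int n = fis.read(bytes, off, bytes.length - off); // read() may return fewer bytes
        if (n < 0) break;
        off += n;
      }
      fis.close();

      // Write one record: NullWritable key, BytesWritable value.
      SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
          new Path("input.seq"),                          // placeholder path
          NullWritable.class, BytesWritable.class);
      writer.append(NullWritable.get(), new BytesWritable(bytes));
      writer.close();
    }
  }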
Re: A record version mismatch occured. Expecting v6, found v32
Thanks, Tom The problem that content was different was that I converted one sample to Base64 byte-by-byte, and converted the other from-byte-array to-byte-array (Strange, that they cause different outputs). Thanks for good points. Rasit 2009/2/2 Tom White t...@cloudera.com The SequenceFile format is described here: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html . The format of the keys and values depends on the serialization classes used. For example, BytesWritable writes out the length of its byte array followed by the actual bytes in the array (see the write() method in BytesWritable). Hope this helps. Tom On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote: I tried to use SequenceFile.Writer to convert my binaries into Sequence Files, I read the binary data with FileInputStream, getting all bytes with reader.read(byte[]) , wrote it to a file with SequenceFile.Writer, with parameters NullWritable as key, BytesWritable as value. But the content changes, (I can see that by converting to Base64) Binary File: 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65 103 54 81 65 65 97 81 65 65 65 81 ... Sequence File: 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67 69 77 65 52 80 86 67 65 73 68 114 ... Thanks for any points.. Rasit 2009/2/2 Rasit OZDAS rasitoz...@gmail.com Hi, I tried to use SequenceFileInputFormat, for this I appended SEQ as first bytes of my binary files (with hex editor). but I get this exception: A record version mismatch occured. Expecting v6, found v32 at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412) at org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.Child.main(Child.java:155) What could it be? Is it not enough just to add SEQ to binary files? I use Hadoop v.0.19.0 . Thanks in advance.. Rasit different *version* of *Hadoop* between your server and your client. -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
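Following up on Tom's point about BytesWritable writing its length before the bytes: when such a file is read back, the array returned by getBytes() is a backing buffer that may be longer than the record, so only the first getLength() bytes should be compared with the original binary. A short read-back sketch, with the path again a placeholder:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.SequenceFile;

  public class ReadSequenceFileValue {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path("input.seq"), conf); // placeholder path
      NullWritable key = NullWritable.get();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {
        // getBytes() returns the (possibly padded) backing array;
        // the record itself is only the first getLength() bytes.
        byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());
        System.out.println("record of " + payload.length + " bytes");
      }
      reader.close();
    }
  }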
Re: Is Hadoop Suitable for me?
Thanks for responses, the problem is solved :) I'll be forwarding the thread to my colleagues. 2009/1/29 nitesh bhatia niteshbhatia...@gmail.com HDFS is a file system for distributed storage typically for distributed computing scenerio over hadoop. For office purpose you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for AmazonS3, if the data is really authentic. For opensource solution related to SAN, you can go with any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS + zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac Server + XSan. --nitesh On Wed, Jan 28, 2009 at 10:20 PM, Simon sim...@bigair.net.au wrote: But we are looking for an open source solution. If I do decide to implement this for the office storage, what problems will I run into? -Original Message- From: Dmitry Pushkarev [mailto:u...@stanford.edu] Sent: Thursday, 29 January 2009 5:15 PM To: core-user@hadoop.apache.org Cc: sim...@bigair.net.au Subject: RE: Is Hadoop Suitable for me? Definitely not, You should be looking at expandable Ethernet storage that can be extended by connecting additional SAS arrays. (like dell powervault and similar things from other companies) 600Mb is just 6 seconds over gigabit network... --- Dmitry Pushkarev -Original Message- From: Simon [mailto:sim...@bigair.net.au] Sent: Wednesday, January 28, 2009 10:02 PM To: core-user@hadoop.apache.org Subject: Is Hadoop Suitable for me? Hi Hadoop Users, I am trying to build a storage system for the office of about 20-30 users which will store everything. From normal everyday documents to computer configuration files to big files (600mb) which are generated every hour. Is Hadoop suitable for this kind of environment? Regards, Simon No virus found in this incoming message. Checked by AVG - http://www.avg.com Version: 8.0.176 / Virus Database: 270.10.15/1921 - Release Date: 1/28/2009 6:37 AM -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun -- M. Raşit ÖZDAŞ
Re: Is Hadoop Suitable for me?
Oh, I can't believe, my problem was the same, I thought last one was an answer to my thread. Who cares, the problem is solved, thanks! 2009/1/29 Rasit OZDAS rasitoz...@gmail.com Thanks for responses, the problem is solved :) I'll be forwarding the thread to my colleagues. 2009/1/29 nitesh bhatia niteshbhatia...@gmail.com HDFS is a file system for distributed storage typically for distributed computing scenerio over hadoop. For office purpose you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for AmazonS3, if the data is really authentic. For opensource solution related to SAN, you can go with any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS + zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac Server + XSan. --nitesh On Wed, Jan 28, 2009 at 10:20 PM, Simon sim...@bigair.net.au wrote: But we are looking for an open source solution. If I do decide to implement this for the office storage, what problems will I run into? -Original Message- From: Dmitry Pushkarev [mailto:u...@stanford.edu] Sent: Thursday, 29 January 2009 5:15 PM To: core-user@hadoop.apache.org Cc: sim...@bigair.net.au Subject: RE: Is Hadoop Suitable for me? Definitely not, You should be looking at expandable Ethernet storage that can be extended by connecting additional SAS arrays. (like dell powervault and similar things from other companies) 600Mb is just 6 seconds over gigabit network... --- Dmitry Pushkarev -Original Message- From: Simon [mailto:sim...@bigair.net.au] Sent: Wednesday, January 28, 2009 10:02 PM To: core-user@hadoop.apache.org Subject: Is Hadoop Suitable for me? Hi Hadoop Users, I am trying to build a storage system for the office of about 20-30 users which will store everything. From normal everyday documents to computer configuration files to big files (600mb) which are generated every hour. Is Hadoop suitable for this kind of environment? Regards, Simon No virus found in this incoming message. Checked by AVG - http://www.avg.com Version: 8.0.176 / Virus Database: 270.10.15/1921 - Release Date: 1/28/2009 6:37 AM -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: Using HDFS for common purpose
Today Nitesh has given an answer to a similar thread, that was what I wanted to learn. I'm writing it here to help others having same question. HDFS is a file system for distributed storage typically for distributed computing scenerio over hadoop. For office purpose you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for AmazonS3, if the data is really authentic. For opensource solution related to SAN, you can go with any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS + zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac Server + XSan. --nitesh Thanks, Rasit 2009/1/28 Rasit OZDAS rasitoz...@gmail.com Thanks for responses, Sorry, I made a mistake, it's actually not a db what I wanted. We need a simple storage for files. Only get and put commands are enough (no queries needed). We don't even need append, chmod, etc. Probably from a thread on this list, I came across a link to a KFS-HDFS comparison: http://deliberateambiguity.typepad.com/blog/2007/10/advantages-of-k.htmlhttps://webmail.uzay.tubitak.gov.tr/owa/redir.aspx?C=55b317b7ca7548209f9929c643fcbf93URL=http%3a%2f%2fdeliberateambiguity.typepad.com%2fblog%2f2007%2f10%2fadvantages-of-k.html It's good, that KFS is written in C++, but handling errors in C++ is usually more difficult. I need your opinion about which one could best fit. Thanks, Rasit 2009/1/27 Jim Twensky jim.twen...@gmail.com You may also want to have a look at this to reach a decision based on your needs: http://www.swaroopch.com/notes/Distributed_Storage_Systems Jim On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky jim.twen...@gmail.com wrote: Rasit, What kind of data will you be storing on Hbase or directly on HDFS? Do you aim to use it as a data source to do some key/value lookups for small strings/numbers or do you want to store larger files labeled with some sort of a key and retrieve them during a map reduce run? Jim On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray jl...@streamy.com wrote: Perhaps what you are looking for is HBase? http://hbase.org HBase is a column-oriented, distributed store that sits on top of HDFS and provides random access. JG -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Tuesday, January 27, 2009 1:20 AM To: core-user@hadoop.apache.org Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr ; hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr ; hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr Subject: Using HDFS for common purpose Hi, I wanted to ask, if HDFS is a good solution just as a distributed db (no running jobs, only get and put commands) A review says that HDFS is not designed for low latency and besides, it's implemented in Java. Do these disadvantages prevent us using it? Or could somebody suggest a better (faster) one? Thanks in advance.. Rasit -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
Re: Number of records in a MapFile
Do you mean without scanning all the files line by line? I know little about Hadoop's implementation, but as a programmer I presume it's not possible without a complete scan. But I can suggest a work-around (a counting sketch follows below): - Compute the number of records manually before putting the file into HDFS. - Append the computed number to the filename. - Modify the InputFormat's record reader so that it attaches that number to the key of every record it hands to the map. Hope this helps, Rasit 2009/1/27 Andy Liu andyliu1...@gmail.com Is there a way to programmatically get the number of records in a MapFile without doing a complete scan? -- M. Raşit ÖZDAŞ
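For reference, the complete scan itself is only a few lines; the sketch below could be run before uploading, with the resulting count then appended to the file name as in the work-around above. The MapFile path is a placeholder and the key/value classes are assumed to be Text/Text; they must match whatever the MapFile was actually written with.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class CountMapFileRecords {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // A MapFile is a directory containing "data" and "index" files.
      MapFile.Reader reader = new MapFile.Reader(fs, "/user/foo/mymapfile", conf); // placeholder
      Text key = new Text();     // assumed key class
      Text value = new Text();   // assumed value class
      long count = 0;
      while (reader.next(key, value)) {
        count++;
      }
      reader.close();
      System.out.println("records: " + count);
    }
  }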
Re: Netbeans/Eclipse plugin
Both DFS viewer and job submission work on eclipse v. 3.3.2. I've given up using Ganymede, unfortunately.. 2009/1/26 Aaron Kimball aa...@cloudera.com The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/) currently is inoperable. The DFS viewer works, but the job submission code is broken. - Aaron On Sun, Jan 25, 2009 at 9:07 PM, Amit k. Saha amitsaha...@gmail.com wrote: On Sun, Jan 25, 2009 at 9:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar vinaykat...@gmail.com wrote: Any one knows Netbeans or Eclipse plugin for Hadoop Map -Reduce job. I want to make plugin for netbeans http://vinayakkatkar.wordpress.com -- Vinayak Katkar Sun Campus Ambassador Sun Microsytems,India COEP There is an ecplipse plugin. http://www.alphaworks.ibm.com/tech/mapreducetools Seems like some work is being done on netbeans https://nbhadoop.dev.java.net/ I started this project. But well, its caught up in the requirements gathering phase. @ Vinayak, Lets take this offline and discuss. What do you think? Thanks, Amit The world needs more netbeans love. Definitely :-) -- Amit Kumar Saha http://amitksaha.blogspot.com http://amitsaha.in.googlepages.com/ *Bangalore Open Java Users Group*:http:www.bojug.in -- M. Raşit ÖZDAŞ
Re: Using HDFS for common purpose
Thanks for responses, Sorry, I made a mistake, it's actually not a db what I wanted. We need a simple storage for files. Only get and put commands are enough (no queries needed). We don't even need append, chmod, etc. Probably from a thread on this list, I came across a link to a KFS-HDFS comparison: http://deliberateambiguity.typepad.com/blog/2007/10/advantages-of-k.htmlhttps://webmail.uzay.tubitak.gov.tr/owa/redir.aspx?C=55b317b7ca7548209f9929c643fcbf93URL=http%3a%2f%2fdeliberateambiguity.typepad.com%2fblog%2f2007%2f10%2fadvantages-of-k.html It's good, that KFS is written in C++, but handling errors in C++ is usually more difficult. I need your opinion about which one could best fit. Thanks, Rasit 2009/1/27 Jim Twensky jim.twen...@gmail.com You may also want to have a look at this to reach a decision based on your needs: http://www.swaroopch.com/notes/Distributed_Storage_Systems Jim On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky jim.twen...@gmail.com wrote: Rasit, What kind of data will you be storing on Hbase or directly on HDFS? Do you aim to use it as a data source to do some key/value lookups for small strings/numbers or do you want to store larger files labeled with some sort of a key and retrieve them during a map reduce run? Jim On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray jl...@streamy.com wrote: Perhaps what you are looking for is HBase? http://hbase.org HBase is a column-oriented, distributed store that sits on top of HDFS and provides random access. JG -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Tuesday, January 27, 2009 1:20 AM To: core-user@hadoop.apache.org Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr; hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr; hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr Subject: Using HDFS for common purpose Hi, I wanted to ask, if HDFS is a good solution just as a distributed db (no running jobs, only get and put commands) A review says that HDFS is not designed for low latency and besides, it's implemented in Java. Do these disadvantages prevent us using it? Or could somebody suggest a better (faster) one? Thanks in advance.. Rasit -- M. Raşit ÖZDAŞ
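Since only get and put are needed, the client side can be little more than two FileSystem calls; a minimal sketch, with every path a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SimpleHdfsStore {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up hadoop-site.xml on the classpath
      FileSystem fs = FileSystem.get(conf);

      // "put": copy a local file into HDFS
      fs.copyFromLocalFile(new Path("/tmp/report.bin"), new Path("/store/report.bin"));

      // "get": copy it back out
      fs.copyToLocalFile(new Path("/store/report.bin"), new Path("/tmp/report.copy.bin"));
    }
  }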
Using HDFS for common purpose
Hi, I wanted to ask if HDFS is a good solution just as a distributed db (no running jobs, only get and put commands). A review says that HDFS is not designed for low latency and, besides, it's implemented in Java. Do these disadvantages prevent us from using it? Or could somebody suggest a better (faster) alternative? Thanks in advance.. Rasit
Re: Where are the meta data on HDFS ?
Hi Tien, Configuration config = new Configuration(true); config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml")); FileSystem fileSys = FileSystem.get(config); BlockLocation[] locations = fileSys.getFileBlockLocations(... I copied some lines from my code; it may also help if you prefer using the API. It has other useful features (methods) as well. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html 2009/1/24 tienduc_dinh tienduc_d...@yahoo.com that's what I needed! Thank you so much. -- View this message in context: http://www.nabble.com/Where-are-the-meta-data-on-HDFS---tp21634677p21644206.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- M. Raşit ÖZDAŞ
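For completeness, here is a version of that snippet with the truncated call filled in; the file path is a placeholder, and getFileBlockLocations takes the file's FileStatus plus a byte range:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
      Configuration config = new Configuration(true);
      config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));
      FileSystem fileSys = FileSystem.get(config);

      Path file = new Path("/user/foo/somefile");   // placeholder
      FileStatus status = fileSys.getFileStatus(file);
      BlockLocation[] locations =
          fileSys.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation loc : locations) {
        // Each entry lists the datanodes holding one block of the file.
        System.out.println(loc.getOffset() + " " + loc.getLength() + " "
            + java.util.Arrays.toString(loc.getHosts()));
      }
    }
  }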
Re: Null Pointer with Pattern file
Hi, Try to use: conf.setJarByClass(EchoOche.class); // conf is the JobConf instance of your example. Hope this helps, Rasit 2009/1/20 Shyam Sarkar shyam.s.sar...@gmail.com Hi, I was trying to run Hadoop wordcount version 2 example under Cygwin. I tried without pattern.txt file -- It works fine. I tried with pattern.txt file to skip some patterns, I get NULL POINTER exception as follows:: 09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4 09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4 09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1 09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100 09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720 09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680 09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001 java.lang.NullPointerException at org.myorg.WordCount$Map.configure(WordCount.java:39) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at org.myorg.WordCount.run(WordCount.java:114) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.myorg.WordCount.main(WordCount.java:119) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Please tell me what should I do. Thanks, shyam.s.sar...@gmail.com -- M. Raşit ÖZDAŞ
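The "No job jar file set" warning in the output above suggests the job jar was never attached, which is what the setJarByClass suggestion addresses. Applied to the WordCount driver from the tutorial (EchoOche is presumably the class name from Rasit's own example), it would look roughly like the sketch below; the mapper, reducer and the -skip pattern handling are left out, and the WordCountDriver name is made up for illustration.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      // Passing the driver class lets Hadoop locate the containing jar,
      // which clears the "No job jar file set" warning.
      JobConf conf = new JobConf(WordCountDriver.class);
      conf.setJarByClass(WordCountDriver.class);   // equivalent explicit call; either is enough
      conf.setJobName("wordcount");
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      // mapper/reducer/pattern-file setup as in the tutorial ...
      JobClient.runJob(conf);
    }
  }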
Re: Hadoop 0.17.1 = EOFException reading FSEdits file, what causes this? how to prevent?
I would prefer catching the EOFException in my own code, assuming you are happy with the output before exception occurs. Hope this helps, Rasit 2009/1/16 Konstantin Shvachko s...@yahoo-inc.com Joe, It looks like you edits file is corrupted or truncated. Most probably the last modification was not written to it, when the name-node was turned off. This may happen if the node crashes depending on the underlying local file system I guess. Here are some options for you to consider: - try an alternative replica of the image directory if you had one. - try to edit the edits file if you know the internal format. - try to modify local copy of your name-node code, which should catch EOFException and ignore it. - Use a checkpointed image if you can afford to loose latest modifications to the fs. - Formatting of cause is the last resort since you loose everything. Thanks, --Konstantin Joe Montanez wrote: Hi: I'm using Hadoop 0.17.1 and I'm encountering EOFException reading the FSEdits file. I don't have a clear understanding what is causing this and how to prevent this. Has anyone seen this and can advise? Thanks in advance, Joe 2009-01-12 22:51:45,573 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:180) at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106) at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90) at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:599) at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766) at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640) at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223) at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80) at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274) at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255) at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178) at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164) at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848) at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857) 2009-01-12 22:51:45,574 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG: -- M. Raşit ÖZDAŞ
Re: Merging reducer outputs into a single part-00000 file
Jim, As far as I know, there is no operation done after Reducer. At the first look, the situation reminds me of same keys for all the tasks, This can be the result of one of following cases: - input format reads same keys for every task. - mapper collects every incoming key-value pairs under same key. - reducer makes the same. But if you are a little experienced, you already know these. Ordered list means one final file, or am I missing something? Hope this helps, Rasit 2009/1/11 Jim Twensky jim.twen...@gmail.com: Hello, The original map-reduce paper states: After successful completion, the output of the map-reduce execution is available in the R output files (one per reduce task, with file names as specified by the user). However, when using Hadoop's TextOutputFormat, all the reducer outputs are combined in a single file called part-0. I was wondering how and when this merging process is done. When the reducer calls output.collect(key,value), is this record written to a local temporary output file in the reducer's disk and then these local files (a total of R) are later merged into one single file with a final thread or is it directly written to the final output file (part-0)? I am asking this because I'd like to get an ordered sample of the final output data, ie. one record per every 1000 records or something similar and I don't want to run a serial process that iterates on the final output file. Thanks, Jim -- M. Raşit ÖZDAŞ
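One way to get roughly one record out of every 1000 without post-processing the part files is to sample inside the reducer itself; a minimal old-API sketch follows, assuming Text keys and values (the types and the interval are illustrative). Note that each reduce task samples its own sorted partition, so the output is ordered within each part file rather than globally.

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class SamplingReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    private static final long INTERVAL = 1000;   // keep 1 record out of every 1000
    private long seen = 0;                       // persists across reduce() calls in this task

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        Text value = values.next();
        if (seen % INTERVAL == 0) {
          output.collect(key, value);            // emit every 1000th record
        }
        seen++;
      }
    }
  }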
Re: Dynamic Node Removal and Addition
Hi Alyssa, http://markmail.org/message/jyo4wssouzlb4olm#query:%22Decommission%20of%20datanodes%22+page:1+mid:p2krkt6ebysrsrpl+state:results as pointed here, decommission (removal) of datanodes was not an easy job at the date of version 0.12. I strongly think it's still not easy. As far as I know, one node should be used both as datanode and tasktracker. So, performance loss will be possibly far greater than performance gain of your design. My solution would be using them still as datanodes, and changing TaskTracker code a little bit, so that they won't be used for jobs. Code manipulation here should be easy, as I assume. Hope this helps, Rasit 2009/1/12 Hargraves, Alyssa aly...@wpi.edu: Hello everyone, I have a question and was hoping some on the mailinglist could offer some pointers. I'm working on a project with another student and for part of this project we are trying to create something that will allow nodes to be added and removed from the hadoop cluster at will. The goal is to have the nodes run a program that gives the user the freedom to add or remove themselves from the cluster to take advantage of a workstation when the user leaves (or if they'd like it running anyway when they're at the PC). This would be on Windows computers of various different OSes. From what we can find, hadoop does not already support this feature, but it does seem to support dynamically adding nodes and removing nodes in other ways. For example, to add a node, one would have to make sure hadoop is set up on the PC along with cygwin, Java, and ssh, but after that initial setup it's just a matter of adding the PC to the conf/slaves file, making sure the node is not listed in the exclude file, and running the start datanode and start tasktracker commands from the node you are adding (basically described in FAQ item 25). To remove a node, it seems to be just a matter of adding it to dfs.hosts.exclude and refreshing the list of nodes (described in hadoop FAQ 17). Our question is whether or not a simple interface for this already exists, and whether or not anyone sees any potential flaws with how we are planning to accomplish these tasks. From our research we were not able to find anything that already exists for this purpose, but we find it surprising that an interface for this would not already exist. We welcome any comments, recommendations, and insights anyone might have for accomplishing this task. Thank you, Alyssa Hargraves Patrick Crane WPI Class of 2009 -- M. Raşit ÖZDAŞ
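For reference, the decommissioning route mentioned above (FAQ items 17 and 25) comes down to a small amount of configuration; the sketch below assumes an 0.18/0.19-era setup, and the exclude-file path is a placeholder.

  <!-- hadoop-site.xml on the namenode -->
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/path/to/conf/exclude-hosts</value>   <!-- placeholder path; one hostname per line -->
  </property>

  # add the node to be removed to the exclude file, then tell the namenode to re-read it:
  bin/hadoop dfsadmin -refreshNodes

  # adding a node back: remove it from the exclude file, refresh again, then on that node run:
  bin/hadoop-daemon.sh start datanode
  bin/hadoop-daemon.sh start tasktracker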