Re: Best practices on splitting an input line?

2009-02-12 Thread Rasit OZDAS
Hi, Andy

Your problem seems to be a general Java problem rather than a Hadoop one;
you may get better help in a Java forum.
String.split uses regular expressions, which you definitely don't need here.
I would write my own split function, without regular expressions.

This link may help to better understand underlying operations:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

Also, there is a StringTokenizer constructor that can return the delimiters as well:
StringTokenizer(String str, String delim, boolean returnDelims);
(I would write my own, though.)
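
For illustration, a minimal sketch of such a hand-rolled split that keeps
empty tokens between consecutive delimiters (the class and method names are
just examples, untested against your data):

import java.util.ArrayList;
import java.util.List;

public class SimpleSplit {
  // Split on a single delimiter character without regular expressions,
  // preserving empty tokens.
  public static List<String> split(String line, char delim) {
    List<String> tokens = new ArrayList<String>();
    int start = 0;
    for (int i = 0; i < line.length(); i++) {
      if (line.charAt(i) == delim) {
        tokens.add(line.substring(start, i)); // may be empty
        start = i + 1;
      }
    }
    tokens.add(line.substring(start)); // trailing token, possibly empty
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(split("a\tb\t\tc", '\t')); // prints [a, b, , c]
  }
}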

Rasit

2009/2/10 Andy Sautins andy.saut...@returnpath.net:


   I have a question.  I've dabbled with different ways of tokenizing an
 input file line for processing.  I've noticed in my somewhat limited
 tests that there seem to be some pretty reasonable performance
 differences between the different tokenizing methods.  For example, when
 splitting a line into tokens ( tab delimited in my case ), Scanner seems
 roughly to be the slowest, followed by String.split, with StringTokenizer
 being the fastest.  StringTokenizer, for my application, has the
 unfortunate characteristic of not returning blank tokens ( i.e., parsing
 "a,b,c,,d" would return the tokens a,b,c,d instead of a,b,c,<empty>,d ).
 The WordCount example uses StringTokenizer which makes sense to me,
 except I'm currently getting hung up on not returning blank tokens.  I
 did run across the com.Ostermiller.util StringTokenizer replacement that
 handles null/blank tokens
 (http://ostermiller.org/utils/StringTokenizer.html ) which seems
 possible to use, but it sure seems like someone else has solved this
 problem already better than I have.



   So, my question is, is there a best practice for splitting an input
 line especially when NULL tokens are expected ( i.e., two consecutive
 delimiter characters )?



   Any thoughts would be appreciated



   Thanks



   Andy





-- 
M. Raşit ÖZDAŞ


HDFS on non-identical nodes

2009-02-12 Thread Deepak
Hi,

We're running a Hadoop cluster on 4 nodes; our primary purpose is to
provide a distributed storage solution for internal applications here
at TellyTopia Inc.

Our cluster consists of non-identical nodes (one with 1TB, another two
with 3TB, and one more with 60GB). While copying data onto HDFS we
noticed that the node with 60GB of storage ran out of disk space, and
even the balancer couldn't balance because the cluster was stopped. Now
my questions are

1. Is Hadoop suitable for non-identical cluster nodes?
2. Is there any way to balance nodes automatically?
3. Why does the Hadoop cluster stop when one node runs out of disk?

Any further inputs are appreciated!

Cheers,
Deepak
TellyTopia Inc.


Re: stable version

2009-02-12 Thread Anum Ali
I am working on the Hadoop SVN version 0.21.0-dev and having some problems
running its examples from Eclipse.


It gives an error:

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at
javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone resolve it or give me some idea about it?



Thanks.

2009/2/12 Raghu Angadi rang...@yahoo-inc.com

 Vadim Zaliva wrote:

 The particular problem I am having is this one:

 https://issues.apache.org/jira/browse/HADOOP-2669

 I am observing it in version 0.19. Could anybody confirm that
 it has been fixed in 0.18, as Jira claims?

 I am wondering why the bug fix for this problem might have been committed
 to the 0.18 branch but not 0.19. If it was committed to both, then perhaps the
 problem was not completely solved and downgrading to 0.18 will not help
 me.


 If you read through the comments, you will see that the root cause was
 never found. The patch just fixes one of the suspects. If you are still
 seeing this, please file another jira and link it to HADOOP-2669.

 How easy is it for you to reproduce this? I guess one of the reasons for the
 incomplete diagnosis is that it is not simple to reproduce.

 Raghu.


  Vadim

 On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote:

 Yes, version 0.18.3 is the most stable one. It has additional patches,
 without unproven new functionality.

 2009/2/11 Owen O'Malley omal...@apache.org:

 On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote:

  Maybe version 0.18
 is better suited for production environment?

 Yahoo is mostly on 0.18.3 + some patches at this point.

 -- Owen



 --
 M. Raşit ÖZDAŞ





Re: Backing up HDFS?

2009-02-12 Thread Stefan Podkowinski
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote:

 The key here is to prioritize your data.  Impossible-to-replicate data gets
 backed up using whatever means necessary; hard-to-regenerate data is the next
 priority. Data that is easy to regenerate or OK to nuke doesn't get backed up.


I think that's good advice to start with when creating a backup strategy.
E.g. what we do at the moment is analyze huge volumes of access
logs: we import those logs into HDFS, create aggregates for
several metrics and finally store the results in sequence files using
block-level compression. It's kind of an intermediate format that can
be used for further analysis. Those files end up being pretty small
and are exported daily to storage and backed up. In case
HDFS goes to hell we can restore some raw log data from the servers
and only lose historical logs, which should not be a big deal.

I must also add that I really enjoy the great deal of optimization
opportunities that Hadoop gives you by letting you directly implement the
serialization strategies. You really get control over every bit and
byte that gets recorded. Same with compression. So you can make the
best trade-offs possible and finally store only the data you really need.


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

I am working on the Hadoop SVN version 0.21.0-dev and having some problems
running its examples from Eclipse.


It gives an error:

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at
javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone resolve it or give me some idea about it?




You are using Java6, correct?


Re: Best practices on splitting an input line?

2009-02-12 Thread Steve Loughran

Stefan Podkowinski wrote:

I'm currently using OpenCSV, which can be found at
http://opencsv.sourceforge.net/, but haven't done any performance
tests on it yet. In my case simply splitting strings would not work
anyway, since I need to handle quotes and separators within quoted
values, e.g. a,"a,b",c.


I've used it in the past and found it pretty reliable. Again, no perf
tests, just reading in CSV files exported from spreadsheets.
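
For reference, a minimal sketch of parsing one quoted line with OpenCSV's
CSVReader (assuming the au.com.bytecode.opencsv package layout of that
release; the class name below is just an example and this is untested):

import java.io.StringReader;
import au.com.bytecode.opencsv.CSVReader;

public class CsvLineExample {
  public static void main(String[] args) throws Exception {
    // Parse a single line containing a quoted field with an embedded comma.
    CSVReader reader = new CSVReader(new StringReader("a,\"a,b\",c"));
    String[] fields = reader.readNext(); // ["a", "a,b", "c"]
    for (String f : fields) {
      System.out.println(f);
    }
    reader.close();
  }
}

readNext() returns the fields of the next line as a String[], with quotes
and embedded separators already handled.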


how to use FileSystem Object API

2009-02-12 Thread Ved Prakash
I'm trying to run the following code:

public class localtohdfs {
public static void main()
{
  Configuration config = new Configuration();
  FileSystem hdfs = FileSystem.get(config);
  Path srcPath = new Path("/root/testfile");
  Path dstPath = new Path("testfile_hadoop");
  hdfs.copyFromLocalFile(srcPath, dstPath);
}
}

when I run javac I see
[r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java
localtohdfs.java:3: package org.apache.hadoop does not exist
import org.apache.hadoop.*;
^
localtohdfs.java:7: cannot find symbol
symbol  : class Configuration
location: class localtohdfs
  Configuration config = new Configuration();
  ^
localtohdfs.java:7: cannot find symbol
symbol  : class Configuration
location: class localtohdfs
  Configuration config = new Configuration();
 ^
localtohdfs.java:8: cannot find symbol
symbol  : class FileSystem
location: class localtohdfs
  FileSystem hdfs = FileSystem.get(config);
  ^
localtohdfs.java:8: cannot find symbol
symbol  : variable FileSystem
location: class localtohdfs
  FileSystem hdfs = FileSystem.get(config);
^
localtohdfs.java:9: cannot find symbol
symbol  : class Path
location: class localtohdfs
  Path srcPath = new Path(/root/testfile);
  ^
localtohdfs.java:9: cannot find symbol
symbol  : class Path
location: class localtohdfs
  Path srcPath = new Path(/root/testfile);
 ^
localtohdfs.java:10: cannot find symbol
symbol  : class Path
location: class localtohdfs
  Path dstPath = new Path(testfile_hadoop);
  ^
localtohdfs.java:10: cannot find symbol
symbol  : class Path
location: class localtohdfs
  Path dstPath = new Path(testfile_hadoop);
 ^
9 errors

CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib
PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1
JAVA_HOME=/usr/java/jdk1.6.0/

Where is the problem?

thanks and regards!


Re: how to use FileSystem Object API

2009-02-12 Thread ChingShen
you have to specify a hadoop-xxx-core.jar file.

for example:
javac -cp hadoop-xxx-core.jar localtohdfs.java


On Thu, Feb 12, 2009 at 7:58 PM, Ved Prakash meramailingl...@gmail.comwrote:

 I m trying to run following code

 public class localtohdfs {
 public static void main()
 {
  Configuration config = new Configuration();
  FileSystem hdfs = FileSystem.get(config);
  Path srcPath = new Path("/root/testfile");
  Path dstPath = new Path("testfile_hadoop");
  hdfs.copyFromLocalFile(srcPath, dstPath);
 }
 }

 when I do javac i see
 [r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java
 localtohdfs.java:3: package org.apache.hadoop does not exist
 import org.apache.hadoop.*;
 ^
 localtohdfs.java:7: cannot find symbol
 symbol  : class Configuration
 location: class localtohdfs
  Configuration config = new Configuration();
  ^
 localtohdfs.java:7: cannot find symbol
 symbol  : class Configuration
 location: class localtohdfs
  Configuration config = new Configuration();
 ^
 localtohdfs.java:8: cannot find symbol
 symbol  : class FileSystem
 location: class localtohdfs
  FileSystem hdfs = FileSystem.get(config);
  ^
 localtohdfs.java:8: cannot find symbol
 symbol  : variable FileSystem
 location: class localtohdfs
  FileSystem hdfs = FileSystem.get(config);
^
 localtohdfs.java:9: cannot find symbol
 symbol  : class Path
 location: class localtohdfs
  Path srcPath = new Path(/root/testfile);
  ^
 localtohdfs.java:9: cannot find symbol
 symbol  : class Path
 location: class localtohdfs
  Path srcPath = new Path(/root/testfile);
 ^
 localtohdfs.java:10: cannot find symbol
 symbol  : class Path
 location: class localtohdfs
  Path dstPath = new Path(testfile_hadoop);
  ^
 localtohdfs.java:10: cannot find symbol
 symbol  : class Path
 location: class localtohdfs
  Path dstPath = new Path(testfile_hadoop);
 ^
 9 errors


 CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib

 PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
 HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1
 JAVA_HOME=/usr/java/jdk1.6.0/

 Where is the problem?

 thanks and regards!



Re: stable version

2009-02-12 Thread Anum Ali
yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote:

 Anum Ali wrote:

 I am working on the Hadoop SVN version 0.21.0-dev and having some problems
 running its examples from Eclipse.


 It gives an error:

 Exception in thread "main" java.lang.UnsupportedOperationException: This
 parser does not support specification "null" version "null"  at

 javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

 Can anyone resolve it or give me some idea about it?



 You are using Java6, correct?



Re: Finding small subset in very large dataset

2009-02-12 Thread Thibaut_

Thanks,

I didn't think about the bloom filter variant. That's the solution I was
looking for :-)

Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-12 Thread Miles Osborne
Bloom Filters are one of the greatest things ever, so it is nice to
see another application.

Remember that your filter may make mistakes -- it will report some items
as present that are not actually in the set.  Also, instead of setting a
single bit per item (in the A set), set k distinct bits.

You can analytically work out the best k for a given number of items
and for some amount of memory.  In practice, this usually boils down
to k being 3 or so for a reasonable error rate.
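
To make the k-bit idea concrete, here is a toy, self-contained sketch (the
hash mixing and sizes are illustrative only, not tuned for a real workload):

import java.util.BitSet;

public class ToyBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int k;

  public ToyBloomFilter(int size, int k) {
    this.bits = new BitSet(size);
    this.size = size;
    this.k = k;
  }

  // Derive the i-th of k bit positions from the key's hash code.
  private int position(String key, int i) {
    int h = key.hashCode() ^ (i * 0x9E3779B9);
    return (h & 0x7fffffff) % size;
  }

  public void add(String key) {
    for (int i = 0; i < k; i++) bits.set(position(key, i));
  }

  // May return a false positive, but never a false negative.
  public boolean mightContain(String key) {
    for (int i = 0; i < k; i++) {
      if (!bits.get(position(key, i))) return false;
    }
    return true;
  }
}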

Happy hunting

Miles

2009/2/12 Thibaut_ tbr...@blue.lu:

 Thanks,

 I didn't think about the bloom filter variant. That's the solution I was
 looking for :-)

 Thibaut
 --
 View this message in context: 
 http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: Re: Re: Re: Re: Re: Regarding Hadoop multi cluster set-up

2009-02-12 Thread shefali pawar
I changed the value... It is still not working!

Shefali

On Tue, 10 Feb 2009 22:23:10 +0530  wrote
in hadoop-site.xml
change master:54311

to hdfs://master:54311


--nitesh

On Tue, Feb 10, 2009 at 9:50 PM, shefali pawar wrote:

 I tried that, but it is not working either!

 Shefali

 On Sun, 08 Feb 2009 05:27:54 +0530  wrote
 I ran into this trouble again. This time, formatting the namenode didn't
 help. So, I changed the directories where the metadata and the data were
 being stored. That made it work.
 
 You might want to check this at your end too.
 
 Amandeep
 
 PS: I don't have an explanation for how and why this made it work.
 
 
 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz
 
 
 On Sat, Feb 7, 2009 at 9:06 AM, jason hadoop  wrote:
 
  On your master machine, use the netstat command to determine what ports
 and
  addresses the namenode process is listening on.
 
  On the datanode machines, examine the log files to verify that the
  datanode has attempted to connect to the namenode IP address on one of
  those ports, and was successful.
 
  The common ports used for the datanode - namenode rendezvous are 50010,
  54320 and 8020, depending on your Hadoop version.
 
  If the datanodes have been started, and the connection to the namenode
  failed, there will be a log message with a socket error, indicating what
  host and port the datanode used to attempt to communicate with the
  namenode.
  Verify that that IP address is correct for your namenode, and reachable
  from the datanode host (for multi-homed machines this can be an issue),
  and that the port listed is one of the TCP ports that the namenode
  process is listening on.
 
  For Linux, you can use the command
  *netstat -a -t -n -p | grep java | grep LISTEN*
  to determine the IP addresses, ports and PIDs of the Java processes
  that are listening for TCP socket connections,
 
  and the jps command from the bin directory of your Java installation to
  determine the PID of the namenode.
 
  On Sat, Feb 7, 2009 at 6:27 AM, shefali pawar  wrote:
 
   Hi,
  
   No, not yet. We are still struggling! If you find the solution please
 let
   me know.
  
   Shefali
  
   On Sat, 07 Feb 2009 02:56:15 +0530  wrote
   I had to change the master on my running cluster and ended up with
 the
   same
   problem. Were you able to fix it at your end?
   
   Amandeep
   
   
   Amandeep Khurana
   Computer Science Graduate Student
   University of California, Santa Cruz
   
   
   On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar wrote:
   
Hi,
   
I do not think that the firewall is blocking the port because it
 has
   been
turned off on both the computers! And also since it is a random
 port
   number
I do not think it should create a problem.
   
I do not understand what is going wrong!
   
Shefali
   
On Wed, 04 Feb 2009 23:23:04 +0530  wrote
I'm not certain that the firewall is your problem but if that port
 is
blocked on your master you should open it to let communication
  through.
Here
is one website that might be relevant:


   
  
 
 http://stackoverflow.com/questions/255077/open-ports-under-fedora-core-8-for-vmware-server

but again, this may not be your problem.

John

On Wed, Feb 4, 2009 at 12:46 PM, shefali pawar wrote:

 Hi,

 I will have to check. I can do that tomorrow in college. But if
  that
   is
the
 case what should i do?

 Should i change the port number and try again?

 Shefali

 On Wed, 04 Feb 2009 S D wrote :

 Shefali,
 
 Is your firewall blocking port 54310 on the master?
 
 John
 
 On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar  wrote:
 
   Hi,
  
   I am trying to set-up a two node cluster using Hadoop0.19.0,
  with
   1
   master(which should also work as a slave) and 1 slave node.
  
   But while running bin/start-dfs.sh the datanode is not
 starting
   on
the
   slave. I had read the previous mails on the list, but
 nothing
   seems
to
 be
   working in this case. I am getting the following error in
 the
   hadoop-root-datanode-slave log file while running the
 command
   bin/start-dfs.sh =
  
   2009-02-03 13:00:27,516 INFO
   org.apache.hadoop.hdfs.server.datanode.DataNode:
 STARTUP_MSG:
  
 /
   STARTUP_MSG: Starting DataNode
   STARTUP_MSG:  host = slave/172.16.0.32
   STARTUP_MSG:  args = []
   STARTUP_MSG:  version = 0.19.0
   STARTUP_MSG:  build =
  
   https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19-r
   713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
  
 /
   2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client:
   Retrying
 connect
   to server: 

Re: HDFS on non-identical nodes

2009-02-12 Thread Brian Bockelman


On Feb 12, 2009, at 2:54 AM, Deepak wrote:


Hi,

We're running a Hadoop cluster on 4 nodes; our primary purpose is to
provide a distributed storage solution for internal applications here
at TellyTopia Inc.

Our cluster consists of non-identical nodes (one with 1TB, another two
with 3TB, and one more with 60GB). While copying data onto HDFS we
noticed that the node with 60GB of storage ran out of disk space, and
even the balancer couldn't balance because the cluster was stopped. Now
my questions are

1. Is Hadoop suitable for non-identical cluster nodes?


Yes.  Our cluster has between 60GB and 40TB on our nodes.  The  
majority have around 3TB.




2. Is there any way to balance nodes automatically?


We have a cron script which automatically starts the Balancer.  It's  
dirty, but it works.




3. Why does the Hadoop cluster stop when one node runs out of disk?



That's not normal.  Trust me, if that was always true, we'd be  
perpetually screwed :)


There might be some other underlying error you're missing...

Brian


Any further inputs are appreciated!

Cheers,
Deepak
TellyTopia Inc.




Re: what's going on :( ?

2009-02-12 Thread Mark Kerzner
I see it is picking up other parameters from config, so my hypothesis is
that in 0.19 the file system is listening on 8020. I went back to 18.3 and
also did not change this hdfs port this time, preempting the question, so I
am fine for now.
Mark

On Thu, Feb 12, 2009 at 1:40 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 Hi, Mark

  Try to add an extra property to that file, and see whether
  Hadoop recognizes it.
  This way you can find out if Hadoop uses your configuration file.
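
  A minimal sketch of that check (the class name here is just an example);
  it prints whichever value Hadoop actually resolved for fs.default.name,
  so you can tell whether your hadoop-site.xml is on the classpath:

  import org.apache.hadoop.conf.Configuration;

  public class WhichConfig {
    public static void main(String[] args) {
      // Prints the value from hadoop-site.xml if the file is being picked
      // up, or the built-in default if it is not on the classpath.
      Configuration conf = new Configuration();
      System.out.println(conf.get("fs.default.name"));
    }
  }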

 2009/2/10 Jeff Hammerbacher ham...@cloudera.com:
  Hey Mark,
 
  In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020.
  From my understanding of the code, your fs.default.name setting should
  have overridden this port to be 9000. It appears your Hadoop
  installation has not picked up the configuration settings
  appropriately. You might want to see if you have any Hadoop processes
  running and terminate them (bin/stop-all.sh should help) and then
  restart your cluster with the new configuration to see if that helps.
 
  Later,
  Jeff
 
  On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote:
  Mark Kerzner wrote:
 
  Hi,
  Hi,
 
  why is hadoop suddenly telling me
 
   Retrying connect to server: localhost/127.0.0.1:8020
 
  with this configuration
 
   <configuration>
    <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
    </property>
    <property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  
  
   Shouldn't this be
  
   <value>hdfs://localhost:9001</value>
  
   Amar
  
    </property>
    <property>
  <name>dfs.replication</name>
  <value>1</value>
    </property>
   </configuration>
 
  and both this http://localhost:50070/dfshealth.jsp and this
  http://localhost:50030/jobtracker.jsp links work fine?
 
  Thank you,
  Mark
 
 
 
 
 



 --
 M. Raşit ÖZDAŞ



Re: HDFS on non-identical nodes

2009-02-12 Thread He Chen
 I think you should confirm your balancer is still running. Did you change
the threshold of the HDFS balancer? Maybe it is too large?

The balancer will stop working when it meets one of 5 conditions:

1. The datanodes are balanced (obviously not your case);
2. No more blocks can be moved (all blocks on the unbalanced nodes are busy or
were recently used);
3. No block has been moved in 20 minutes over 5 consecutive attempts;
4. Another balancer is already running;
5. An I/O exception occurs.


The default threshold is 10% for each datanode: for 1TB that is 100GB, for 3TB
it is 300GB, and for 60GB it is 6GB.

Hope that helps.


On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman bbock...@cse.unl.eduwrote:


 On Feb 12, 2009, at 2:54 AM, Deepak wrote:

 Hi,

  We're running a Hadoop cluster on 4 nodes; our primary purpose is to
  provide a distributed storage solution for internal applications here
  at TellyTopia Inc.

  Our cluster consists of non-identical nodes (one with 1TB, another two
  with 3TB, and one more with 60GB). While copying data onto HDFS we
  noticed that the node with 60GB of storage ran out of disk space, and
  even the balancer couldn't balance because the cluster was stopped. Now
  my questions are

  1. Is Hadoop suitable for non-identical cluster nodes?


 Yes.  Our cluster has between 60GB and 40TB on our nodes.  The majority
 have around 3TB.


  2. Is there any way to balance nodes automatically?


 We have a cron script which automatically starts the Balancer.  It's dirty,
 but it works.


  3. Why does the Hadoop cluster stop when one node runs out of disk?


 That's not normal.  Trust me, if that was always true, we'd be perpetually
 screwed :)

 There might be some other underlying error you're missing...

 Brian


  Any further inputs are appreciated!

 Cheers,
 Deepak
 TellyTopia Inc.





-- 
Chen He
RCF CSE Dept.
University of Nebraska-Lincoln
US


Re: stable version

2009-02-12 Thread Steve Loughran

Anum Ali wrote:

yes


On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote:


Anum Ali wrote:


I am working on the Hadoop SVN version 0.21.0-dev and having some problems
running its examples from Eclipse.


It gives an error:

Exception in thread "main" java.lang.UnsupportedOperationException: This
parser does not support specification "null" version "null"  at

javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)

Can anyone resolve it or give me some idea about it?



You are using Java6, correct?





Well, in that case something being passed down to setXIncludeAware may
be picked up as invalid. More of a stack trace may help. Otherwise, now
is your chance to learn your way around the Hadoop codebase, and ensure
that when the next version ships, your most pressing bugs have been fixed.
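
One quick way to narrow that down (a diagnostic sketch only, not a fix):
print which DocumentBuilderFactory implementation is actually being loaded,
since an old XML parser jar on the classpath that predates setXIncludeAware
is a common cause of this exception.

import javax.xml.parsers.DocumentBuilderFactory;

public class WhichParser {
  public static void main(String[] args) {
    // If this prints an old Xerces class picked up from a jar on the
    // classpath rather than the JDK 6 built-in parser, that jar is the
    // likely source of the UnsupportedOperationException above.
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    System.out.println(dbf.getClass().getName());
  }
}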


RE: How to use DBInputFormat?

2009-02-12 Thread Brian MacKay
Amandeep,

I spoke w/ one of our Oracle DBA's and he suggested changing the query 
statement as follows:

MySQL stmt:
select * from TABLE limit splitlength offset splitstart
---
Oracle stmt:
select *
  from (select a.*, rownum rno
          from (your_query_here /* must contain an order by */) a
         where rownum <= splitstart + splitlength)
 where rno > splitstart;

This can be put into a function, but would require a type as well.
-

If you edit org.apache.hadoop.mapred.lib.db.DBInputFormat, getSelectQuery, it 
should work in Oracle

protected String getSelectQuery() {

... edit to include check for driver and create Oracle Stmt

  return query.toString();
}
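
As a rough illustration of that edit, a helper along these lines could wrap
whatever inner query the class builds in the DBA's ROWNUM pagination (the
class and method names here are hypothetical, not the shipped DBInputFormat
code):

public class OracleSplitQuery {
  // Wrap an arbitrary inner query (which must contain an ORDER BY) in
  // Oracle-style ROWNUM pagination covering rows (start, start + length].
  static String wrapForOracle(String innerQuery, long start, long length) {
    return "SELECT * FROM (SELECT a.*, ROWNUM rno FROM (" + innerQuery
        + ") a WHERE ROWNUM <= " + (start + length)
        + ") WHERE rno > " + start;
  }

  public static void main(String[] args) {
    System.out.println(wrapForOracle("select * from orders order by id", 100, 50));
  }
}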


Brian

==
 On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:

 
 The 0.19 DBInputFormat class implementation is IMHO only suitable for
 very simple queries working on only a few datasets. That's due to the
 fact that it tries to create splits from the query by
 1) getting a count of all rows using the specified count query (huge
 performance impact on large tables)
 2) creating splits by issuing an individual query for each split with
 a limit and offset parameter appended to the input SQL query

 Effectively your input query "select * from orders" would become
 "select * from orders limit splitlength offset splitstart" and be
 executed until the count has been reached. I guess this is not working SQL
 syntax for Oracle.

 Stefan


 2009/2/4 Amandeep Khurana ama...@gmail.com:
   
 Adding a semicolon gives me the error ORA-00911: Invalid character

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 
 Amandeep,
  "SQL command not properly ended"
  I get this error whenever I forget the semicolon at the end.
  I know it doesn't make sense, but I recommend giving it a try.

 Rasit

 2009/2/4 Amandeep Khurana ama...@gmail.com:
   
 The same query is working if I write a simple JDBC client and query the
 database. So, I'm probably doing something wrong in the connection
 
 settings.
   
 But the error looks to be on the query side more than the connection
 
 side.
   
 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com
 
 wrote:
   
 Thanks Kevin

  I couldn't get it to work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
 java.io.IOException: ORA-00933: SQL command not properly ended

   at

   
 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
   
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
   at

 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 java.io.IOException: Job failed!
   at
   
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
   
   at LoadTable1.run(LoadTable1.java:130)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
   at LoadTable1.main(LoadTable1.java:107)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
   
 Source)
   
   at java.lang.reflect.Method.invoke(Unknown Source)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
   at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
   at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 Exception closing file

   
 /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
   
 java.io.IOException: Filesystem closed
   at
   
 org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
   
   at
   
 org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
   
   at

   
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
   
   at

   
 

Too many open files in 0.18.3

2009-02-12 Thread Sean Knapp
Hi all,
I'm continually running into the "Too many open files" error on 0.18.3:

DataXceiveServer: java.io.IOException: Too many open files

at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)

at
 sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)

at
 sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96)

at
 org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)

at java.lang.Thread.run(Thread.java:619)



I'm writing thousands of files in the course of a few minutes, but nothing
that seems too unreasonable, especially given the numbers below. I begin
getting a surge of these warnings right as I hit 1024 files open by the
DataNode:

had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls
 /proc/{}/fd | wc -l

1023



This is a bit unexpected, however, since I've configured my open file limit
to be 16k:

had...@u10:~$ ulimit -a

core file size  (blocks, -c) 0

data seg size   (kbytes, -d) unlimited

scheduling priority (-e) 0

file size   (blocks, -f) unlimited

pending signals (-i) 268288

max locked memory   (kbytes, -l) 32

max memory size (kbytes, -m) unlimited

open files  (-n) 16384

pipe size(512 bytes, -p) 8

POSIX message queues (bytes, -q) 819200

real-time priority  (-r) 0

stack size  (kbytes, -s) 8192

cpu time   (seconds, -t) unlimited

max user processes  (-u) 268288

virtual memory  (kbytes, -v) unlimited

file locks  (-x) unlimited



Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml.

Thanks in advance,
Sean


Re: Too many open files in 0.18.3

2009-02-12 Thread Mark Kerzner
I once had too many open files when I was opening too many sockets and not
closing them...

On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp s...@ooyala.com wrote:

 Hi all,
  I'm continually running into the "Too many open files" error on 0.18.3:

 DataXceiveServer: java.io.IOException: Too many open files
 
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
 
at
 
 sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)
 
at
  sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96)
 
at
  org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
 
at java.lang.Thread.run(Thread.java:619)
 


 I'm writing thousands of files in the course of a few minutes, but nothing
 that seems too unreasonable, especially given the numbers below. I begin
 getting a surge of these warnings right as I hit 1024 files open by the
 DataNode:

 had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls
  /proc/{}/fd | wc -l
 
 1023
 


 This is a bit unexpected, however, since I've configured my open file limit
 to be 16k:

 had...@u10:~$ ulimit -a
 
 core file size  (blocks, -c) 0
 
 data seg size   (kbytes, -d) unlimited
 
 scheduling priority (-e) 0
 
 file size   (blocks, -f) unlimited
 
 pending signals (-i) 268288
 
 max locked memory   (kbytes, -l) 32
 
 max memory size (kbytes, -m) unlimited
 
 open files  (-n) 16384
 
 pipe size(512 bytes, -p) 8
 
 POSIX message queues (bytes, -q) 819200
 
 real-time priority  (-r) 0
 
 stack size  (kbytes, -s) 8192
 
 cpu time   (seconds, -t) unlimited
 
 max user processes  (-u) 268288
 
 virtual memory  (kbytes, -v) unlimited
 
 file locks  (-x) unlimited
 


 Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml.

 Thanks in advance,
 Sean



Eclipse plugin

2009-02-12 Thread Iman

Hi,
I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse plug-in
for Hadoop. I have followed all the steps in this tutorial:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
problem is that I am not able to browse the HDFS. It only shows an entry
"Error: null". "Upload files to DFS" and "Create new directory" fail. Any
suggestions? I have tried to change all the directories in the hadoop
location advanced parameters to /tmp/hadoop-user, but it did not work.
Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to
be changed, but I could not find it in the list of parameters.

Thanks
Iman


Measuring IO time in map/reduce jobs?

2009-02-12 Thread Bryan Duxbury

Hey all,

Does anyone have any experience trying to measure IO time spent in  
their map/reduce jobs? I know how to profile a sample of map and  
reduce tasks, but that appears to exclude IO time. Just subtracting  
the total cpu time from the total run time of a task seems like too  
coarse an approach.


-Bryan


Re: Eclipse plugin

2009-02-12 Thread Norbert Burger
Are you running Eclipse on Windows?  If so, be aware that you need to spawn
Eclipse from within Cygwin in order to access HDFS.  It seems that the
plugin uses "whoami" to get info about the active user.  This thread has
some more info:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e

Norbert

On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote:

 Hi,
 I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in of
 hadoop. I have followed all the steps in this tutorial:
 http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
 problem is that I am not able to browse the HDFS. It only shows an entry
 Error:null. Upload files to DFS, and Create new directory fail. Any
 suggestions? I have tried to change all the directories in the hadoop
 location advanced parameters to /tmp/hadoop-user, but it did not work.
 Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to be
 changed, but I could not find it in the list of parameters.
 Thanks
 Iman



Hadoop User Group Meeting (Bay Area) 2/18

2009-02-12 Thread Ajay Anand
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday,
February 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building
2, Training Rooms 5 & 6 from 6:00-7:30 pm.

 

Agenda:

Fair Scheduler for Hadoop - Matei Zaharia

Interfacing with MySQL - Aaron Kimball

 

Registration: http://upcoming.yahoo.com/event/1776616/

 

As always, suggestions for topics for future meetings are welcome.
Please send them to me directly at aan...@yahoo-inc.com

 

Look forward to seeing you there!

Ajay

 



Re: Eclipse plugin

2009-02-12 Thread Iman

Thank you so much, Norbert. It worked.
Iman
Norbert Burger wrote:

Are you running Eclipse on Windows?  If so, be aware that you need to spawn
Eclipse from within Cygwin in order to access HDFS.  It seems that the
plugin uses whoami to get info about the active user.  This thread has
some more info:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e

Norbert

On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote:
  

Hi,
I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in of
hadoop. I have followed all the steps in this tutorial:
http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My
problem is that I am not able to browse the HDFS. It only shows an entry
Error:null. Upload files to DFS, and Create new directory fail. Any
suggestions? I have tried to change all the directories in the hadoop
location advanced parameters to /tmp/hadoop-user, but it did not work.
Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to be
changed, but I could not find it in the list of parameters.
Thanks
Iman




  




Re: Too many open files in 0.18.3

2009-02-12 Thread Raghu Angadi


You are most likely hit by 
https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back 
ported. There is a 0.18 patch posted there.


btw, does 16k help in your case?

Ideally 1k should be enough (with small number of clients). Please try 
the above patch with 1k limit.


Raghu.

Sean Knapp wrote:

Hi all,
I'm continually running into the "Too many open files" error on 0.18.3:

DataXceiveServer: java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at

sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)


at

sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96)


at

org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)


at java.lang.Thread.run(Thread.java:619)


I'm writing thousands of files in the course of a few minutes, but nothing
that seems too unreasonable, especially given the numbers below. I begin
getting a surge of these warnings right as I hit 1024 files open by the
DataNode:

had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls

/proc/{}/fd | wc -l


1023


This is a bit unexpected, however, since I've configured my open file limit
to be 16k:

had...@u10:~$ ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals (-i) 268288
max locked memory   (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files  (-n) 16384
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) 268288
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited


Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml.

Thanks in advance,
Sean





Running RowCounter as Standalone

2009-02-12 Thread Philipp Dobrigkeit
I am trying to run HBase as a Data Source for my Map/Reduce.

I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output 
test... from the command line.

But I would like to run it from my eclipse as well, so I can learn and adapt it 
to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, 
hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the 
code of RowCounter. 

But I get the following error message:
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/commons/cli/ParseException
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at hbase.RowCounter.main(RowCounter.java:139)
Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.cli.ParseException
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
... 2 more

Any comments on what I can try to do? 

Best,
Philipp


Re: Running RowCounter as Standalone

2009-02-12 Thread Jean-Daniel Cryans
Philipp,

For HBase-related questions, please post to hbase-u...@hadoop.apache.org

Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the
lib folder just to be sure you won't get any other missing class def error.

J-D

On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit pdobrigk...@gmx.dewrote:

 I am trying to run HBase as a Data Source for my Map/Reduce.

 I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output
 test... from the command line.

 But I would like to run it from my eclipse as well, so I can learn and
 adapt it to my own Map/Reduce needs. I opened a new project, imported
 hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I
 am trying to run the code of RowCounter.

 But I get the following error message:
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/commons/cli/ParseException
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
at hbase.RowCounter.main(RowCounter.java:139)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.commons.cli.ParseException
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
... 2 more

 Any comments on what I can try to do?

 Best,
 Philipp



Re: Running RowCounter as Standalone [solved]

2009-02-12 Thread Philipp Dobrigkeit
Thank you, and sorry for the mix-up with the lists; since I call ./hadoop
to run it, I was convinced it might be more related to this list.

But I have now imported all the .jars in the lib folders of both Hadoop and
HBase, and now it works.

Best, Philipp

 Original Message 
 Date: Thu, 12 Feb 2009 18:54:00 -0500
 From: Jean-Daniel Cryans jdcry...@apache.org
 To: core-user@hadoop.apache.org, hbase-u...@hadoop.apache.org 
 hbase-u...@hadoop.apache.org
 Subject: Re: Running RowCounter as Standalone

 Philipp,
 
 For HBase-related questions, please post to hbase-u...@hadoop.apache.org
 
 Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the
 lib folder just to be sure you won't get any other missing class def
 error.
 
 J-D
 
 On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit
 pdobrigk...@gmx.dewrote:
 
  I am trying to run HBase as a Data Source for my Map/Reduce.
 
  I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter
 output
  test... from the command line.
 
  But I would like to run it from my eclipse as well, so I can learn and
  adapt it to my own Map/Reduce needs. I opened a new project, imported
  hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and
 now I
  am trying to run the code of RowCounter.
 
  But I get the following error message:
  Exception in thread main java.lang.NoClassDefFoundError:
  org/apache/commons/cli/ParseException
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59)
 at hbase.RowCounter.main(RowCounter.java:139)
  Caused by: java.lang.ClassNotFoundException:
  org.apache.commons.cli.ParseException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
 ... 2 more
 
  Any comments on what I can try to do?
 
  Best,
  Philipp



Very large file copied to cluster, and the copy fails. All blocks bad

2009-02-12 Thread Saptarshi Guha
hello,
I have a 42 GB file on the local fs (call the machine A) which I need
to copy to HDFS (replication 1); according to the HDFS web tracker it
has 208GB across 7 machines.
Note, machine A has about 80 GB total, so there is no place to
store copies of the file.
Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails,
with all blocks being bad. This is not surprising since the file is
copied entirely to the HDFS region that resides on A. Had the file
been spread across all machines, this would not have failed.

I have more experience with MapReduce and not much with the HDFS side
of things.
Is there a configuration option I'm missing that forces the file to be
split across the machines (when it is being copied)?
-- 
Saptarshi Guha - saptarshi.g...@gmail.com


Re: Too many open files in 0.18.3

2009-02-12 Thread Sean Knapp
Raghu,
Thanks for the quick response. I've been beating up on the cluster for a
while now and so far so good. I'm still at 8k... what should I expect to
find with 16k versus 1k? The 8k didn't appear to be affecting things to
begin with.

Regards,
Sean

On Thu, Feb 12, 2009 at 2:07 PM, Raghu Angadi rang...@yahoo-inc.com wrote:


 You are most likely hit by
 https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back
 ported. There is a 0.18 patch posted there.

 btw, does 16k help in your case?

 Ideally 1k should be enough (with small number of clients). Please try the
 above patch with 1k limit.

 Raghu.


 Sean Knapp wrote:

 Hi all,
  I'm continually running into the "Too many open files" error on 0.18.3:

 DataXceiveServer: java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at


 sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145)

 at

 sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96)

 at

 org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)

 at java.lang.Thread.run(Thread.java:619)


 I'm writing thousands of files in the course of a few minutes, but nothing
 that seems too unreasonable, especially given the numbers below. I begin
 getting a surge of these warnings right as I hit 1024 files open by the
 DataNode:

 had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls

 /proc/{}/fd | wc -l

  1023


 This is a bit unexpected, however, since I've configured my open file
 limit
 to be 16k:

 had...@u10:~$ ulimit -a
 core file size  (blocks, -c) 0
 data seg size   (kbytes, -d) unlimited
 scheduling priority (-e) 0
 file size   (blocks, -f) unlimited
 pending signals (-i) 268288
 max locked memory   (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files  (-n) 16384
 pipe size(512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 real-time priority  (-r) 0
 stack size  (kbytes, -s) 8192
 cpu time   (seconds, -t) unlimited
 max user processes  (-u) 268288
 virtual memory  (kbytes, -v) unlimited
 file locks  (-x) unlimited


 Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml.

 Thanks in advance,
 Sean





Re: Very large file copied to cluster, and the copy fails. All blocks bad

2009-02-12 Thread TCK

Did you run the copy command from machine A? I believe that if you do the copy 
from an hdfs client that is on the same machine as a data node, then for each 
block the primary copy always goes to that data node, and only the replicas get 
distributed among other data nodes. I ran into this issue once -- I had to have 
the client doing the copy either on the master or on an off-cluster node.
-TCK



--- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote:
From: Saptarshi Guha saptarshi.g...@gmail.com
Subject: Very large file copied to cluster, and the copy fails. All blocks bad
To: core-user@hadoop.apache.org core-user@hadoop.apache.org
Date: Thursday, February 12, 2009, 9:50 PM

hello,
I have a 42 GB file on the local fs (call the machine A) which I need
to copy to HDFS (replication 1); according to the HDFS web tracker it
has 208GB across 7 machines.
Note, machine A has about 80 GB total, so there is no place to
store copies of the file.
Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails,
with all blocks being bad. This is not surprising since the file is
copied entirely to the HDFS region that resides on A. Had the file
been spread across all machines, this would not have failed.

I have more experience with MapReduce and not much with the HDFS side
of things.
Is there a configuration option I'm missing that forces the file to be
split across the machines (when it is being copied)?
-- 
Saptarshi Guha - saptarshi.g...@gmail.com



  

Re: Very large file copied to cluster, and the copy fails. All blocks bad

2009-02-12 Thread Saptarshi Guha
 Did you run the copy command from machine A?
Yes, exactly.
 I had to have the client doing the copy either on the master or on an 
 off-cluster
 Thanks! I uploaded it from an off-cluster machine (i.e. one not participating in
the HDFS) and it worked splendidly.

Regards
Saptarshi


On Thu, Feb 12, 2009 at 11:03 PM, TCK moonwatcher32...@yahoo.com wrote:

 I believe that if you do the copy from an hdfs client that is on the
same machine as a data node, then for each block the primary copy
always goes to that data node, and only the replicas get distributed
among other data nodes. I ran into this issue once -- I had to have
the client doing the copy either on the master or on an off-cluster
node.
 -TCK



 --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote:
 From: Saptarshi Guha saptarshi.g...@gmail.com
 Subject: Very large file copied to cluster, and the copy fails. All blocks bad
 To: core-user@hadoop.apache.org core-user@hadoop.apache.org
 Date: Thursday, February 12, 2009, 9:50 PM

 hello,
 I have a 42 GB file on the local fs (call the machine A) which I need
 to copy to HDFS (replication 1); according to the HDFS web tracker it
 has 208GB across 7 machines.
 Note, machine A has about 80 GB total, so there is no place to
 store copies of the file.
 Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails,
 with all blocks being bad. This is not surprising since the file is
 copied entirely to the HDFS region that resides on A. Had the file
 been spread across all machines, this would not have failed.

 I have more experience with MapReduce and not much with the HDFS side
 of things.
 Is there a configuration option I'm missing that forces the file to be
 split across the machines (when it is being copied)?
 --
 Saptarshi Guha - saptarshi.g...@gmail.com







-- 
Saptarshi Guha - saptarshi.g...@gmail.com