Re: Best practices on splitting an input line?
Hi Andy, Your problem seems to be a general Java problem rather than a Hadoop one; you may get better help in a Java forum. String.split uses regular expressions, which you definitely don't need. I would write my own split function, without regular expressions. This link may help to better understand the underlying operations: http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split Also, there is a constructor of StringTokenizer that returns the delimiters as well: StringTokenizer(String str, String delim, boolean returnDelims); (I would write my own, though.) Rasit 2009/2/10 Andy Sautins andy.saut...@returnpath.net: I have a question. I've dabbled with different ways of tokenizing an input file line for processing. I've noticed in my somewhat limited tests that there seem to be some pretty reasonable performance differences between tokenizing methods. For example, roughly speaking, when splitting a line into tokens (tab-delimited in my case), Scanner is the slowest, followed by String.split, with StringTokenizer being the fastest. StringTokenizer, for my application, has the unfortunate characteristic of not returning blank tokens (i.e., parsing a,b,c,,d would return a,b,c,d instead of a,b,c,,d). The WordCount example uses StringTokenizer, which makes sense to me, except I'm currently getting hung up on it not returning blank tokens. I did run across the com.Ostermiller.util StringTokenizer replacement that handles null/blank tokens (http://ostermiller.org/utils/StringTokenizer.html), which seems possible to use, but it sure seems like someone else has solved this problem better than I have. So, my question is: is there a best practice for splitting an input line, especially when NULL tokens are expected (i.e., two consecutive delimiter characters)? Any thoughts would be appreciated. Thanks Andy -- M. Raşit ÖZDAŞ
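For illustration, a hand-rolled split on a single-character delimiter that keeps empty tokens (a rough sketch, not code from this thread) could look like the following; it avoids both the regex machinery behind String.split and StringTokenizer's habit of skipping empty fields:

import java.util.ArrayList;
import java.util.List;

public class SimpleSplit {
    // Split 'line' on a single-character delimiter, keeping empty tokens,
    // e.g. splitting "a\tb\tc\t\td" on '\t' yields [a, b, c, , d].
    public static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf(delim, start)) != -1) {
            tokens.add(line.substring(start, idx));
            start = idx + 1;
        }
        tokens.add(line.substring(start)); // trailing token, possibly empty
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a\tb\tc\t\td", '\t')); // prints [a, b, c, , d]
    }
}

Note also that String.split only drops trailing empty fields; passing a negative limit, e.g. line.split("\t", -1), keeps those too, at the cost of the regex overhead discussed above.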
HDFS on non-identical nodes
Hi, We're running a Hadoop cluster on 4 nodes; our primary purpose is to provide a distributed storage solution for internal applications here at TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB, another two with 3TB, and one more with 60GB). While copying data onto HDFS we noticed that the node with 60GB of storage ran out of disk space, and even the balancer couldn't balance because the cluster was stopped. Now my questions are: 1. Is Hadoop suitable for non-identical cluster nodes? 2. Is there any way to balance the nodes automatically? 3. Why does the Hadoop cluster stop when one node runs out of disk? Any further input is appreciated! Cheers, Deepak TellyTopia Inc.
Re: stable version
I am working on Hadoop SVN version 0.21.0-dev. I am having some problems running its examples/files from Eclipse. It gives this error: Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone resolve this or give some idea about it? Thanks. 2009/2/12 Raghu Angadi rang...@yahoo-inc.com Vadim Zaliva wrote: The particular problem I am having is this one: https://issues.apache.org/jira/browse/HADOOP-2669 I am observing it in version 19. Could anybody confirm that it has been fixed in 18, as Jira claims? I am wondering why the bug fix for this problem might have been committed to the 18 branch but not 19. If it was committed to both, then perhaps the problem was not completely solved and downgrading to 18 will not help me. If you read through the comments, you will see that the root cause was never found. The patch just fixes one of the suspects. If you are still seeing this, please file another jira and link it to HADOOP-2669. How easy is it for you to reproduce this? I guess one of the reasons for the incomplete diagnosis is that it is not simple to reproduce. Raghu. Vadim On Wed, Feb 11, 2009 at 00:48, Rasit OZDAS rasitoz...@gmail.com wrote: Yes, version 18.3 is the most stable one. It has additional patches, without unproven new functionality. 2009/2/11 Owen O'Malley omal...@apache.org: On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote: Maybe version 0.18 is better suited for a production environment? Yahoo is mostly on 0.18.3 + some patches at this point. -- Owen -- M. Raşit ÖZDAŞ
Re: Backing up HDFS?
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote: The key here is to prioritize your data. Impossible-to-replicate data gets backed up using whatever means necessary; hard-to-regenerate data is the next priority. Easy-to-regenerate and OK-to-nuke data doesn't get backed up. I think that's good advice to start with when creating a backup strategy. E.g. what we do at the moment is analyze huge volumes of access logs: we import those logs into HDFS, create aggregates for several metrics, and finally store the results in sequence files using block-level compression. It's kind of an intermediate format that can be used for further analysis. Those files end up being pretty small and are exported daily to storage and backed up. In case HDFS goes to hell we can restore some raw log data from the servers and only lose historical logs, which should not be a big deal. I must also add that I really enjoy the great deal of optimization opportunities that Hadoop gives you by letting you implement the serialization strategies directly. You really get control over every bit and byte that gets recorded. Same with compression. So you can make the best trade-offs possible and finally store only the data you really need.
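As a rough illustration of that kind of intermediate format (a sketch only, not the poster's actual code; the output path and key/value types are invented), writing a block-compressed SequenceFile looks roughly like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteAggregates {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/aggregates/2009-02-12.seq"); // hypothetical output path
        // BLOCK compression packs many records into each compressed block,
        // which suits small, repetitive aggregate records well.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, LongWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
            writer.append(new Text("hits:/index.html"), new LongWritable(12345));
        } finally {
            writer.close();
        }
    }
}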
Re: stable version
Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct?
Re: Best practices on splitting an input line?
Stefan Podkowinski wrote: I'm currently using OpenCSV, which can be found at http://opencsv.sourceforge.net/, but haven't done any performance tests on it yet. In my case simply splitting strings would not work anyway, since I need to handle quotes and separators within quoted values, e.g. "a,a",b,c. I've used it in the past; found it pretty reliable. Again, no perf tests, just reading in CSV files exported from spreadsheets.
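For what it's worth, a minimal OpenCSV usage sketch (assuming the classic au.com.bytecode.opencsv package; not code from this thread) for a field that contains the separator inside quotes:

import java.io.StringReader;
import au.com.bytecode.opencsv.CSVReader;

public class QuotedCsvExample {
    public static void main(String[] args) throws Exception {
        // Three fields, the first of which contains a comma protected by quotes.
        CSVReader reader = new CSVReader(new StringReader("\"a,a\",b,c"));
        String[] fields = reader.readNext(); // ["a,a", "b", "c"]
        for (String field : fields) {
            System.out.println(field);
        }
        reader.close();
    }
}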
how to use FileSystem Object API
I'm trying to run the following code: public class localtohdfs { public static void main() { Configuration config = new Configuration(); FileSystem hdfs = FileSystem.get(config); Path srcPath = new Path("/root/testfile"); Path dstPath = new Path("testfile_hadoop"); hdfs.copyFromLocalFile(srcPath, dstPath); } } When I run javac I see: [r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java localtohdfs.java:3: package org.apache.hadoop does not exist import org.apache.hadoop.*; ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:8: cannot find symbol symbol : class FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:8: cannot find symbol symbol : variable FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path("/root/testfile"); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path("/root/testfile"); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path("testfile_hadoop"); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path("testfile_hadoop"); ^ 9 errors CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1 JAVA_HOME=/usr/java/jdk1.6.0/ Where is the problem? Thanks and regards!
Re: how to use FileSystem Object API
you have to specify a hadoop-xxx-core.jar file. for example: javac -cp hadoop-xxx-core.jar localtohdfs.java On Thu, Feb 12, 2009 at 7:58 PM, Ved Prakash meramailingl...@gmail.comwrote: I m trying to run following code public class localtohdfs { public static void main() { Configuration config = new Configuration(); FileSystem hdfs = FileSystem.get(config); Path srcPath = new Path(/root/testfile); Path dstPath = new Path(testfile_hadoop); hdfs.copyFromLocalFile(srcPath, dstPath); } } when I do javac i see [r...@nlb-2 hadoop-0.17.2.1]# javac localtohdfs.java localtohdfs.java:3: package org.apache.hadoop does not exist import org.apache.hadoop.*; ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:7: cannot find symbol symbol : class Configuration location: class localtohdfs Configuration config = new Configuration(); ^ localtohdfs.java:8: cannot find symbol symbol : class FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:8: cannot find symbol symbol : variable FileSystem location: class localtohdfs FileSystem hdfs = FileSystem.get(config); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path(/root/testfile); ^ localtohdfs.java:9: cannot find symbol symbol : class Path location: class localtohdfs Path srcPath = new Path(/root/testfile); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path(testfile_hadoop); ^ localtohdfs.java:10: cannot find symbol symbol : class Path location: class localtohdfs Path dstPath = new Path(testfile_hadoop); ^ 9 errors CLASS_PATH=/root/newhadoop/hadoop-0.17.2.1/src/java:/root/newhadoop/hadoop-0.17.2.1/conf:/usr/java/jdk1.6.0/lib PATH=/usr/java/jdk1.6.0/bin:/usr/local/apache-ant-1.7.1/bin:/root/newhadoop/hadoop-0.17.2.1/bin:/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin HADOOP_HOME=/root/newhadoop/hadoop-0.17.2.1 JAVA_HOME=/usr/java/jdk1.6.0/ Where is the problem? thanks and regards!
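For reference, here is the same program with the specific imports spelled out and the standard main(String[] args) signature (an untested sketch; the paths are just the poster's examples). It compiles with something like javac -cp $HADOOP_HOME/hadoop-0.17.2.1-core.jar localtohdfs.java, and running it needs the same jars plus the conf directory on the classpath, which the bin/hadoop script sets up for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class localtohdfs {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();   // reads hadoop-site.xml from the classpath
        FileSystem hdfs = FileSystem.get(config);     // the filesystem named by fs.default.name
        Path srcPath = new Path("/root/testfile");    // local source file
        Path dstPath = new Path("testfile_hadoop");   // relative to your HDFS home directory
        hdfs.copyFromLocalFile(srcPath, dstPath);
    }
}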
Re: stable version
yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote: Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct?
Re: Finding small subset in very large dataset
Thanks, I didn't think about the bloom filter variant. That's the solution I was looking for :-) Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Finding small subset in very large dataset
Bloom Filters are one of the greatest things ever, so it is nice to see another application. Remember that your filter may make mistakes - it can report items as being in the set that are not (false positives). Also, instead of setting a single bit per item (in the A set), set k distinct bits. You can analytically work out the best k for a given number of items and for some amount of memory. In practice, this usually boils down to k being 3 or so for a reasonable error rate. Happy hunting Miles 2009/2/12 Thibaut_ tbr...@blue.lu: Thanks, I didn't think about the bloom filter variant. That's the solution I was looking for :-) Thibaut -- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
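To make the k-bits idea concrete, here is a minimal membership-test sketch (not from this thread; the two-hash trick for deriving k probe positions is just one common choice). The analytically best k for n items in m bits comes out to roughly (m/n) * ln 2:

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash probes per item

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.m = numBits;
        this.k = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th probe position from two base hashes of the item.
    private int probe(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // a second, cheap hash
        return Math.abs((h1 + i * h2) % m);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(probe(item, i));
        }
    }

    // May report items that were never added (false positives),
    // but never misses an item that was added.
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(item, i))) {
                return false;
            }
        }
        return true;
    }
}

(If I remember correctly, later Hadoop releases also ship a BloomFilter under org.apache.hadoop.util.bloom, so it is worth checking before rolling your own.)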
Re: Re: Re: Re: Re: Re: Regarding Hadoop multi cluster set-up
I changed the value... It is still not working! Shefali On Tue, 10 Feb 2009 22:23:10 +0530 wrote: in hadoop-site.xml change master:54311 to hdfs://master:54311 --nitesh On Tue, Feb 10, 2009 at 9:50 PM, shefali pawar wrote: I tried that, but it is not working either! Shefali On Sun, 08 Feb 2009 05:27:54 +0530 wrote: I ran into this trouble again. This time, formatting the namenode didn't help. So, I changed the directories where the metadata and the data were being stored. That made it work. You might want to check this at your end too. Amandeep PS: I don't have an explanation for how and why this made it work. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, Feb 7, 2009 at 9:06 AM, jason hadoop wrote: On your master machine, use the netstat command to determine what ports and addresses the namenode process is listening on. On the datanode machines, examine the log files to verify that the datanode has attempted to connect to the namenode IP address on one of those ports, and was successful. The common ports used for the datanode-namenode rendezvous are 50010, 54320 and 8020, depending on your Hadoop version. If the datanodes have been started, and the connection to the namenode failed, there will be a log message with a socket error, indicating what host and port the datanode used to attempt to communicate with the namenode. Verify that that IP address is correct for your namenode and reachable from the datanode host (for multi-homed machines this can be an issue), and that the port listed is one of the TCP ports that the namenode process is listening on. For Linux, you can use the command *netstat -a -t -n -p | grep java | grep LISTEN* to determine the IP addresses, ports, and PIDs of the java processes that are listening for TCP socket connections, and the jps command from the bin directory of your Java installation to determine the PID of the namenode. On Sat, Feb 7, 2009 at 6:27 AM, shefali pawar wrote: Hi, No, not yet. We are still struggling! If you find the solution please let me know. Shefali On Sat, 07 Feb 2009 02:56:15 +0530 wrote: I had to change the master on my running cluster and ended up with the same problem. Were you able to fix it at your end? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Thu, Feb 5, 2009 at 8:46 AM, shefali pawar wrote: Hi, I do not think that the firewall is blocking the port because it has been turned off on both the computers! And also since it is a random port number I do not think it should create a problem. I do not understand what is going wrong! Shefali On Wed, 04 Feb 2009 23:23:04 +0530 wrote: I'm not certain that the firewall is your problem, but if that port is blocked on your master you should open it to let communication through. Here is one website that might be relevant: http://stackoverflow.com/questions/255077/open-ports-under-fedora-core-8-for-vmware-server but again, this may not be your problem. John On Wed, Feb 4, 2009 at 12:46 PM, shefali pawar wrote: Hi, I will have to check. I can do that tomorrow in college. But if that is the case what should I do? Should I change the port number and try again? Shefali On Wed, 04 Feb 2009 S D wrote: Shefali, Is your firewall blocking port 54310 on the master? John On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar wrote: Hi, I am trying to set up a two-node cluster using Hadoop 0.19.0, with 1 master (which should also work as a slave) and 1 slave node. 
But while running bin/start-dfs.sh the datanode is not starting on the slave. I had read the previous mails on the list, but nothing seems to be working in this case. I am getting the following error in the hadoop-root-datanode-slave log file while running the command bin/start-dfs.sh = 2009-02-03 13:00:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = slave/172.16.0.32 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.19.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19-r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008 / 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
Re: HDFS on non-identical nodes
On Feb 12, 2009, at 2:54 AM, Deepak wrote: Hi, We're running Hadoop cluster on 4 nodes, our primary purpose of running is to provide distributed storage solution for internal applications here in TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB another two with 3 TB and one more with 60GB) while copying data on HDFS we noticed that node with 60GB storage ran out of disk-space and even balancer couldn't balance because cluster was stopped. Now my questions are 1. Is Hadoop is suitable for non-identical cluster nodes? Yes. Our cluster has between 60GB and 40TB on our nodes. The majority have around 3TB. 2. Is there any way to automatically balancing of nodes? We have a cron script which automatically starts the Balancer. It's dirty, but it works. 3. Why Hadoop cluster stops when one node ran our of disk? That's not normal. Trust me, if that was always true, we'd be perpetually screwed :) There might be some other underlying error you're missing... Brian Any futher inputs are appericiapted! Cheers, Deepak TellyTopia Inc.
Re: what's going on :( ?
I see it is picking up other parameters from the config, so my hypothesis is that in 0.19 the file system is listening on 8020. I went back to 18.3 and also did not change this HDFS port this time, preempting the question, so I am fine for now. Mark On Thu, Feb 12, 2009 at 1:40 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, Mark Try to add an extra property to that file, and examine whether hadoop recognizes it. This way you can find out if hadoop uses your configuration file. 2009/2/10 Jeff Hammerbacher ham...@cloudera.com: Hey Mark, In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020. From my understanding of the code, your fs.default.name setting should have overridden this port to be 9000. It appears your Hadoop installation has not picked up the configuration settings appropriately. You might want to see if you have any Hadoop processes running and terminate them (bin/stop-all.sh should help) and then restart your cluster with the new configuration to see if that helps. Later, Jeff On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat ama...@yahoo-inc.com wrote: Mark Kerzner wrote: Hi, why is hadoop suddenly telling me Retrying connect to server: localhost/127.0.0.1:8020 with this configuration: <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> <property> <name>mapred.job.tracker</name> <value>localhost:9001</value> Shouldn't this be <value>hdfs://localhost:9001</value>? Amar </property> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> and both the http://localhost:50070/dfshealth.jsp and http://localhost:50030/jobtracker.jsp links work fine? Thank you, Mark -- M. Raşit ÖZDAŞ
Re: HDFS on non-identical nodes
I think you should confirm your balancer is still running. Did you change the threshold of the HDFS balancer? Maybe it is too large? The balancer will stop working when it meets one of 5 conditions: 1. The datanodes are balanced (obviously yours are not this case); 2. No more blocks can be moved (all blocks on the unbalanced nodes are busy or recently used); 3. No block has been moved in 20 minutes and 5 consecutive attempts; 4. Another balancer is working; 5. I/O exception. The default threshold is 10% for each datanode: for 1TB it is 100GB, for 3TB it is 300GB, and for 60GB it is 6GB. Hope this helps. On Thu, Feb 12, 2009 at 10:06 AM, Brian Bockelman bbock...@cse.unl.edu wrote: On Feb 12, 2009, at 2:54 AM, Deepak wrote: Hi, We're running a Hadoop cluster on 4 nodes, our primary purpose is to provide a distributed storage solution for internal applications here at TellyTopia Inc. Our cluster consists of non-identical nodes (one with 1TB, another two with 3TB, and one more with 60GB); while copying data onto HDFS we noticed that the node with 60GB of storage ran out of disk space and even the balancer couldn't balance because the cluster was stopped. Now my questions are: 1. Is Hadoop suitable for non-identical cluster nodes? Yes. Our cluster has between 60GB and 40TB on our nodes. The majority have around 3TB. 2. Is there any way to balance the nodes automatically? We have a cron script which automatically starts the Balancer. It's dirty, but it works. 3. Why does the Hadoop cluster stop when one node runs out of disk? That's not normal. Trust me, if that was always true, we'd be perpetually screwed :) There might be some other underlying error you're missing... Brian Any further input is appreciated! Cheers, Deepak TellyTopia Inc. -- Chen He RCF CSE Dept. University of Nebraska-Lincoln US
Re: stable version
Anum Ali wrote: yes On Thu, Feb 12, 2009 at 4:33 PM, Steve Loughran ste...@apache.org wrote: Anum Ali wrote: Iam working on Hadoop SVN version 0.21.0-dev. Having some problems , regarding running its examples/file from eclipse. It gives error for Exception in thread main java.lang.UnsupportedOperationException: This parser does not support specification null version null at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) Can anyone reslove or give some idea about it. You are using Java6, correct? well, in that case something being passed down to setXIncludeAware may be picked up as invalid. More of a stack trace may help. Otherwise, now is your chance to learn your way around the hadoop codebase, and ensure that when the next version ships, your most pressing bugs have been fixed
RE: How to use DBInputFormat?
Amandeep, I spoke w/ one of our Oracle DBAs and he suggested changing the query statement as follows: MySQL stmt: select * from TABLE limit splitlength offset splitstart --- Oracle stmt: select * from (select a.*, rownum rno from (your_query_here, which must contain an order by) a where rownum <= splitstart + splitlength) where rno > splitstart; This can be put into a function, but would require a type as well. - If you edit org.apache.hadoop.mapred.lib.db.DBInputFormat, getSelectQuery, it should work in Oracle: protected String getSelectQuery() { ... edit to include a check for the driver and create the Oracle stmt ... return query.toString(); } Brian == On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only a few datasets. That's due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) and 2) creating splits by issuing an individual query for each split with a limit and offset parameter appended to the input SQL query. Effectively your input query select * from orders would become select * from orders limit splitlength offset splitstart and be executed until the count has been reached. I guess this is not valid SQL syntax for Oracle. Stefan 2009/2/4 Amandeep Khurana ama...@gmail.com: Adding a semicolon gives me the error ORA-00911: Invalid character Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Amandeep, SQL command not properly ended - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try. Rasit 2009/2/4 Amandeep Khurana ama...@gmail.com: The same query is working if I write a simple JDBC client and query the database. So, I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Kevin. I couldn't get it to work. Here's the error I get: bin/hadoop jar ~/dbload.jar LoadTable1 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001 09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0% 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001 java.io.IOException: ORA-00933: SQL command not properly ended at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) java.io.IOException: Job failed! 
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LoadTable1.run(LoadTable1.java:130) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at LoadTable1.main(LoadTable1.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Exception closing file /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at
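To make the MySQL-vs-Oracle rewrite above concrete, here is a small standalone sketch of the query construction (hypothetical class and column names, not the actual DBInputFormat code; the Oracle branch follows the DBA's nested-ROWNUM pattern and assumes the inner query carries an ORDER BY):

public class SplitQueryBuilder {
    // Builds the per-split SELECT for MySQL-style versus Oracle-style paging.
    public static String buildQuery(String baseQueryWithOrderBy, long splitStart,
                                    long splitLength, boolean oracle) {
        if (oracle) {
            // Oracle has no LIMIT/OFFSET, so page with nested ROWNUM instead.
            return "SELECT * FROM (SELECT a.*, ROWNUM rno FROM ("
                 + baseQueryWithOrderBy + ") a WHERE ROWNUM <= "
                 + (splitStart + splitLength) + ") WHERE rno > " + splitStart;
        }
        // MySQL-style paging, as described for the stock 0.19 implementation above.
        return baseQueryWithOrderBy + " LIMIT " + splitLength + " OFFSET " + splitStart;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("SELECT * FROM orders ORDER BY order_id", 1000, 500, true));
        System.out.println(buildQuery("SELECT * FROM orders ORDER BY order_id", 1000, 500, false));
    }
}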
Too many open files in 0.18.3
Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: Too many open files in 0.18.3
I once had too many open files when I was opening too many sockets and not closing them... On Thu, Feb 12, 2009 at 1:56 PM, Sean Knapp s...@ooyala.com wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Eclipse plugin
Hi, I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse plug-in of Hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS and Create new directory fail. Any suggestions? I have tried to change all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorial mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Measuring IO time in map/reduce jobs?
Hey all, Does anyone have any experience trying to measure IO time spent in their map/reduce jobs? I know how to profile a sample of map and reduce tasks, but that appears to exclude IO time. Just subtracting the total cpu time from the total run time of a task seems like too coarse an approach. -Bryan
Re: Eclipse plugin
Are you running Eclipse on Windows? If so, be aware that you need to spawn Eclipse from within Cygwin in order to access HDFS. It seems that the plugin uses whoami to get info about the active user. This thread has some more info: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e Norbert On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote: Hi, I am using the VM image hadoop-appliance-0.18.0.vmx and an Eclipse plug-in of Hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS and Create new directory fail. Any suggestions? I have tried to change all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorial mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Hadoop User Group Meeting (Bay Area) 2/18
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, February 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building 2, Training Rooms 5 & 6, from 6:00-7:30 pm. Agenda: Fair Scheduler for Hadoop - Matei Zaharia; Interfacing with MySQL - Aaron Kimball. Registration: http://upcoming.yahoo.com/event/1776616/ As always, suggestions for topics for future meetings are welcome. Please send them to me directly at aan...@yahoo-inc.com Look forward to seeing you there! Ajay
Re: Eclipse plugin
Thank you so much, Norbert. It worked. Iman Norbert Burger wrote: Are running Eclipse on Windows? If so, be aware that you need to spawn Eclipse from within Cygwin in order to access HDFS. It seems that the plugin uses whoami to get info about the active user. This thread has some more info: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200807.mbox/%3c487cd747.8050...@signal7.de%3e Norbert On 2/12/09, Iman ielgh...@cs.uwaterloo.ca wrote: Hi, I am using VM image hadoop-appliance-0.18.0.vmx and an eclipse plug-in of hadoop. I have followed all the steps in this tutorial: http://public.yahoo.com/gogate/hadoop-tutorial/html/module3.html. My problem is that I am not able to browse the HDFS. It only shows an entry Error:null. Upload files to DFS, and Create new directory fail. Any suggestions? I have tried to chang all the directories in the hadoop location advanced parameters to /tmp/hadoop-user, but it did not work. Also, the tutorials mentioned a parameter hadoop.job.ugi that needs to be changed, but I could not find it in the list of parameters. Thanks Iman
Re: Too many open files in 0.18.3
You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with small number of clients). Please try the above patch with 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Running RowCounter as Standalone
I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Re: Running RowCounter as Standalone
Philipp, For HBase-related questions, please post to hbase-u...@hadoop.apache.org Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the lib folder just to be sure you won't get any other missing class def error. J-D On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit pdobrigk...@gmx.dewrote: I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Re: Running RowCounter as Standalone [solved]
Thank you and sorry for the mixup with the lists, I was convinced since I call ./hadoop to run it, that it might be more related to this list. But I now imported all the .jars in the lib folders of both Hadoop and Hbase and now it works. Best, Philipp Original-Nachricht Datum: Thu, 12 Feb 2009 18:54:00 -0500 Von: Jean-Daniel Cryans jdcry...@apache.org An: core-user@hadoop.apache.org, hbase-u...@hadoop.apache.org hbase-u...@hadoop.apache.org Betreff: Re: Running RowCounter as Standalone Philipp, For HBase-related questions, please post to hbase-u...@hadoop.apache.org Try importing commons-cli-2.0-SNAPSHOT.jar as well as any other jar in the lib folder just to be sure you won't get any other missing class def error. J-D On Thu, Feb 12, 2009 at 6:32 PM, Philipp Dobrigkeit pdobrigk...@gmx.dewrote: I am trying to run HBase as a Data Source for my Map/Reduce. I was able to use the ./bin/hadoop jar hbase.0.19.0.jar rowcounter output test... from the command line. But I would like to run it from my eclipse as well, so I can learn and adapt it to my own Map/Reduce needs. I opened a new project, imported hbase.0.19.0.jar, hadoop.0.19.0-core.jar (and the commons-logging) and now I am trying to run the code of RowCounter. But I get the following error message: Exception in thread main java.lang.NoClassDefFoundError: org/apache/commons/cli/ParseException at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:59) at hbase.RowCounter.main(RowCounter.java:139) Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli.ParseException at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276) at java.lang.ClassLoader.loadClass(ClassLoader.java:251) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319) ... 2 more Any comments on what I can try to do? Best, Philipp -- Jetzt 1 Monat kostenlos! GMX FreeDSL - Telefonanschluss + DSL für nur 17,95 Euro/mtl.!* http://dsl.gmx.de/?ac=OM.AD.PD003K11308T4569a
Very large file copied to cluster, and the copy fails. All blocks bad
hello, I have a 42 GB file on the local fs (call the machine A) which I need to copy to HDFS (replication 1); according to the HDFS webtracker it has 208GB across 7 machines. Note, machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option I'm missing that forces the file to be split across the machines (when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Too many open files in 0.18.3
Raghu, Thanks for the quick response. I've been beating up on the cluster for a while now and so far so good. I'm still at 8k... what should I expect to find with 16k versus 1k? The 8k didn't appear to be affecting things to begin with. Regards, Sean On Thu, Feb 12, 2009 at 2:07 PM, Raghu Angadi rang...@yahoo-inc.com wrote: You are most likely hit by https://issues.apache.org/jira/browse/HADOOP-4346 . I hope it gets back ported. There is a 0.18 patch posted there. btw, does 16k help in your case? Ideally 1k should be enough (with small number of clients). Please try the above patch with 1k limit. Raghu. Sean Knapp wrote: Hi all, I'm continually running into the Too many open files error on 18.3: DataXceiveServer: java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:145) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:96) at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997) at java.lang.Thread.run(Thread.java:619) I'm writing thousands of files in the course of a few minutes, but nothing that seems too unreasonable, especially given the numbers below. I begin getting a surge of these warnings right as I hit 1024 files open by the DataNode: had...@u10:~$ ps ux | awk '/dfs\.DataNode/ { print $2 }' | xargs -i ls /proc/{}/fd | wc -l 1023 This is a bit unexpected, however, since I've configured my open file limit to be 16k: had...@u10:~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 268288 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 16384 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 268288 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Note, I've also set dfs.datanode.max.xcievers to 8192 in hadoop-site.xml. Thanks in advance, Sean
Re: Very large file copied to cluster, and the copy fails. All blocks bad
Did you run the copy command from machine A? I believe that if you do the copy from an hdfs client that is on the same machine as a data node, then for each block the primary copy always goes to that data node, and only the replicas get distributed among other data nodes. I ran into this issue once -- I had to have the client doing the copy either on the master or on an off-cluster node. -TCK --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote: From: Saptarshi Guha saptarshi.g...@gmail.com Subject: Very large file copied to cluster, and the copy fails. All blocks bad To: core-user@hadoop.apache.org core-user@hadoop.apache.org Date: Thursday, February 12, 2009, 9:50 PM hello, I have a 42 GB file on the local fs(call the machine A) which i need to copy to a HDFS (replicattion 1), according the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option i'm missing that forces the file to be split across the machines(when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com
Re: Very large file copied to cluster, and the copy fails. All blocks bad
Did you run the copy command from machine A? Yes, exactly. I had to have the client doing the copy either on the master or on an off-cluster Thanks! I uploaded it from an off cluster (i.e not participating in the hdfs) and it worked splendidly. Regards Saptarshi On Thu, Feb 12, 2009 at 11:03 PM, TCK moonwatcher32...@yahoo.com wrote: I believe that if you do the copy from an hdfs client that is on the same machine as a data node, then for each block the primary copy always goes to that data node, and only the replicas get distributed among other data nodes. I ran into this issue once -- I had to have the client doing the copy either on the master or on an off-cluster node. -TCK --- On Thu, 2/12/09, Saptarshi Guha saptarshi.g...@gmail.com wrote: From: Saptarshi Guha saptarshi.g...@gmail.com Subject: Very large file copied to cluster, and the copy fails. All blocks bad To: core-user@hadoop.apache.org core-user@hadoop.apache.org Date: Thursday, February 12, 2009, 9:50 PM hello, I have a 42 GB file on the local fs(call the machine A) which i need to copy to a HDFS (replicattion 1), according the HDFS webtracker it has 208GB across 7 machines. Note, the machine A has about 80 GB total, so there is no place to store copies of the file. Using the command bin/hadoop dfs -put /local/x /remote/tmp/ fails, with all blocks being bad. This is not surprising since the file is copied entirely to the HDFS region that resides on A. Had the file been copied across all machines, this would not have failed. I have more experience with mapreduce and not much with the hdfs side of things. Is there a configuration option i'm missing that forces the file to be split across the machines(when it is being copied)? -- Saptarshi Guha - saptarshi.g...@gmail.com -- Saptarshi Guha - saptarshi.g...@gmail.com