Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-03 Thread Tom White
Hi Brian,

Writes to HDFS are not guaranteed to be flushed until the file is
closed. In practice, as each (64MB) block is finished it is flushed
and will be visible to other readers, which is what you were seeing.

The addition of appends in HDFS changes this and adds a sync() method
to FSDataOutputStream. You can read about the semantics of the new
operations here:
https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc.
Unfortunately, there are some problems with sync() that are still
being worked through
(https://issues.apache.org/jira/browse/HADOOP-4379). Also, even with
sync() working, the append() on SequenceFile does not do an implicit
sync() - it is not atomic. Furthermore, there is no way to get hold of
the FSDataOutputStream to call sync() yourself - see
https://issues.apache.org/jira/browse/HBASE-1155. (And don't get
confused by the sync() method on SequenceFile.Writer - it is for
another purpose entirely.)

As Jason points out, the simplest way to achieve what you're trying to
do is to close the file and start a new one. If you start to get too
many small files, you can have another process merge the smaller
files in the background.
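
For reference, a rough sketch of that close-and-reopen approach (the class,
key/value types, and file-naming scheme here are made up for illustration;
the point is simply that only closed files are reliably visible to readers):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Rolls to a new SequenceFile every rollIntervalMs; each close() makes the
// finished part visible to readers (and to mapred jobs over the directory).
public class RollingSequenceFileWriter {
  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private final long rollIntervalMs;
  private SequenceFile.Writer writer;
  private long openedAt;

  public RollingSequenceFileWriter(Configuration conf, Path dir, long rollIntervalMs)
      throws IOException {
    this.conf = conf;
    this.fs = FileSystem.get(conf);
    this.dir = dir;
    this.rollIntervalMs = rollIntervalMs;
    roll();
  }

  private void roll() throws IOException {
    if (writer != null) {
      writer.close();   // closing flushes the data and makes the part visible
    }
    Path part = new Path(dir, "part-" + System.currentTimeMillis());
    writer = SequenceFile.createWriter(fs, conf, part, LongWritable.class, Text.class);
    openedAt = System.currentTimeMillis();
  }

  public synchronized void append(LongWritable key, Text value) throws IOException {
    if (System.currentTimeMillis() - openedAt >= rollIntervalMs) {
      roll();   // e.g. every 15 minutes
    }
    writer.append(key, value);
  }

  public synchronized void close() throws IOException {
    writer.close();
  }
}

A background job can then periodically concatenate the closed parts into a
single larger SequenceFile to keep the file (and block) count down.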

Tom

On Tue, Feb 3, 2009 at 3:57 AM, jason hadoop jason.had...@gmail.com wrote:
 If you have to do a time based solution, for now, simply close the file and
 stage it, then open a new file.
 Your reads will have to deal with the fact the file is in multiple parts.
 Warning: Datanodes get pokey if they have large numbers of blocks, and the
 quickest way to do this is to create a lot of small files.

 On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote:

 Let me rephrase this problem... as stated below, when I start writing to a
 SequenceFile from an HDFS client, nothing is visible in HDFS until I've
 written 64M of data. This presents three problems: fsck reports the file
 system as corrupt until the first block is finally written out, the presence
 of the file (without any data) seems to blow up my mapred jobs that try to
 make use of it under my input path, and finally, I want to basically flush
 every 15 minutes or so, so I can mapred the latest data.
 I don't see any programmatic way to force the file to flush in 0.17.2.
 Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that
 not do what I think it does? What controls the 64M limit, anyway? Is it
 dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the
 client, or on data nodes? Or in the namenode?

 It seems really bad that a SequenceFile, upon creation, is in an unusable
 state from the perspective of a mapred job, and also leaves fsck in a
 corrupt state. Surely I must be doing something wrong... but what? How can I
 ensure that a SequenceFile is immediately usable (but empty) on creation,
 and how can I make things flush on some regular time interval?

 Thanks,
 Brian


 On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote:

  I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
  and write to using append(key, value). Because the writer volume is low,
  it's not uncommon for it to take over a day for my appends to finally be
  flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
  Because I am running map/reduce tasks on this data multiple times a day, I
  want to flush the sequence file so the mapred jobs can pick it up when
  they run.
  What's the right way to do this? I'm assuming it's a fairly common use
  case. Also -- are writes to the sequence files atomic? (e.g. if I am
  actively appending to a sequence file, is it always safe to read from that
  same file in a mapred job?)
 
  To be clear, I want the flushing to be time based (controlled explicitly by
  the app), not size based. Will this create waste in HDFS somehow?
 
  Thanks,
  Brian
 
 




Re: hadoop to ftp files into hdfs

2009-02-03 Thread Tom White
NLineInputFormat is ideal for this purpose. Each split will be N lines
of input (where N is configurable), so each mapper can retrieve N
files for insertion into HDFS. You can set the number of reducers to
zero.
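
For illustration, a bare-bones driver along those lines (IdentityMapper is
just a stand-in for a mapper that would actually fetch each URL into HDFS;
the paths and the lines-per-map value are placeholders, and the property name
is the one I recall from the old mapred API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class FtpFetchDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FtpFetchDriver.class);
    conf.setJobName("ftp-fetch");

    // Each map task gets N lines (N ftp URLs) of the input file.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 10);

    // Stand-in mapper; replace with one that fetches each URL and writes it to HDFS.
    conf.setMapperClass(IdentityMapper.class);
    conf.setNumReduceTasks(0);   // map-only job

    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // file of ftp URLs, one per line
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}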

Tom

On Tue, Feb 3, 2009 at 4:23 AM, jason hadoop jason.had...@gmail.com wrote:
 If you have a large number of ftp urls spread across many sites, simply set
 that file to be your hadoop job input, and force the input split to be a
 size that gives you good distribution across your cluster.


 On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote:

 Does any one have a good suggestion on how to submit a hadoop job that
 will split the ftp retrieval of a number of files for insertion into
 hdfs?  I have been searching google for suggestions on this matter.
 Steve




Hadoop-KFS-FileSystem API

2009-02-03 Thread Wasim Bari
Hi,
I am looking to use KFS as storage with Hadoop FileSystem API.

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html

This page describes how to use KFS with Hadoop, and lists starting the
map/reduce tracker as the last step.

Is it necessary to turn it on?

How does using KFS for storage only, through the FileSystem API, work?

Thanks

Wasim

Re: problem with completion notification from block movement

2009-02-03 Thread Karl Kleinpaste
On Mon, 2009-02-02 at 20:06 -0800, jason hadoop wrote:
 This can be made significantly worse by your underlying host file
 system and the disks that support it.

Oh, yes, we know...  It was a late-realized mistake just yesterday that
we weren't using noatime on that cluster's slaves.

The attached graph is instructive.  We have our nightly-rotated logs for
DataNode all the way back to when this test cluster was created in
November.  This morning on one node, I sampled the first 10 BlockReport
scan lines from each day's log, up through the current hour today, and
handed it to gnuplot to graph.  The seriously erratic behavior that
begins around the 900K-1M point is very disturbing.

Immediate solutions for us include noatime, nodiratime, BIOS upgrade on
the discs, and eliminating enough small files (blocks) in DFS to get the
total count below 400K.


Is hadoop right for my problem

2009-02-03 Thread cdwillie76

I have an application I would like to apply hadoop to but I'm not sure if the
tasking is too small.  I have a file that contains between 70,000 - 400,000
records.  All the records can be processed in parallel and I can currently
process them at 400 records a second single threaded (give or take).  I
thought I read somewhere (one of the tutorials) that the mapper tasks should
run at least for a minute to offset the overhead in creating them.  Is this
really the case?  I am pretty sure that one record per mapper is
overkill, but I am wondering whether batching them up for the mapper is still
the way to go or if I should look at some other framework to help split up the
processing.

Any insight would be appreciated.

Thanks
Chris
-- 
View this message in context: 
http://www.nabble.com/Is-hadoop-right-for-my-problem-tp21811122p21811122.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Hadoop-KFS-FileSystem API

2009-02-03 Thread lohit
Hi Wasim,

Here is what you could do.
1. Deploy KFS
2. Build a hadoop-site.xml config file and set fs.default.name and other config 
variables to point to KFS as described by 
(http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html)
3. If you place this hadoop-site.xml in a directory, say ~/foo/myconf, then you 
could use hadoop commands as ./bin/hadoop --config ~/foo/myconf fs -ls 
/foo/bar.txt 
4. If you want to use the Hadoop FileSystem API, just put this directory as the 
first entry in your classpath, so that a new Configuration object loads this 
hadoop-site.xml and your FileSystem API calls talk to KFS.
5. Alternatively you could also create an object of KosmosFileSystem, which 
extends from FileSystem. Look at org.apache.hadoop.fs.kfs.KosmosFileSystem for 
example.
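
As a small sketch of steps 4/5, something like the following should work once
the KFS properties described on that page are set in the hadoop-site.xml on
your classpath (the metaserver host/port and the path below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class KfsListing {
  public static void main(String[] args) throws Exception {
    // Loads hadoop-site.xml (and hadoop-default.xml) from the classpath.
    Configuration conf = new Configuration();
    // Or set it explicitly; the host and port here are placeholders.
    conf.set("fs.default.name", "kfs://metaserver.example.com:20000");

    FileSystem fs = FileSystem.get(conf);   // resolves to the KFS implementation
    FileStatus[] listing = fs.listStatus(new Path("/foo"));
    if (listing != null) {
      for (FileStatus status : listing) {
        System.out.println(status.getPath());
      }
    }
  }
}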

Lohit



- Original Message 
From: Wasim Bari wasimb...@msn.com
To: core-user@hadoop.apache.org
Sent: Tuesday, February 3, 2009 3:03:51 AM
Subject: Hadoop-KFS-FileSystem API

Hi,
I am looking to use KFS as storage with Hadoop FileSystem API.

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/kfs/package-summary.html

This page describes how to use KFS with Hadoop, and lists starting the
map/reduce tracker as the last step.

Is it necessary to turn it on?

How does using KFS for storage only, through the FileSystem API, work?

Thanks

Wasim


HDD benchmark/checking tool

2009-02-03 Thread Dmitry Pushkarev
Dear hadoop users,

 

Recently I have had a number of drive failures that slowed down processes a
lot until they were discovered. Is there any easy way, or tool, to check
HDD performance and see if there are any IO errors?

Currently I wrote a simple script that looks at /var/log/messages and greps
everything abnormal for /dev/sdaX. But if you have a better solution I'd
appreciate it if you'd share it.

 

---

Dmitry Pushkarev

+1-650-644-8988

 



Re: Is hadoop right for my problem

2009-02-03 Thread Brian Bockelman

Hey Chris,

I think it would be appropriate.  Look at it this way: it takes 1 mapper
1 minute to process 24k records, so it should take about 17 mappers to
process all your tasks for the largest problem in one minute.

Even if you still think your problem is too small, consider:
1) The possibility of growth in your application.  Your processing
becomes future-proof - you have a pretty solid way to scale out as
your task grows.  Just add new machines -- you don't have to invest in
a small scale framework then rewrite in a year.
2) The benefits of having a framework do the heavy lifting.  There's a
surprising amount of "roll your own" that you end up doing when you
decide to break out of a single thread.  By framing your problem as a
map-reduce problem, you get to skip a lot of these steps and just
focus on solving your problem (also: beware that it's very sexy to
build your own MapReduce framework.  Anything which is very sexy
takes up more time and money than you think possible at the outset).


Brian

On Feb 3, 2009, at 8:34 AM, cdwillie76 wrote:



I have an application I would like to apply hadoop to but I'm not sure if
the tasking is too small.  I have a file that contains between 70,000 -
400,000 records.  All the records can be processed in parallel and I can
currently process them at 400 records a second single threaded (give or
take).  I thought I read somewhere (one of the tutorials) that the mapper
tasks should run at least for a minute to offset the overhead in creating
them.  Is this really the case?  I am pretty sure that one record per
mapper is overkill, but I am wondering whether batching them up for the
mapper is still the way to go or if I should look at some other framework
to help split up the processing.

Any insight would be appreciated.

Thanks
Chris
--
View this message in context: 
http://www.nabble.com/Is-hadoop-right-for-my-problem-tp21811122p21811122.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




Cascading support of HBase

2009-02-03 Thread Chris K Wensel

Hey all,

Just wanted to let everyone know the HBase Hackathon was a success  
(thanks Streamy!), and we now have Cascading Tap adapters for HBase.


You can find the link here (along with additional third-party  
extensions in the near future).

http://www.cascading.org/modules.html

Please feel free to give it a test, clone the repo, and submit patches  
back to me via GitHub.


The unit test shows how simple it is to import/export data to/from an  
HBase cluster.

http://bit.ly/fIpAE

For those not familiar with Cascading, it is an alternative processing  
API that is generally more natural to develop and think in than  
MapReduce.

http://www.cascading.org/

enjoy,

chris

p.s. If you have any code you want to contribute back, just stick it  
on GitHub and send me a link.


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



extreme nubbie need help setting up hadoop

2009-02-03 Thread bjday

Good afternoon all,

I work in tech and am an extreme nubbie at hadoop.  I could sure use some 
help.  I have a professor wanting hadoop installed on multiple Linux 
computers in a lab.  The computers are running CentOS 5.  I know I have 
something configured wrong and am not sure where to go.  I am following 
the instructions at 
http://www.cs.brandeis.edu/~cs147a/lab/hadoop-cluster/   I get to the 
part "Testing Your Hadoop Cluster" but when I use the command hadoop jar 
hadoop-*-examples.jar grep input output 'dfs[a-z.]+' it hangs.  Could 
anyone be kind enough to point me to a step by step installation and 
configuration website?


Thank you
Brian


Re: extreme nubbie need help setting up hadoop

2009-02-03 Thread Ravindra Phulari
Hello Brian,
  Here is the Hadoop project documentation link, which covers detailed Hadoop 
setup and running your first program on a single node as well as on multiple 
nodes.
   Also below are some more useful links to start understanding and using 
Hadoop.

http://hadoop.apache.org/core/docs/current/quickstart.html
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

 If you still have difficulties running Hadoop based programs please reply 
with error output so that experts can comment.
-
Ravi


On 2/3/09 10:08 AM, bjday bj...@cse.usf.edu wrote:

Good afternoon all,

I work in tech and am an extreme nubbie at hadoop.  I could sure use some
help.  I have a professor wanting hadoop installed on multiple Linux
computers in a lab.  The computers are running CentOS 5.  I know I have
something configured wrong and am not sure where to go.  I am following
the instructions at
http://www.cs.brandeis.edu/~cs147a/lab/hadoop-cluster/   I get to the
part "Testing Your Hadoop Cluster" but when I use the command hadoop jar
hadoop-*-examples.jar grep input output 'dfs[a-z.]+' it hangs.  Could
anyone be kind enough to point me to a step by step installation and
configuration website?

Thank you
Brian


Ravi Phulari
Yahoo! IM : ravescorp |Office Phone: (408)-336-0806 |
--



Re: Control over max map/reduce tasks per job

2009-02-03 Thread Chris K Wensel

Hey Jonathan

Are you looking to limit the total number of concurrent mappers/reducers a
single job can consume cluster wide, or limit the number per node?

That is, you have X mappers/reducers, but can only allow N mappers/reducers
to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want to
guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having  
trouble

keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have drastically
different patterns.  I have jobs that read/write to/from HBase or  
Hadoop

with minimal logic (network throughput bound or io bound), others that
perform crawling (network latency bound), and one huge parsing  
streaming job

(very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency  
bound
jobs, however the large CPU bound job means I have to keep the max  
maps
allowed per node low enough as to not starve the Datanode and  
Regionserver.




I'm an HBase dev but not familiar enough with Hadoop MR code to even  
know
what would be involved with implementing this.  However, in talking  
with

other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems like
someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



RE: Control over max map/reduce tasks per job

2009-02-03 Thread Jonathan Gray
Chris,

For my specific use cases, it would be best to be able to set N
mappers/reducers per job per node (so I can explicitly say, run at most 2 at
a time of this CPU bound task on any given node).  However, the other way
would work as well (on 10 node system, would set job to max 20 tasks at a
time globally), but opens up the possibility that a node could be assigned
more than 2 of that task.

I would work with whatever is easiest to implement as either would be a vast
improvement for me (can run high numbers of network latency bound tasks
without fear of cpu bound tasks killing the cluster).

JG



 -Original Message-
 From: Chris K Wensel [mailto:ch...@wensel.net]
 Sent: Tuesday, February 03, 2009 11:34 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Control over max map/reduce tasks per job
 
 Hey Jonathan
 
 Are you looking to limit the total number of concurrent mapper/
 reducers a single job can consume cluster wide, or limit the number
 per node?
 
 That is, you have X mappers/reducers, but only can allow N mappers/
 reducers to run at a time globally, for a given job.
 
 Or, you are cool with all X running concurrently globally, but want to
 guarantee that no node can run more than N tasks from that job?
 
 Or both?
 
 just reconciling the conversation we had last week with this thread.
 
 ckw
 
 On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:
 
  All,
 
 
 
  I have a few relatively small clusters (5-20 nodes) and am having
  trouble
  keeping them loaded with my MR jobs.
 
 
 
  The primary issue is that I have different jobs that have drastically
  different patterns.  I have jobs that read/write to/from HBase or
  Hadoop
  with minimal logic (network throughput bound or io bound), others
 that
  perform crawling (network latency bound), and one huge parsing
  streaming job
  (very CPU bound, each task eats a core).
 
 
 
  I'd like to launch very large numbers of tasks for network latency
  bound
  jobs, however the large CPU bound job means I have to keep the max
  maps
  allowed per node low enough as to not starve the Datanode and
  Regionserver.
 
 
 
  I'm an HBase dev but not familiar enough with Hadoop MR code to even
  know
  what would be involved with implementing this.  However, in talking
  with
  other users, it seems like this would be a well-received option.
 
 
 
  I wanted to ping the list before filing an issue because it seems
 like
  someone may have thought about this in the past.
 
 
 
  Thanks.
 
 
 
  Jonathan Gray
 
 
 --
 Chris K Wensel
 ch...@wensel.net
 http://www.cascading.org/
 http://www.scaleunlimited.com/



Re: Control over max map/reduce tasks per job

2009-02-03 Thread jason hadoop
An alternative is to have 2 Tasktracker clusters, where the nodes are on the
same machines.
One cluster is for IO intensive jobs and has a low number of map/reduces per
tracker,
the other cluster is for cpu intensive jobs and has a high number of
map/reduces per tracker.

The alternative, simpler method is to use a multi-threaded mapper on the cpu
intensive jobs, where you tune the thread count on a per job basis.
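
A rough sketch of that multi-threaded mapper setup with the old mapred API
(the mapper here is only a stand-in and must be thread-safe; double-check the
thread-count property name against your hadoop-default.xml):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSetup {
  // Runs each map task's records through a pool of threads, so one
  // CPU-bound map slot can keep several cores busy.
  public static JobConf configure(int threadsPerMapTask) {
    JobConf conf = new JobConf(MultithreadedJobSetup.class);
    conf.setMapperClass(IdentityMapper.class);   // stand-in; use your CPU-bound, thread-safe mapper
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Threads per map task; tune this on a per-job basis.
    conf.setInt("mapred.map.multithreadedrunner.threads", threadsPerMapTask);
    return conf;
  }
}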

In the longer term being able to alter the Tasktracker control parameters at
run time on a per job basis would be wonderful.

On Tue, Feb 3, 2009 at 12:01 PM, Nathan Marz nat...@rapleaf.com wrote:

 This is a great idea. For me, this is related to:
 https://issues.apache.org/jira/browse/HADOOP-5160. Being able to set the
 number of tasks per machine on a job by job basis would allow me to solve my
 problem in a different way. Looking at the Hadoop source, it's also probably
 simpler than changing how Hadoop schedules tasks.





 On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:

  Chris,

 For my specific use cases, it would be best to be able to set N
 mappers/reducers per job per node (so I can explicitly say, run at most 2
 at
 a time of this CPU bound task on any given node).  However, the other way
 would work as well (on 10 node system, would set job to max 20 tasks at a
 time globally), but opens up the possibility that a node could be assigned
 more than 2 of that task.

 I would work with whatever is easiest to implement as either would be a
 vast
 improvement for me (can run high numbers of network latency bound tasks
 without fear of cpu bound tasks killing the cluster).

 JG



  -Original Message-
 From: Chris K Wensel [mailto:ch...@wensel.net]
 Sent: Tuesday, February 03, 2009 11:34 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Control over max map/reduce tasks per job

 Hey Jonathan

 Are you looking to limit the total number of concurrent mapper/
 reducers a single job can consume cluster wide, or limit the number
 per node?

 That is, you have X mappers/reducers, but only can allow N mappers/
 reducers to run at a time globally, for a given job.

 Or, you are cool with all X running concurrently globally, but want to
 guarantee that no node can run more than N tasks from that job?

 Or both?

 just reconciling the conversation we had last week with this thread.

 ckw

 On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:

  All,



 I have a few relatively small clusters (5-20 nodes) and am having
 trouble
 keeping them loaded with my MR jobs.



 The primary issue is that I have different jobs that have drastically
 different patterns.  I have jobs that read/write to/from HBase or
 Hadoop
 with minimal logic (network throughput bound or io bound), others

 that

 perform crawling (network latency bound), and one huge parsing
 streaming job
 (very CPU bound, each task eats a core).



 I'd like to launch very large numbers of tasks for network latency
 bound
 jobs, however the large CPU bound job means I have to keep the max
 maps
 allowed per node low enough as to not starve the Datanode and
 Regionserver.



 I'm an HBase dev but not familiar enough with Hadoop MR code to even
 know
 what would be involved with implementing this.  However, in talking
 with
 other users, it seems like this would be a well-received option.



 I wanted to ping the list before filing an issue because it seems

 like

 someone may have thought about this in the past.



 Thanks.



 Jonathan Gray


 --
 Chris K Wensel
 ch...@wensel.net
 http://www.cascading.org/
 http://www.scaleunlimited.com/






Re: Control over max map/reduce tasks per job

2009-02-03 Thread Nathan Marz
Another use case for per-job task limits is being able to use every  
core in the cluster on a map-only job.




On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:


Chris,

For my specific use cases, it would be best to be able to set N
mappers/reducers per job per node (so I can explicitly say, run at  
most 2 at
a time of this CPU bound task on any given node).  However, the  
other way
would work as well (on 10 node system, would set job to max 20 tasks  
at a
time globally), but opens up the possibility that a node could be  
assigned

more than 2 of that task.

I would work with whatever is easiest to implement as either would  
be a vast
improvement for me (can run high numbers of network latency bound  
tasks

without fear of cpu bound tasks killing the cluster).

JG




-Original Message-
From: Chris K Wensel [mailto:ch...@wensel.net]
Sent: Tuesday, February 03, 2009 11:34 AM
To: core-user@hadoop.apache.org
Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan

Are you looking to limit the total number of concurrent mapper/
reducers a single job can consume cluster wide, or limit the number
per node?

That is, you have X mappers/reducers, but only can allow N mappers/
reducers to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want  
to

guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having
trouble
keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have  
drastically

different patterns.  I have jobs that read/write to/from HBase or
Hadoop
with minimal logic (network throughput bound or io bound), others

that

perform crawling (network latency bound), and one huge parsing
streaming job
(very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency
bound
jobs, however the large CPU bound job means I have to keep the max
maps
allowed per node low enough as to not starve the Datanode and
Regionserver.



I'm an HBase dev but not familiar enough with Hadoop MR code to even
know
what would be involved with implementing this.  However, in talking
with
other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems

like

someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/






Re: FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
Hmm.  Based on your reasons, an extension to FileInputFormat for the  
lib package seems more in order.


I'll try to hack something up and file a Jira issue.

Ian

On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote:


Hi, Ian.

One reason is that a MapFile is represented by a directory
containing two files named "index" and "data".
SequenceFileInputFormat handles MapFiles too: if an input file is
a directory containing a data file, it uses that file.


Another reason is that's what reduces generate.

Neither reason implies that this is the best or only way of doing  
things.  It would probably be better if FileInputFormat optionally  
supported recursive file enumeration.  (It would be incompatible and  
thus cannot be the default mode.)


Please file an issue in Jira for this and attach your patch.

Thanks,

Doug

Ian Soboroff wrote:
Is there a reason FileInputFormat only traverses the first level of  
directories in its InputPaths?  (i.e., given an InputPath of 'foo',  
it will get foo/* but not foo/bar/*).
I wrote a full depth-first traversal in my custom InputFormat which  
I can offer as a patch.  But to do it I had to duplicate the  
PathFilter classes in FileInputFormat which are marked private, so  
a mainline patch would also touch FileInputFormat.

Ian




Re: Control over max map/reduce tasks per job

2009-02-03 Thread Nathan Marz
This is a great idea. For me, this is related to: https://issues.apache.org/jira/browse/HADOOP-5160 
. Being able to set the number of tasks per machine on a job by job  
basis would allow me to solve my problem in a different way. Looking  
at the Hadoop source, it's also probably simpler than changing how  
Hadoop schedules tasks.





On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:


Chris,

For my specific use cases, it would be best to be able to set N
mappers/reducers per job per node (so I can explicitly say, run at  
most 2 at
a time of this CPU bound task on any given node).  However, the  
other way
would work as well (on 10 node system, would set job to max 20 tasks  
at a
time globally), but opens up the possibility that a node could be  
assigned

more than 2 of that task.

I would work with whatever is easiest to implement as either would  
be a vast
improvement for me (can run high numbers of network latency bound  
tasks

without fear of cpu bound tasks killing the cluster).

JG




-Original Message-
From: Chris K Wensel [mailto:ch...@wensel.net]
Sent: Tuesday, February 03, 2009 11:34 AM
To: core-user@hadoop.apache.org
Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan

Are you looking to limit the total number of concurrent mapper/
reducers a single job can consume cluster wide, or limit the number
per node?

That is, you have X mappers/reducers, but only can allow N mappers/
reducers to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want  
to

guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having
trouble
keeping them loaded with my MR jobs.



The primary issue is that I have different jobs that have  
drastically

different patterns.  I have jobs that read/write to/from HBase or
Hadoop
with minimal logic (network throughput bound or io bound), others

that

perform crawling (network latency bound), and one huge parsing
streaming job
(very CPU bound, each task eats a core).



I'd like to launch very large numbers of tasks for network latency
bound
jobs, however the large CPU bound job means I have to keep the max
maps
allowed per node low enough as to not starve the Datanode and
Regionserver.



I'm an HBase dev but not familiar enough with Hadoop MR code to even
know
what would be involved with implementing this.  However, in talking
with
other users, it seems like this would be a well-received option.



I wanted to ping the list before filing an issue because it seems

like

someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/






hadoop dfs -test question (with a bit o' Ruby)

2009-02-03 Thread S D
I'm at my wit's end. I want to do a simple test for the existence of a file
on Hadoop. Here is the Ruby code I'm trying:

val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt`
puts "Val: #{val}"
if val == 1
  # do one thing
else
  # do another
end

I never get  a return value for the -test command. Is there something I'm
missing? How should the return value be retrieved?

Thanks,
SD


Re: hadoop dfs -test question (with a bit o' Ruby)

2009-02-03 Thread S D
Coincidentally I'm aware of the AWS::S3 package in Ruby but I'd prefer to
avoid that...

On Tue, Feb 3, 2009 at 5:02 PM, S D sd.codewarr...@gmail.com wrote:

 I'm at my wit's end. I want to do a simple test for the existence of a file
 on Hadoop. Here is the Ruby code I'm trying:

 val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt`
 puts Val: #{val}
 if val == 1
   // do one thing
 else
   // do another
 end

 I never get  a return value for the -test command. Is there something I'm
 missing? How should the return value be retrieved?

 Thanks,
 SD



Unable to pull data from DB in Hadoop job

2009-02-03 Thread Amandeep Khurana
I am trying to pull data out from an oracle database using a hadoop job and
am getting the following error which I am unable to debug.

09/02/03 15:32:51 INFO mapred.JobClient: Task Id :
attempt_200902012320_0022_m_01_2, Status : FAILED
java.io.IOException: ORA-00933: SQL command not properly ended

at
org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at LoadTable1.run(LoadTable1.java:130)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at LoadTable1.main(LoadTable1.java:107)

Can anyone give me pointers on this?

Thanks
Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: FileInputFormat directory traversal

2009-02-03 Thread Doug Cutting

Hi, Ian.

One reason is that a MapFile is represented by a directory containing 
two files named "index" and "data".  SequenceFileInputFormat handles 
MapFiles too: if an input file is a directory containing a data file, 
it uses that file.


Another reason is that's what reduces generate.

Neither reason implies that this is the best or only way of doing 
things.  It would probably be better if FileInputFormat optionally 
supported recursive file enumeration.  (It would be incompatible and 
thus cannot be the default mode.)


Please file an issue in Jira for this and attach your patch.

Thanks,

Doug

Ian Soboroff wrote:
Is there a reason FileInputFormat only traverses the first level of 
directories in its InputPaths?  (i.e., given an InputPath of 'foo', it 
will get foo/* but not foo/bar/*).


I wrote a full depth-first traversal in my custom InputFormat which I 
can offer as a patch.  But to do it I had to duplicate the PathFilter 
classes in FileInputFormat which are marked private, so a mainline patch 
would also touch FileInputFormat.


Ian



Re: HDD benchmark/checking tool

2009-02-03 Thread Aaron Kimball
Dmitry,

Look into cluster/system monitoring tools: nagios and ganglia are two to
start with.
- Aaron

On Tue, Feb 3, 2009 at 9:53 AM, Dmitry Pushkarev u...@stanford.edu wrote:

 Dear hadoop users,



 Recently I have had a number of drive failures that slowed down processes a
 lot until they were discovered. Is there any easy way, or tool, to check
 HDD performance and see if there are any IO errors?

 Currently I wrote a simple script that looks at /var/log/messages and greps
 everything abnormal for /dev/sdaX. But if you have better solution I'd
 appreciate if you share it.



 ---

 Dmitry Pushkarev

 +1-650-644-8988






Re: HDD benchmark/checking tool

2009-02-03 Thread Brian Bockelman
Also, you want to look at combining SMART hard drive monitoring (most
drives support SMART at this point) with Nagios.

It often lets us know when a hard drive is about to fail *and* when
the drive is under-performing.


Brian

On Feb 3, 2009, at 6:18 PM, Aaron Kimball wrote:


Dmitry,

Look into cluster/system monitoring tools: nagios and ganglia are  
two to

start with.
- Aaron

On Tue, Feb 3, 2009 at 9:53 AM, Dmitry Pushkarev u...@stanford.edu  
wrote:



Dear hadoop users,



Recently I have had a number of drive failures that slowed down  
processes a
lot until they were discovered. It is there any easy way or tool,  
to check

HDD performance and see if there any IO errors?

Currently I wrote a simple script that looks at /var/log/messages  
and greps
everything abnormal for /dev/sdaX. But if you have better solution  
I'd

appreciate if you share it.



---

Dmitry Pushkarev

+1-650-644-8988








Re: Hadoop's reduce tasks are freezes at 0%.

2009-02-03 Thread Aaron Kimball
Alternatively, maybe all the TaskTracker nodes can contact the NameNode and
JobTracker, but cannot communicate with one another due to firewall issues?
- Aaron

On Mon, Feb 2, 2009 at 7:54 PM, jason hadoop jason.had...@gmail.com wrote:

 A reduce stall at 0% implies that the map tasks are not outputting any
 records via the output collector.
 You need to go look at the task tracker and the task logs on all of your
 slave machines, to see if anything that seems odd appears in the logs.
  On the tasktracker web interface detail screen for your job:
  Are all of the map tasks finished?
  Are any of the map tasks started?
  Are there any Tasktracker nodes to service your job?

 On Sun, Feb 1, 2009 at 11:41 PM, Kwang-Min Choi kmbest.c...@samsung.com
 wrote:

  I'm a newbie in Hadoop,
  and I'm trying to follow the Hadoop Quick Start guide on the hadoop homepage,
  but there are some problems...
  Downloading and unzipping hadoop is done,
  and ssh successfully operates without a password phrase.
 
  Once I execute the grep example attached to Hadoop,
  the map task is ok; it reaches 100%.
  But the reduce task freezes at 0% without any error message.
  I've waited for it for more than 1 hour, but it still freezes...
  The same job in standalone mode works fine...
 
  I tried it with versions 0.18.3 and 0.17.2.1;
  all of them had the same problem.
 
  Could you help me to solve this problem?
 
  Additionally...
  I'm working on cloud infra of GoGrid (Redhat),
  so disk space & health are OK,
  and I've installed JDK 1.6.11 for linux successfully.
 
 
  - KKwams
 



Re: extra documentation on how to write your own partitioner class

2009-02-03 Thread Aaron Kimball
er?

It seems to be using value.get(). That having been said, you should really
partition based on key, not on value. (I am not sure why, exactly, the value
is provided to the getPartition() method.)


Moreover, I think the problem is that you are using division ( / ) not
modulus ( % ).  Your code simplifies to:   (value.get() / T) / (T /
numPartitions) = value.get() * numPartitions / T^2.

The contract of getPartition() is that it returns a value in [0,
numPartitions). The division operators are not guaranteed to return anything
in this range, but (foo % numPartitions) will always do the right thing. So
it's probably just  assigning everything to reduce partition 0.
(Alternatively, it could be that value * numPartitions < T^2 for any values
of T you're testing with, which means that integer division will return 0.)
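
For illustration, a rough sketch of a partitioner along those lines --
partitioning on the key and folding the result through a modulus so it
always lands in [0, numReduceTasks) -- not a drop-in fix, just the shape of
it (the "tval" parameter follows the original code):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner extends MapReduceBase
    implements Partitioner<IntWritable, IntWritable> {

  private int T;   // upper bound on the key range, supplied via the "tval" job property

  public void configure(JobConf job) {
    super.configure(job);
    T = Integer.parseInt(job.get("tval"));
  }

  public int getPartition(IntWritable key, IntWritable value, int numReduceTasks) {
    // Keys in [0, T) map to contiguous buckets; the modulus guarantees the
    // result is always a legal partition number, even for out-of-range keys.
    int bucket = (int) (((long) key.get() * numReduceTasks) / T);
    return Math.abs(bucket) % numReduceTasks;
  }
}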

- Aaron


On Fri, Jan 30, 2009 at 3:43 PM, Sandy snickerdoodl...@gmail.com wrote:

 Hi James,

 Thank you very much! :-)

 -SM

 On Fri, Jan 30, 2009 at 4:17 PM, james warren ja...@rockyou.com wrote:

  Hello Sandy -
  Your partitioner isn't using any information from the key/value pair -
 it's
  only using the value T which is read once from the job configuration.
   getPartition() will always return the same value, so all of your data is
  being sent to one reducer. :P
 
  cheers,
  -James
 
  On Fri, Jan 30, 2009 at 1:32 PM, Sandy snickerdoodl...@gmail.com
 wrote:
 
   Hello,
  
   Could someone point me toward some more documentation on how to write
  one's
   own partition class? I have having quite a bit of trouble getting mine
 to
   work. So far, it looks something like this:
  
   public class myPartitioner extends MapReduceBase implements
   Partitioner<IntWritable, IntWritable> {
  
  private int T;
  
  public void configure(JobConf job) {
  super.configure(job);
  String myT = job.get("tval"); //this is user defined
  T = Integer.parseInt(myT);
  }
  
  public int getPartition(IntWritable key, IntWritable value, int
   numReduceTasks) {
  int newT = (T/numReduceTasks);
  int id = (value.get() / T);
  return (int)(id/newT);
  }
   }
  
   In the run() function of my M/R program I just set it using:
  
   conf.setPartitionerClass(myPartitioner.class);
  
   Is there anything else I need to set in the run() function?
  
  
   The code compiles fine. When I run it, I know it is using the
   partitioner,
   since I get different output than if I just let it use HashPartitioner.
   However, it is not splitting between the reducers at all! If I set the
   number of reducers to 2, all the output shows up in part-0, while
   part-1 has nothing.
  
   I am having trouble debugging this since I don't know how I can observe
  the
   values of numReduceTasks (which I assume is being set by the system).
 Is
   this a proper assumption?
  
   If I try to insert any println() statements in the function, it isn't
   outputted to either my terminal or my log files. Could someone give me
  some
   general advice on how best to debug pieces of code like this?
  
 



Hadoop FS Shell - command overwrite capability

2009-02-03 Thread S D
I'm using the Hadoop FS commands to move files from my local machine into
the Hadoop dfs. I'd like a way to force a write to the dfs even if a file of
the same name exists. Ideally I'd like to use a -force switch or some
such; e.g.,
hadoop dfs -copyFromLocal -force adirectory s3n://wholeinthebucket/

Is there a way to do this or does anyone know if this is in the future
Hadoop plans?

Thanks
John SD


Control over max map/reduce tasks per job

2009-02-03 Thread Jonathan Gray
All,

 

I have a few relatively small clusters (5-20 nodes) and am having trouble
keeping them loaded with my MR jobs.

 

The primary issue is that I have different jobs that have drastically
different patterns.  I have jobs that read/write to/from HBase or Hadoop
with minimal logic (network throughput bound or io bound), others that
perform crawling (network latency bound), and one huge parsing streaming job
(very CPU bound, each task eats a core).

 

I'd like to launch very large numbers of tasks for network latency bound
jobs, however the large CPU bound job means I have to keep the max maps
allowed per node low enough as to not starve the Datanode and Regionserver.

 

I'm an HBase dev but not familiar enough with Hadoop MR code to even know
what would be involved with implementing this.  However, in talking with
other users, it seems like this would be a well-received option.

 

I wanted to ping the list before filing an issue because it seems like
someone may have thought about this in the past.

 

Thanks.

 

Jonathan Gray



FileInputFormat directory traversal

2009-02-03 Thread Ian Soboroff
Is there a reason FileInputFormat only traverses the first level of  
directories in its InputPaths?  (i.e., given an InputPath of 'foo', it  
will get foo/* but not foo/bar/*).


I wrote a full depth-first traversal in my custom InputFormat which I  
can offer as a patch.  But to do it I had to duplicate the PathFilter  
classes in FileInputFormat which are marked private, so a mainline  
patch would also touch FileInputFormat.
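
(For context, the traversal amounts to something like the sketch below
against the public FileSystem API -- an illustration only, not the actual
patch.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveLister {
  // Depth-first enumeration of all regular files under the given roots.
  public static List<Path> listRecursively(FileSystem fs, Path... roots) throws IOException {
    List<Path> files = new ArrayList<Path>();
    for (Path root : roots) {
      collect(fs, root, files);
    }
    return files;
  }

  private static void collect(FileSystem fs, Path path, List<Path> files) throws IOException {
    FileStatus[] statuses = fs.listStatus(path);
    if (statuses == null) {
      return;   // path does not exist
    }
    for (FileStatus status : statuses) {
      if (status.isDir()) {
        collect(fs, status.getPath(), files);   // recurse into subdirectories
      } else {
        files.add(status.getPath());
      }
    }
  }
}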


Ian



Chukwa documentation

2009-02-03 Thread xavier.quintuna
Hi Everybody,

I don't know if there is a mailing list for Chukwa, so I apologize in
advance if this is not the right place to ask my questions.

I have the following questions and comments:

Configuring the collector and the agent was simple.
However, there are other features that are not documented at all, like:
- torque (Do I have to install torque first? Yes? No? And why?)
- database (Do I have to have a DB?)
- queueinfo.properties (What is it, and what kind of information does it give me?)
- and there is more stuff that I need to dig into the code to understand.
Could somebody update the documentation for Chukwa?


I hope I can get some help

Xavier


How to use DBInputFormat?

2009-02-03 Thread Amandeep Khurana
In the setInput(...) function in DBInputFormat, there are two sets of
arguments that one can use.

1. public static void setInput(JobConf job,
       Class<? extends DBWritable> inputClass,
       String tableName, String conditions,
       String orderBy, String... fieldNames)

a) In this, do we necessarily have to give all the fieldNames (which are the
column names right?) that the table has, or do we need to specify only the
ones that we want to extract?
b) Do we have to have an orderBy, or is it optional? Does this relate to the
primary key in the table in any way?

2. public static void setInput(JobConf job,
       Class<? extends DBWritable> inputClass,
       String inputQuery, String inputCountQuery)

a) Is there any restriction on the kind of queries that this function
can take in the inputQuery string?

I am facing issues in getting this to work with an Oracle database and
have no idea of how to debug it (see the email I sent earlier).
Can anyone give me some input on this please?

Thanks
-Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


HADOOP-2536 supports Oracle too?

2009-02-03 Thread Amandeep Khurana
Does the patch HADOOP-2536 support connecting to Oracle databases as well?
Or is it just limited to MySQL?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: HADOOP-2536 supports Oracle too?

2009-02-03 Thread Aaron Kimball
HADOOP-2536 supports the JDBC connector interface. Any database that exposes
a JDBC library will work
- Aaron

On Tue, Feb 3, 2009 at 6:17 PM, Amandeep Khurana ama...@gmail.com wrote:

 Does the patch HADOOP-2536 support connecting to Oracle databases as well?
 Or is it just limited to MySQL?

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz



Re: How to use DBInputFormat?

2009-02-03 Thread Kevin Peterson
On Tue, Feb 3, 2009 at 5:49 PM, Amandeep Khurana ama...@gmail.com wrote:

 In the setInput(...) function in DBInputFormat, there are two sets of
 arguments that one can use.

 1. public static void *setInput*(JobConf

 a) In this, do we necessarily have to give all the fieldNames (which are
 the
 column names right?) that the table has, or do we need to specify only the
 ones that we want to extract?


You may specify only those columns that you are interested in.

b) Do we have to have a orderBy or not necessarily? Does this relate to the
 primary key in the table in any ways?


Conditions and order by are not necessary.

a) Is there any restriction on the kind of queries that this function
 can take in the inputQuery string?


I don't think so, but I don't use this method -- I just use the fieldNames
and tableName method.


 I am facing issues in getting this to work with an Oracle database and
 have no idea of how to debug it (an email sent earlier).
 Can anyone give me some inputs on this please?


Create a new table that has one column, put about five entries into that
table, then try to get a map job working that outputs the values to a text
file. If that doesn't work, post your code and errors.
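
For reference, a minimal sketch of the tableName/fieldNames style of setup
(driver, URL, credentials, table and column names are all placeholders, and
MyRecord is a made-up DBWritable):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbJobSetup {

  // One row of the (placeholder) table; only the single column we ask for.
  public static class MyRecord implements Writable, DBWritable {
    String colA;
    public void readFields(DataInput in) throws IOException { colA = Text.readString(in); }
    public void write(DataOutput out) throws IOException { Text.writeString(out, colA); }
    public void readFields(ResultSet rs) throws SQLException { colA = rs.getString(1); }
    public void write(PreparedStatement ps) throws SQLException { }
  }

  public static void configure(JobConf conf) {
    conf.setInputFormat(DBInputFormat.class);
    // Driver class, connection URL, user and password are placeholders.
    DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
        "jdbc:oracle:thin:@dbhost:1521:SID", "user", "pass");
    // Only the listed column is selected; conditions and orderBy may be null.
    DBInputFormat.setInput(conf, MyRecord.class, "MY_TABLE",
        null /* conditions */, null /* orderBy */, "COL_A");
  }
}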


Re: How to use DBInputFormat?

2009-02-03 Thread Amandeep Khurana
Thanks Kevin

I couldn't get it to work. Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended

at
org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
at LoadTable1.run(LoadTable1.java:130)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at LoadTable1.main(LoadTable1.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Exception closing file
/user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
at
org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
at
org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
at
org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
at
org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)


Here's my code:

public class LoadTable1 extends Configured implements Tool  {

  // data destination on hdfs
  private static final String CONTRACT_OUTPUT_PATH = "contract_table";

  // The JDBC connection URL and driver implementation class

  private static final String CONNECT_URL =
      "jdbc:oracle:thin:@dbhost:1521:PSEDEV";
  private static final String DB_USER = "user";
  private static final String DB_PWD = "pass";
  private static final String DATABASE_DRIVER_CLASS =
      "oracle.jdbc.driver.OracleDriver";

  private static final String CONTRACT_INPUT_TABLE = "OSE_EPR_CONTRACT";

  private static final String [] CONTRACT_INPUT_TABLE_FIELDS = {
      "PORTFOLIO_NUMBER", "CONTRACT_NUMBER" };

  private static final String ORDER_CONTRACT_BY_COL = "CONTRACT_NUMBER";


static class ose_epr_contract implements Writable, DBWritable {


String CONTRACT_NUMBER;


public void readFields(DataInput in) throws IOException {

this.CONTRACT_NUMBER = Text.readString(in);

}

public void write(DataOutput out) throws IOException {

Text.writeString(out, this.CONTRACT_NUMBER);


}

public void readFields(ResultSet in_set) throws SQLException {

this.CONTRACT_NUMBER = in_set.getString(1);

}

@Override
public void write(PreparedStatement prep_st) throws SQLException {
// TODO Auto-generated method stub

}

}

public static class LoadMapper extends MapReduceBase
implements Mapper<LongWritable,
ose_epr_contract, Text, NullWritable> {
private static final char FIELD_SEPARATOR = 1;

public void map(LongWritable arg0, ose_epr_contract arg1,
OutputCollector<Text, NullWritable> arg2, Reporter arg3)
throws IOException {

StringBuilder sb = new StringBuilder();
;
sb.append(arg1.CONTRACT_NUMBER);


arg2.collect(new Text (sb.toString()), NullWritable.get());

}

}


public static void main(String[] args) throws Exception {
Class.forName("oracle.jdbc.driver.OracleDriver");
int exit = ToolRunner.run(new LoadTable1(), args);

}

public int run(String[] arg0) throws Exception {
JobConf conf = new 

Re: How to use DBInputFormat?

2009-02-03 Thread Amandeep Khurana
The same query is working if I write a simple JDBC client and query the
database. So, I'm probably doing something wrong in the connection settings.
But the error looks to be on the query side more than the connection side.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote:

 Thanks Kevin

 I couldnt get it work. Here's the error I get:

 bin/hadoop jar ~/dbload.jar LoadTable1
 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=JobTracker, sessionId=
 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
 java.io.IOException: ORA-00933: SQL command not properly ended

 at
 org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at LoadTable1.run(LoadTable1.java:130)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at LoadTable1.main(LoadTable1.java:107)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 Exception closing file
 /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
 at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
 at
 org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
 at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
 at
 org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
 at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
 at
 org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)


 Here's my code:

 public class LoadTable1 extends Configured implements Tool  {

   // data destination on hdfs
   private static final String CONTRACT_OUTPUT_PATH = contract_table;

   // The JDBC connection URL and driver implementation class

 private static final String CONNECT_URL = jdbc:oracle:thin:@dbhost
 :1521:PSEDEV;
   private static final String DB_USER = user;
   private static final String DB_PWD = pass;
   private static final String DATABASE_DRIVER_CLASS =
 oracle.jdbc.driver.OracleDriver;

   private static final String CONTRACT_INPUT_TABLE =
 OSE_EPR_CONTRACT;

   private static final String [] CONTRACT_INPUT_TABLE_FIELDS = {
 PORTFOLIO_NUMBER, CONTRACT_NUMBER};

   private static final String ORDER_CONTRACT_BY_COL =
 CONTRACT_NUMBER;


 static class ose_epr_contract implements Writable, DBWritable {


 String CONTRACT_NUMBER;


 public void readFields(DataInput in) throws IOException {

 this.CONTRACT_NUMBER = Text.readString(in);

 }

 public void write(DataOutput out) throws IOException {

 Text.writeString(out, this.CONTRACT_NUMBER);


 }

 public void readFields(ResultSet in_set) throws SQLException {

 this.CONTRACT_NUMBER = in_set.getString(1);

 }

 @Override
 public void write(PreparedStatement prep_st) throws SQLException {
 // TODO Auto-generated method stub

 }

 }

 public static class LoadMapper extends MapReduceBase
 implements MapperLongWritable,
 ose_epr_contract, Text, NullWritable {
 private static final char FIELD_SEPARATOR = 1;

 public void map(LongWritable arg0, ose_epr_contract arg1,
 OutputCollectorText, NullWritable arg2, Reporter arg3)
 

Value-Only Reduce Output

2009-02-03 Thread Jack Stahl
Hello,

I'm interested in a map-reduce flow where I output only values (no keys) in
my reduce step.  For example, imagine the canonical word-counting program
where I'd like my output to be an unlabeled histogram of counts instead of
(word, count) pairs.

I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
my python scripts).  When I simulate the map reduce using pipes and sort in
bash, it works fine.   However, in Hadoop, if I output a value with no tabs,
Hadoop appends a trailing \t, apparently interpreting my output as a
(value, "") KV pair.  I'd like to avoid outputting this trailing tab if
possible.

Is there a command line option that could be used to effect this?  More
generally, is there something wrong with outputting arbitrary strings,
instead of key-value pairs, in your reduce step?


Re: hadoop dfs -test question (with a bit o' Ruby)

2009-02-03 Thread Norbert Burger
I'm no Ruby programmer, but don't you need a call to system() instead of the
backtick operator here?  Appears that the backtick operator returns STDOUT
instead of the return value:

http://hans.fugal.net/blog/2007/11/03/backticks-2-0

Norbert

On Tue, Feb 3, 2009 at 6:03 PM, S D sd.codewarr...@gmail.com wrote:

 Coincidentally I'm aware of the AWS::S3 package in Ruby but I'd prefer to
 avoid that...

 On Tue, Feb 3, 2009 at 5:02 PM, S D sd.codewarr...@gmail.com wrote:

  I'm at my wit's end. I want to do a simple test for the existence of a
 file
  on Hadoop. Here is the Ruby code I'm trying:
 
  val = `hadoop dfs -test -e s3n://holeinthebucket/user/hadoop/file.txt`
  puts "Val: #{val}"
  if val == 1
    # do one thing
  else
    # do another
  end
 
  I never get a return value for the -test command. Is there something I'm
  missing? How should the return value be retrieved?
 
  Thanks,
  SD
 


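If dropping down to the Java API is an option, the existence check can also
be done in-process with FileSystem.exists rather than by shelling out and
parsing the result. A minimal sketch, assuming the S3N credentials
(fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey) are already present in
the loaded Hadoop configuration:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ExistsCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // picks up hadoop-site.xml
      Path p = new Path("s3n://holeinthebucket/user/hadoop/file.txt");
      FileSystem fs = FileSystem.get(URI.create(p.toString()), conf);
      // exists() returns a boolean directly; nothing to parse from stdout.
      System.out.println(fs.exists(p) ? "exists" : "missing");
    }
  }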

Re: Value-Only Reduce Output

2009-02-03 Thread jason hadoop
If you are using the standard TextOutputFormat, and the output collector is
passed a null for the value, there will not be a trailing tab character
added to the output line.

output.collect( key, null );
Will give you the behavior you are looking for if your configuration is as I
expect.
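
A minimal sketch of the approach jason describes, for a plain Java
(non-streaming) job; the reducer below is illustrative, not code from this
thread. With TextOutputFormat, a null (or NullWritable) value means no
separator and no trailing tab is written after the key:

  // Uses the 0.19 mapred API: org.apache.hadoop.mapred.* and
  // org.apache.hadoop.io.*, plus java.util.Iterator.
  public static class ValueOnlyReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, NullWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
        OutputCollector<Text, NullWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      // Emit only the count; the null value suppresses the tab separator.
      output.collect(new Text(Integer.toString(sum)), null);
    }
  }

The job would also declare NullWritable as its output value class, e.g.
job.setOutputValueClass(NullWritable.class).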

On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl j...@yelp.com wrote:

 Hello,

 I'm interested in a map-reduce flow where I output only values (no keys) in
 my reduce step.  For example, imagine the canonical word-counting program
 where I'd like my output to be an unlabeled histogram of counts instead of
 (word, count) pairs.

 I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
 my python scripts).  When I simulate the map reduce using pipes and sort in
 bash, it works fine.   However, in Hadoop, if I output a value with no
 tabs,
 Hadoop appends a trailing "\t", apparently interpreting my output as a
 (value, "") KV pair.  I'd like to avoid outputting this trailing tab if
 possible.

 Is there a command line option that could be used to effect this?  More
 generally, is there something wrong with outputting arbitrary strings,
 instead of key-value pairs, in your reduce step?



Re: Value-Only Reduce Output

2009-02-03 Thread jason hadoop
Oops, you are using streaming, and I am not familiar with it.
As a terrible hack, you could set mapred.textoutputformat.separator to the
empty string in your configuration.
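
For a native Java job the same property can be set through JobConf; a
streaming job would pass it as a job configuration property on the command
line instead. A minimal sketch (MyJob is a placeholder class):

  JobConf conf = new JobConf(MyJob.class);
  // TextOutputFormat reads this property for its key/value separator;
  // an empty string means nothing is inserted between key and value.
  conf.set("mapred.textoutputformat.separator", "");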

On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop jason.had...@gmail.com wrote:

 If you are using the standard TextOutputFormat, and the output collector is
 passed a null for the value, there will not be a trailing tab character
 added to the output line.

 output.collect( key, null );
 Will give you the behavior you are looking for if your configuration is as
 I expect.


 On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl j...@yelp.com wrote:

 Hello,

 I'm interested in a map-reduce flow where I output only values (no keys)
 in
 my reduce step.  For example, imagine the canonical word-counting program
 where I'd like my output to be an unlabeled histogram of counts instead of
 (word, count) pairs.

 I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
 my python scripts).  When I simulate the map reduce using pipes and sort
 in
 bash, it works fine.   However, in Hadoop, if I output a value with no
 tabs,
 Hadoop appends a trailing "\t", apparently interpreting my output as a
 (value, "") KV pair.  I'd like to avoid outputting this trailing tab if
 possible.

 Is there a command line option that could be used to effect this?  More
 generally, is there something wrong with outputting arbitrary strings,
 instead of key-value pairs, in your reduce step?





Re: Control over max map/reduce tasks per job

2009-02-03 Thread Bryan Duxbury

This sounds good enough for a JIRA ticket to me.
-Bryan

On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote:


Chris,

For my specific use cases, it would be best to be able to set N
mappers/reducers per job per node (so I can explicitly say: run at most 2
of this CPU-bound task at a time on any given node).  However, the other
way would work as well (on a 10-node system, I would set the job to a max
of 20 tasks at a time globally), but it opens up the possibility that a
node could be assigned more than 2 of that task.

I would work with whatever is easiest to implement, as either would be a
vast improvement for me (I could run high numbers of network-latency-bound
tasks without fear of CPU-bound tasks killing the cluster).

JG




-Original Message-
From: Chris K Wensel [mailto:ch...@wensel.net]
Sent: Tuesday, February 03, 2009 11:34 AM
To: core-user@hadoop.apache.org
Subject: Re: Control over max map/reduce tasks per job

Hey Jonathan

Are you looking to limit the total number of concurrent mappers/reducers a
single job can consume cluster-wide, or to limit the number per node?

That is, you have X mappers/reducers, but can only allow N mappers/reducers
to run at a time globally, for a given job.

Or, you are cool with all X running concurrently globally, but want to
guarantee that no node can run more than N tasks from that job?

Or both?

just reconciling the conversation we had last week with this thread.

ckw

On Feb 3, 2009, at 11:16 AM, Jonathan Gray wrote:


All,



I have a few relatively small clusters (5-20 nodes) and am having trouble
keeping them loaded with my MR jobs.

The primary issue is that I have different jobs with drastically different
patterns.  I have jobs that read/write to/from HBase or Hadoop with minimal
logic (network-throughput or IO bound), others that perform crawling
(network-latency bound), and one huge parsing streaming job (very CPU
bound, each task eats a core).

I'd like to launch very large numbers of tasks for the network-latency-bound
jobs; however, the large CPU-bound job means I have to keep the max maps
allowed per node low enough so as not to starve the DataNode and
RegionServer.

I'm an HBase dev but not familiar enough with the Hadoop MR code to even
know what would be involved in implementing this.  However, in talking with
other users, it seems like this would be a well-received option.

I wanted to ping the list before filing an issue because it seems like
someone may have thought about this in the past.



Thanks.



Jonathan Gray



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/






Re: Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem

2009-02-03 Thread Alejandro Abdelnur
Mikhail,

You are right, please open a Jira on this.

Alejandro


On Wed, Jan 28, 2009 at 9:23 PM, Mikhail Yakshin
greycat.na@gmail.com wrote:

 Hi,

 We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm
 trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious
 problem I've run into is that we're extensively using MultipleOutputs in
 our jobs dealing with sequence files that store Cascading's Tuples.

 Since Cascading 0.9, Tuples stopped being WritableComparable and
 implemented the generic Hadoop serialization interface and framework.
 However, in Hadoop 0.19, MultipleOutputs requires the older
 WritableComparable interface. Thus, trying to do something like:

 MultipleOutputs.addNamedOutput(conf, "output-name",
 MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class);
 mos = new MultipleOutputs(conf);
 ...
 mos.getCollector("output-name", reporter).collect(tuple1, tuple2);

 yields an error:

 java.lang.RuntimeException: java.lang.RuntimeException: class
 cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable
at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
at
 org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252)
at
 org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556)
at
 org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425)
at
 org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511)
at
 org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476)
at my.namespace.MyReducer.reduce(MyReducer.java:xxx)

 Is there any known workaround for that? Any progress going on to make
 MultipleOutputs use generic Hadoop serialization?

 --
 WBR, Mikhail Yakshin



easy way to create IntelliJ project for Hadoop trunk?

2009-02-03 Thread Alejandro Abdelnur
With Ivy fetching JARs as part of the build and several src dirs, creating
an IDE project (IntelliJ in my case) becomes a pain.

Any easy way of doing it?

Thxs.

Alejandro