A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
Hi,
I tried to use SequenceFileInputFormat, for this I appended SEQ as first
bytes of my binary files (with hex editor).
but I get this exception:

A record version mismatch occured. Expecting v6, found v32
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

What could it be? Is it not enough just to add SEQ to binary files?
I use Hadoop v.0.19.0 .

Thanks in advance..
Rasit


different *version* of *Hadoop* between your server and your client.

-- 
M. Raşit ÖZDAŞ


Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
Is there any reason why it has to be a single SequenceFile? You could
write a local program to write several block compressed SequenceFiles
in parallel (to HDFS), each containing a portion of the files on your
PC.

Tom
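
For reference, a minimal sketch (not from the thread; class names and paths are
illustrative) of the kind of local program Tom describes: it walks one portion of
the local files and writes each one as a record of a block-compressed SequenceFile
on HDFS, so several copies can run in parallel over different portions.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LocalFilesToSequenceFile {
    public static void main(String[] args) throws IOException {
        File localDir = new File(args[0]);   // one portion of the PC's files
        Path output = new Path(args[1]);     // e.g. hdfs://namenode/user/mark/part-0.seq
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(output.toUri(), conf);

        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            for (File f : localDir.listFiles()) {
                if (!f.isFile()) {
                    continue;
                }
                byte[] bytes = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    IOUtils.readFully(in, bytes, 0, bytes.length);
                } finally {
                    in.close();
                }
                // key = original file path, value = raw file contents
                writer.append(new Text(f.getPath()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}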

On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Truly, I do not see any advantage to doing this, as opposed to writing
 (Java) code which will copy files to HDFS, because then tarring becomes my
 bottleneck. Unless I write code measure the file sizes and prepare pointers
 for multiple tarring tasks. It becomes pretty complex though, and I thought
 of something simple. I might as well accept that copying one hard drive to
 HDFS is not going to be parallelized.
 Mark

 On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
 f...@infochimps.orgwrote:

 Could you tar.bz2 them up (setting up the tar so that it made a few dozen
 files), toss them onto the HDFS, and use
 http://stuartsierra.com/2008/04/24/a-million-little-files
 to go into SequenceFile?

 This lets you preserve the originals and do the sequence file conversion
 across the cluster. It's only really helpful, of course, if you also want
 to
 prepare a .tar.bz2 so you can clear out the sprawl

 flip

 On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
 wrote:

  Hi,
 
  I am writing an application to copy all files from a regular PC to a
  SequenceFile. I can surely do this by simply recursing all directories on
  my
  PC, but I wonder if there is any way to parallelize this, a MapReduce
 task
  even. Tom White's books seems to imply that it will have to be a custom
  application.
 
  Thank you,
  Mark
 



 --
 http://www.infochimps.org
 Connected Open Free Data




Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
No, no reason for a single file - just a little simpler to think about. By
the way, can multiple MapReduce workers read the same SequenceFile
simultaneously?

On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote:

 Is there any reason why it has to be a single SequenceFile? You could
 write a local program to write several block compressed SequenceFiles
 in parallel (to HDFS), each containing a portion of the files on your
 PC.

 Tom

 On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Truly, I do not see any advantage to doing this, as opposed to writing
  (Java) code which will copy files to HDFS, because then tarring becomes
 my
  bottleneck. Unless I write code measure the file sizes and prepare
 pointers
  for multiple tarring tasks. It becomes pretty complex though, and I
 thought
  of something simple. I might as well accept that copying one hard drive
 to
  HDFS is not going to be parallelized.
  Mark
 
  On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
  f...@infochimps.orgwrote:
 
  Could you tar.bz2 them up (setting up the tar so that it made a few
 dozen
  files), toss them onto the HDFS, and use
  http://stuartsierra.com/2008/04/24/a-million-little-files
  to go into SequenceFile?
 
  This lets you preserve the originals and do the sequence file conversion
  across the cluster. It's only really helpful, of course, if you also
 want
  to
  prepare a .tar.bz2 so you can clear out the sprawl
 
  flip
 
  On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
  wrote:
 
   Hi,
  
   I am writing an application to copy all files from a regular PC to a
   SequenceFile. I can surely do this by simply recursing all directories
 on
   my
   PC, but I wonder if there is any way to parallelize this, a MapReduce
  task
   even. Tom White's books seems to imply that it will have to be a
 custom
   application.
  
   Thank you,
   Mark
  
 
 
 
  --
  http://www.infochimps.org
  Connected Open Free Data
 
 



Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Tom White
The SequenceFile format is described here:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html.
The format of the keys and values depends on the serialization classes
used. For example, BytesWritable writes out the length of its byte
array followed by the actual bytes in the array (see the write()
method in BytesWritable).

Hope this helps.
Tom
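
As an illustration of that framing, here is a minimal sketch (not from the
thread; it assumes the file was written with NullWritable keys and BytesWritable
values, as Rasit describes) of reading the records back. The stored bytes differ
from the raw input because BytesWritable serializes a 4-byte length prefix before
the payload, and the SequenceFile adds its own header and record framing; reading
through BytesWritable recovers the original bytes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class ReadBinaryRecords {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        FileSystem fs = FileSystem.get(path.toUri(), conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            NullWritable key = NullWritable.get();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // getBytes() returns a backing array that may be larger than the
                // record, so only the first getLength() bytes are valid.
                byte[] original = new byte[value.getLength()];
                System.arraycopy(value.getBytes(), 0, original, 0, value.getLength());
                // 'original' now holds the same bytes that were read from the
                // binary file before it was wrapped in the SequenceFile.
            }
        } finally {
            reader.close();
        }
    }
}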

On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote:
 I tried to use SequenceFile.Writer to convert my binaries into Sequence
 Files,
 I read the binary data with FileInputStream, getting all bytes with
 reader.read(byte[])  , wrote it to a file with SequenceFile.Writer, with
 parameters NullWritable as key, BytesWritable as value. But the content
 changes,
 (I can see that by converting to Base64)

 Binary File:
 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65
 103 54 81 65 65 97 81 65 65 65 81 ...

 Sequence File:
 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67
 69 77 65 52 80 86 67 65 73 68 114 ...

 Thanks for any points..
 Rasit

 2009/2/2 Rasit OZDAS rasitoz...@gmail.com

 Hi,
 I tried to use SequenceFileInputFormat, for this I appended SEQ as first
 bytes of my binary files (with hex editor).
 but I get this exception:

 A record version mismatch occured. Expecting v6, found v32
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
 at
 org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
 at
 org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)

 What could it be? Is it not enough just to add SEQ to binary files?
 I use Hadoop v.0.19.0 .

 Thanks in advance..
 Rasit


 different *version* of *Hadoop* between your server and your client.

 --
 M. Raşit ÖZDAŞ




 --
 M. Raşit ÖZDAŞ



Re: settin JAVA_HOME...

2009-02-02 Thread Sandy
It's exactly as Steve says. Sorry, I should have been clearer in my last
e-mail. I have had really bad experiences with any JDK other than Sun's (the
default on Ubuntu, gcj, etc.), so it may be easiest to just use Sun's JDK.

To stop all hadoop processes use:
bin/stop-all.sh

To start them, use:
bin/start-all.sh

Whenever you make a change in your conf/hadoop-env.sh or
conf/hadoop-site.xml files, you will need to restart hadoop using the above
two scripts.

All the best,

-SM

On Mon, Feb 2, 2009 at 4:40 AM, Steve Loughran ste...@apache.org wrote:

 haizhou zhao wrote:

 hi Sandy,
 Every time I change the conf, I have to do the following two things:
 1. kill all hadoop processes
 2. manually delete all the files under hadoop.tmp.dir
 to make sure hadoop runs correctly; otherwise it won't work.

 Is this caused by my using a JDK instead of Sun Java?



 No, you need to do that to get configuration changes picked up. There are
 scripts in hadoop/bin to help you

  and what do you mean
 by sun-java, please?



 Sandy means

 * sun-java6-jdk: Sun's released JDK
 * default-jdk: whatever JDK Ubuntu chooses. On 8.10, it is OpenJDK
 * openjdk-6-jdk: the full open source version of the JDK. Worse font
 rendering code, but comes with more source

 Others
 * Oracle JRockit: good 64-bit memory management, based on the sun JDK
 unsupported
 * IBM JVM unsupported. Based on the sun JDK
 * Apache Harmony: clean room rewrite of everything. unsupported
 * Kaffe. unsupported
 * Gcj. unsupported

 type java -version to get your java version


 Sun

 java version 1.6.0_10
 Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
 Java HotSpot(TM) Server VM (build 11.0-b14, mixed mode)

 JRockit:

 java version 1.6.0_02
 Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
 BEA JRockit(R) (build R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64,
 compiled mode)








 2009/1/31 Sandy snickerdoodl...@gmail.com

  Hi Zander,

 Do not use jdk. Horrific things happen. You must use sun java in order to
 use hadoop.






SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-02 Thread Brian Long
Let me rephrase this problem... as stated below, when I start writing to a
SequenceFile from an HDFS client, nothing is visible in HDFS until I've
written 64M of data. This presents three problems: fsck reports the file
system as corrupt until the first block is finally written out, the presence
of the file (without any data) seems to blow up my mapred jobs that try to
make use of it under my input path, and finally, I want to basically flush
every 15 minutes or so so I can mapred the latest data.
I don't see any programmatic way to force the file to flush in 17.2.
Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that
not do what I think it does? What controls the 64M limit, anyway? Is it
dfs.checkpoint.size or dfs.block.size? Is the buffering happening on the
client, or on data nodes? Or in the namenode?

It seems really bad that a SequenceFile, upon creation, is in an unusable
state from the perspective of a mapred job, and also leaves fsck in a
corrupt state. Surely I must be doing something wrong... but what? How can I
ensure that a SequenceFile is immediately usable (but empty) on creation,
and how can I make things flush on some regular time interval?

Thanks,
Brian


On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote:

 I have a SequenceFile.Writer that I obtained via SequenceFile.createWriter
 and write to using append(key, value). Because the writer volume is low,
 it's not uncommon for it to take over a day for my appends to finally be
 flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
 Because I am running map/reduce tasks on this data multiple times a day, I
 want to flush the sequence file so the mapred jobs can pick it up when
 they run.
 What's the right way to do this? I'm assuming it's a fairly common use
 case. Also -- are writes to the sequence files atomic? (e.g. if I am
 actively appending to a sequence file, is it always safe to read from that
 same file in a mapred job?)

 To be clear, I want the flushing to be time based (controlled explicitly by
 the app), not size based. Will this create waste in HDFS somehow?

 Thanks,
 Brian




Re: settin JAVA_HOME...

2009-02-02 Thread Steve Loughran

haizhou zhao wrote:

hi Sandy,
Every time I change the conf, I have to do the following two things:
1. kill all hadoop processes
2. manually delete all the files under hadoop.tmp.dir
to make sure hadoop runs correctly; otherwise it won't work.

Is this caused by my using a JDK instead of Sun Java?



No, you need to do that to get configuration changes picked up. There 
are scripts in hadoop/bin to help you



and what do you mean
by sun-java, please?



Sandy means

* sun-java6-jdk: Sun's released JDK
* default-jdk: whatever JDK Ubuntu chooses. On 8.10, it is OpenJDK
* openjdk-6-jdk: the full open source version of the JDK. Worse font
rendering code, but comes with more source


Others
* Oracle JRockit: good 64-bit memory management, based on the sun JDK 
unsupported

* IBM JVM unsupported. Based on the sun JDK
* Apache Harmony: clean room rewrite of everything. unsupported
* Kaffe. unsupported
* Gcj. unsupported

type java -version to get your java version


Sun

java version 1.6.0_10
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b14, mixed mode)

JRockit:

java version 1.6.0_02
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
BEA JRockit(R) (build 
R27.4.0-90-89592-1.6.0_02-20070928-1715-linux-x86_64, compiled mode)









2009/1/31 Sandy snickerdoodl...@gmail.com


Hi Zander,

Do not use jdk. Horrific things happen. You must use sun java in order to
use hadoop.





Re: problem with completion notification from block movement

2009-02-02 Thread Karl Kleinpaste
On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote:
 The Datanode's use multiple threads with locking and one of the
 assumptions is that the block report (1ce per hour by default) takes
 little time. The datanode will pause while the block report is running
 and if it happens to take a while weird things start to happen.

Thank you for responding, this is very informative for us.

Having looked through the source code with a co-worker regarding the
periodic scan and then checking the logs once again, we find reports of
this sort:

BlockReport of 1158499 blocks got processed in 308860 msecs
BlockReport of 1159840 blocks got processed in 237925 msecs
BlockReport of 1161274 blocks got processed in 177853 msecs
BlockReport of 1162408 blocks got processed in 285094 msecs
BlockReport of 1164194 blocks got processed in 184478 msecs
BlockReport of 1165673 blocks got processed in 226401 msecs

The 3rd of these exactly straddles the particular example timeline I
discussed in my original email about this question.  I suspect I'll find
more of the same as I look through other related errors.

--karl



Re: MapFile.Reader and seek

2009-02-02 Thread Tom White
You can use the get() method to seek and retrieve the value. It will
return null if the key is not in the map. Something like:

Text value = (Text) indexReader.get(from, new Text());
while (value != null && ...)

Tom
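
A slightly fuller sketch of that pattern (key and value types assumed to be Text,
as in the original question; the range check is left as a placeholder):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileRangeScan {
    // Reads entries from 'from' (if present) up to but not including 'to'.
    public static void scan(FileSystem fs, Path path, Configuration conf,
                            Text from, Text to) throws IOException {
        MapFile.Reader reader = new MapFile.Reader(fs, path.toString(), conf);
        try {
            Text key = new Text(from);
            Text value = new Text();
            // get() positions the reader on 'from' and returns its value,
            // or null if 'from' is not in the MapFile.
            Text found = (Text) reader.get(key, value);
            while (found != null && key.compareTo(to) < 0) {
                // ... use key and value here ...
                if (!reader.next(key, value)) {
                    break;   // reached the end of the MapFile
                }
                found = value;
            }
        } finally {
            reader.close();
        }
    }
}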

On Thu, Jan 29, 2009 at 10:45 PM, schnitzi
mark.schnitz...@fastsearch.com wrote:

 Greetings all...  I have a situation where I want to read a range of keys and
 values out of a MapFile.  So I have something like this:

MapFile.Reader indexReader = new MapFile.Reader(fs, path.toString(),
 configuration)
boolean seekSuccess = indexReader.seek(from);
boolean readSuccess = indexReader.next(keyValue, value);
 while (readSuccess && ...)

 The problem seems to be that while seekSuccess is returning true, when I
 call next() to get the value there, it's returning the value *after* the key
 that I called seek() on.  So if, say, my keys are Text(id0) through
 Text(id9), and I seek for Text(id3), calling next() will return
 Text(id4) and its associated value, not Text(id3).

 I would expect next() to return the key/value at the seek location, not the
 one after it.  Am I doing something wrong?  Otherwise, what good is seek(),
 really?
 --
 View this message in context: 
 http://www.nabble.com/MapFile.Reader-and-seek-tp21737717p21737717.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Mark Kerzner
Truly, I do not see any advantage to doing this, as opposed to writing
(Java) code which will copy files to HDFS, because then tarring becomes my
bottleneck. Unless I write code to measure the file sizes and prepare pointers
for multiple tarring tasks. It becomes pretty complex though, and I thought
of something simple. I might as well accept that copying one hard drive to
HDFS is not going to be parallelized.
Mark

On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
f...@infochimps.orgwrote:

 Could you tar.bz2 them up (setting up the tar so that it made a few dozen
 files), toss them onto the HDFS, and use
 http://stuartsierra.com/2008/04/24/a-million-little-files
 to go into SequenceFile?

 This lets you preserve the originals and do the sequence file conversion
 across the cluster. It's only really helpful, of course, if you also want
 to
 prepare a .tar.bz2 so you can clear out the sprawl

 flip

 On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
 wrote:

  Hi,
 
  I am writing an application to copy all files from a regular PC to a
  SequenceFile. I can surely do this by simply recursing all directories on
  my
  PC, but I wonder if there is any way to parallelize this, a MapReduce
 task
  even. Tom White's books seems to imply that it will have to be a custom
  application.
 
  Thank you,
  Mark
 



 --
 http://www.infochimps.org
 Connected Open Free Data



Re: job management in Hadoop

2009-02-02 Thread Bill Au
Thanks.  I see that ACL is implemented in 0.19.0.  I think that's only for
job management from the command line, right?  Is there any ACL for
the web interface?

Bill

On Fri, Jan 30, 2009 at 6:23 PM, Bhupesh Bansal bban...@linkedin.comwrote:

 Bill,

 Currently you can kill the job from the UI.
 You have to enable the config in hadoop-default.xml

  <name>webinterface.private.actions</name> to be true

 Best
 Bhupesh


 On 1/30/09 3:23 PM, Bill Au bill.w...@gmail.com wrote:

  Thanks.
 
  Anyone knows if there is plan to add this functionality to the web UI
 like
  job priority can be changed from both the command line and the web UI?
 
  Bill
 
  On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy a...@yahoo-inc.com
 wrote:
 
 
  On Jan 30, 2009, at 2:41 PM, Bill Au wrote:
 
   Is there any way to cancel a job after it has been submitted?
 
 
  bin/hadoop job -kill jobid
 
  Arun
 




Re: HDFS Appends in 0.19.0

2009-02-02 Thread Craig Macdonald

Hi Arifa,

The O_APPEND flag is the subject of
https://issues.apache.org/jira/browse/HADOOP-4494

Craig

Arifa Nisar wrote:

Hello All,

I am using hadoop 0.19.0, whose release notes includes HADOOP-1700
Introduced append operation for HDFS files. I am trying to test this new
feature using my test program. I have experienced that O_APPEND flag added
in hdfsopen() is ignored by libhdfs. Also, only WRONLY and RDONLY are
defined in hdfs.h. Please let me know how to use append functionality in
this release.

Thanks,
Arifa.

  




Re: decommissioned node showing up ad dead node in web based interface to namenode (dfshealth.jsp)

2009-02-02 Thread Bill Au
It looks like the behavior is the same with 0.18.2 and 0.19.0.  Even though
I removed the decommissioned node from the exclude file and ran the
refreshNodes command, the decommissioned node still shows up as a dead node.
What I did notice is that if I leave the decommissioned node in the exclude
file and restart HDFS, the node will show up as a dead node after restart.  But
then if I remove it from the exclude file and run the refreshNodes command,
it will disappear from the status page (dfshealth.jsp).

So it looks like I will have to stop and start the entire cluster in order
to get what I want.

Bill

On Thu, Jan 29, 2009 at 5:40 PM, Bill Au bill.w...@gmail.com wrote:

 Not sure why but this does not work for me.  I am running 0.18.2.  I ran
 hadoop dfsadmin -refreshNodes after removing the decommissioned node from
 the exclude file.  It still shows up as a dead node.  I also removed it from
 the slaves file and ran the refresh nodes command again.  It still shows up
 as a dead node after that.

 I am going to upgrade to 0.19.0 to see if it makes any difference.

 Bill


 On Tue, Jan 27, 2009 at 7:01 PM, paul paulg...@gmail.com wrote:

 Once the nodes are listed as dead, if you still have the host names in
 your
 conf/exclude file, remove the entries and then run hadoop dfsadmin
 -refreshNodes.


 This works for us on our cluster.



 -paul


 On Tue, Jan 27, 2009 at 5:08 PM, Bill Au bill.w...@gmail.com wrote:

  I was able to decommission a datanode successfully without having to
 stop
  my
  cluster.  But I noticed that after a node has been decommissioned, it
 shows
  up as a dead node in the web base interface to the namenode (ie
  dfshealth.jsp).  My cluster is relatively small and losing a datanode
 will
  have performance impact.  So I have a need to monitor the health of my
  cluster and take steps to revive any dead datanode in a timely fashion.
  So
  is there any way to altogether get rid of any decommissioned datanode
  from
  the web interace of the namenode?  Or is there a better way to monitor
 the
  health of the cluster?
 
  Bill
 





Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
Yes. SequenceFile is splittable, which means it can be broken into
chunks, called splits, each of which can be processed by a separate
map task.

Tom
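
A minimal sketch (hypothetical paths and class names) of pointing a 0.19-era job
at those SequenceFiles with the old mapred API: because the format is splittable,
each file yields one or more splits, and each split is read by its own map task,
so many workers process the data concurrently.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class ProcessSequenceFiles {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ProcessSequenceFiles.class);
        conf.setJobName("process-sequence-files");

        // Read the SequenceFiles written by the loader (file name -> contents).
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BytesWritable.class);

        // conf.setMapperClass(MyMapper.class);   // application-specific mapper

        FileInputFormat.setInputPaths(conf, new Path("/user/mark/seqfiles"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/mark/output"));

        JobClient.runJob(conf);
    }
}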

On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner markkerz...@gmail.com wrote:
 No, no reason for a single file - just a little simpler to think about. By
 the way, can multiple MapReduce workers read the same SequenceFile
 simultaneously?

 On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote:

 Is there any reason why it has to be a single SequenceFile? You could
 write a local program to write several block compressed SequenceFiles
 in parallel (to HDFS), each containing a portion of the files on your
 PC.

 Tom

 On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Truly, I do not see any advantage to doing this, as opposed to writing
  (Java) code which will copy files to HDFS, because then tarring becomes
 my
  bottleneck. Unless I write code measure the file sizes and prepare
 pointers
  for multiple tarring tasks. It becomes pretty complex though, and I
 thought
  of something simple. I might as well accept that copying one hard drive
 to
  HDFS is not going to be parallelized.
  Mark
 
  On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
  f...@infochimps.orgwrote:
 
  Could you tar.bz2 them up (setting up the tar so that it made a few
 dozen
  files), toss them onto the HDFS, and use
  http://stuartsierra.com/2008/04/24/a-million-little-files
  to go into SequenceFile?
 
  This lets you preserve the originals and do the sequence file conversion
  across the cluster. It's only really helpful, of course, if you also
 want
  to
  prepare a .tar.bz2 so you can clear out the sprawl
 
  flip
 
  On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
  wrote:
 
   Hi,
  
   I am writing an application to copy all files from a regular PC to a
   SequenceFile. I can surely do this by simply recursing all directories
 on
   my
   PC, but I wonder if there is any way to parallelize this, a MapReduce
  task
   even. Tom White's books seems to imply that it will have to be a
 custom
   application.
  
   Thank you,
   Mark
  
 
 
 
  --
  http://www.infochimps.org
  Connected Open Free Data
 
 




Re: Hadoop Streaming Semantics

2009-02-02 Thread S D
Thanks for your response. I'm using version 0.19.0 of Hadoop.
I tried your suggestion. Here is the line I use to invoke Hadoop

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \\
   -input /user/hadoop/hadoop-input/inputFile.txt \\
   -output /user/hadoop/hadoop-output \\
   -mapper map-script.sh \\
   -file map-script.sh \\
   -file additional-script.rb \\ # Called by map-script.sh
   -file utils.rb \\
   -file env.sh \\
   -file aws-s3-credentials-file \\# For permissions to use AWS::S3
   -jobconf mapred.reduce.tasks=0 \\
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

Everything works fine if the -inputformat switch is not included but when I
include it I get the following message:
   ERROR streaming.StreamJob: Job not Successful!
and a Runtime exception shows up in the jobtracker log:
   PipeMapRed.waitOutputThreads(): subprocess failed with code 1

My map functions read each line of the input file and create a directory
(one for each line) on Hadoop (in our case S3 Native) in which corresponding
data is produced and stored. The name of the created directories are based
on the contents of the corresponding line. When I include the -inputformat
line above I've noticed that instead of the directories I'm expecting (named
after the data found in the input file), the directories are given seemingly
arbitrary numeric names; e.g., when the input file contained four lines of
data, the directories were named: 0, 273, 546 and 819.

Any thoughts?

John

On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:

 Which version of hadoop are you using?

 You can directly use -inputformat
 org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You
 need not include it in your streaming jar.
 -Amareshwari


 S D wrote:

 Thanks for your response Amereshwari. I'm unclear on how to take advantage
 of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
 streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to
 include the NLineInputFormat class and then pass a command line
 configuration param to indicate that NLineInputFormat should be used? If
 this is the proper approach, can you point me to an example of what kind
 of
 param should be specified? I appreciate your help.

 Thanks,
 SD

 On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu 
 amar...@yahoo-inc.com wrote:



 You can use NLineInputFormat for this, which splits one line (N=1, by
 default) as one split.
 So, each map task processes one line.
 See

 http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

 -Amareshwari

 S D wrote:



 Hello,

 I have a clarifying question about Hadoop streaming. I'm new to the list
 and
 didn't see anything posted that covers my questions - my apologies if I
 overlooked a relevant post.

 I have an input file consisting of a list of files (one per line) that
 need
 to be processed independently of each other. The duration for processing
 each file is significant - perhaps an hour each. I'm using Hadoop
 streaming
 without a reduce function to process each file and save the results
 (back
 to
 S3 native in my case). To handle to long processing time of each file
 I've
 set mapred.task.timeout=0 and I have a pretty straight forward Ruby
 script
 reading from STDIN:

 STDIN.each_line do |line|
  # Get file from contents of line
  # Process file (long running)
 end

 Currently I'm using a cluster of 3 workers in which each worker can have
 up
 to 2 tasks running simultaneously. I've noticed that if I have a single
 input file with many lines (more than 6 given my cluster), then not all
 workers will be allocated tasks; I've noticed two workers being
 allocated
 one task each and the other worker sitting idly. If I split my input
 file
 into multiple files (at least 6) then all workers will be immediately
 allocated the maximum number of tasks that they can handle.

 My interpretation on this is fuzzy. It seems that Hadoop streaming will
 take
 separate input files and allocate a new task per file (up to the maximum
 constraint) but if given a single input file it is unclear as to whether
 a
 new task is allocated per file or line. My understanding of Hadoop Java
 is
 that (unlike Hadoop streaming) when given a single input file, the file
 will
 be broken up into separate lines and the maximum number of map tasks
 will
 automagically be allocated to handle the lines of the file (assuming the
 use
 of TextInputFormat).

 Can someone clarify this?

 Thanks,
 SD














RE: HDFS Appends in 0.19.0

2009-02-02 Thread Arifa Nisar
Does that mean libhdfs doesn't support append functionality in 0.19.0, but
if I write a Java test program to test hdfs append functionality then it
should work? Do I need to apply all the patches given at
https://issues.apache.org/jira/browse/HADOOP-1700 to test append
functionality using TestFileAppend*.java? Or just having hadoop 0.19.0
should work?

Thanks,
Arifa. 
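
Not from this thread, but for reference, a minimal Java sketch of exercising the
append call that HADOOP-1700 added to the FileSystem API in 0.19 (this assumes
append is actually enabled on the cluster; the feature was still considered
experimental in that release):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);          // an existing HDFS file
        FileSystem fs = file.getFileSystem(conf);

        // append() returns a stream positioned at the current end of the file.
        FSDataOutputStream out = fs.append(file);
        try {
            out.writeBytes("appended line\n");
        } finally {
            out.close();
        }
    }
}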

 -Original Message-
From: Craig Macdonald [mailto:cra...@dcs.gla.ac.uk] 
Sent: Monday, February 02, 2009 3:58 PM
To: core-user@hadoop.apache.org
Subject: Re: HDFS Appends in 0.19.0

Hi Arifa,

The O_APPEND flag is the subject of
https://issues.apache.org/jira/browse/HADOOP-4494

Craig

Arifa Nisar wrote:
 Hello All,

 I am using hadoop 0.19.0, whose release notes includes HADOOP-1700
 Introduced append operation for HDFS files. I am trying to test this new
 feature using my test program. I have experienced that O_APPEND flag added
 in hdfsopen() is ignored by libhdfs. Also, only WRONLY and RDONLY are
 defined in hdfs.h. Please let me know how to use append functionality in
 this release.

 Thanks,
 Arifa.

   



Hadoop User Group UK Meetup - April 14th

2009-02-02 Thread Johan Oskarsson
I've started organizing the next Hadoop meetup in London, UK. The date
is April 14th and the presentations so far include:

Michael Stack (Powerset): Apache HBase
Isabel Drost (Neofonie): Introducing Apache Mahout
Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier
Paolo Castagna (HP): Having Fun with PageRank and MapReduce

Keep an eye on the blog for updates: http://huguk.org/

Help in the form of sponsoring (venue, beer etc) would be much
appreciated. Also let me know if you want to present. Personally I'd
love to see presentations from other Hadoop related projects (pig, hive,
hama etc).

/Johan


Hadoop's reduce tasks are freezes at 0%.

2009-02-02 Thread Kwang-Min Choi
I'm a newbie in Hadoop,
and I'm trying to follow the Hadoop Quick Guide on the Hadoop homepage,
but there are some problems...
Downloading and unzipping Hadoop is done,
and ssh successfully operates without a password phrase.

Once I execute the grep example attached to Hadoop...
the map task is OK, it reaches 100%,
but the reduce task freezes at 0% without any error message.
I've waited for more than 1 hour, but it still freezes...
The same job in standalone mode works fine...

I tried it with versions 0.18.3 and 0.17.2.1;
all of them had the same problem.

Could you help me solve this problem?

Additionally...
I'm working on cloud infrastructure from GoGrid (Red Hat).
So, disk space & health are OK,
and I've installed JDK 1.6.11 for Linux successfully.


- KKwams 


hadoop to ftp files into hdfs

2009-02-02 Thread Steve Morin
Does anyone have a good suggestion on how to submit a hadoop job that
will split the ftp retrieval of a number of files for insertion into
hdfs?  I have been searching google for suggestions on this matter.
Steve


Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
I tried to use SequenceFile.Writer to convert my binaries into Sequence
Files,
I read the binary data with FileInputStream, getting all bytes with
reader.read(byte[])  , wrote it to a file with SequenceFile.Writer, with
parameters NullWritable as key, BytesWritable as value. But the content
changes,
(I can see that by converting to Base64)

Binary File:
73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65
103 54 81 65 65 97 81 65 65 65 81 ...

Sequence File:
73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67
69 77 65 52 80 86 67 65 73 68 114 ...

Thanks for any points..
Rasit

2009/2/2 Rasit OZDAS rasitoz...@gmail.com

 Hi,
 I tried to use SequenceFileInputFormat, for this I appended SEQ as first
 bytes of my binary files (with hex editor).
 but I get this exception:

 A record version mismatch occured. Expecting v6, found v32
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
 at
 org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
 at
 org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)

 What could it be? Is it not enough just to add SEQ to binary files?
 I use Hadoop v.0.19.0 .

 Thanks in advance..
 Rasit


 different *version* of *Hadoop* between your server and your client.

 --
 M. Raşit ÖZDAŞ




-- 
M. Raşit ÖZDAŞ


Scale Unlimited Professionals Program

2009-02-02 Thread Chris K Wensel

Hey All

Just wanted to let everyone know that Scale Unlimited will start  
offering many of its courses heavily discounted, if not free, to  
independent consultants and contractors.

http://www.scaleunlimited.com/programs

We are doing this because we receive a number of consulting/ 
contracting opportunities that we wish to delegate back to trusted  
consultants and developers.


But more importantly, many consultants don't have time to learn Hadoop  
and related technologies, so Hadoop is often overlooked on new  
projects. We would like to get more developers comfortable knowing  
when and when not to use Hadoop on a project, ultimately leading to  
more projects using Hadoop and to Hadoop becoming more stable and  
feature rich in the process.


We plan to offer our Hadoop Boot Camp for FREE in the Bay Area in the  
next few weeks. If interested in participating, email me directly.

http://www.scaleunlimited.com/courses/hadoop-boot-camp

Note this offer is limited to professional independent consultants,  
contractors, and small boutique contracting firms that are looking to  
expand their tool base. We only ask for an industry standard referral  
fee for any projects that result from a referral, if any.


To be added to our referral list or if you have a project that might  
benefit from Hadoop or related technologies, please email me directly.


This course will also be announced for open public enrollment in the  
coming days.


cheers,
chris

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Transferring data between different Hadoop clusters

2009-02-02 Thread Taeho Kang
Thanks for your prompt reply.

When using the command
./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path

- Should this command be given in cluster1?
- What does port 50070 specify? Is it the one in fs.default.name, or
dfs.http.address?

/Taeho



On Mon, Feb 2, 2009 at 12:40 PM, Mark Chadwick mchadw...@invitemedia.comwrote:

 Taeho,

 The distcp command is perfect for this.  If you're copying between two
 clusters running the same version of Hadoop, you can do something like:

 ./bin/hadoop distcp hdfs://cluster1/path hdfs://cluster2/path

 If you're copying between 0.18 and 0.19, the command will look like:

 ./bin/hadoop distcp hftp://cluster1:50070/path hdfs://cluster2/path

 Hope that helps,
 -Mark

 On Sun, Feb 1, 2009 at 9:48 PM, Taeho Kang tka...@gmail.com wrote:

  Dear all,
 
  There have been times where I needed to transfer some big data from one
  version of Hadoop cluster to another.
  (e.g. from hadoop 0.18 to hadoop 0.19 cluster)
 
  Other than copying files from one cluster to a local file system and
 upload
  it to another,
  is there a tool that does it?
 
  Thanks in advance,
  Regards,
 
  /Taeho
 



My tasktrackers keep getting lost...

2009-02-02 Thread Ian Soboroff

I hope someone can help me out.  I'm getting started with Hadoop, 
have written the first part of my project (a custom InputFormat), and am
now using that to test out my cluster setup.

I'm running 0.19.0.  I have five dual-core Linux workstations with most
of a 250GB disk available for playing, and am controlling things from my
Mac Pro.  (This is not the production cluster, that hasn't been
assembled yet.  This is just to get the code working and figure out the
bumps.)

My test data is about 18GB of web pages, and the test app at the moment
just counts the number of web pages in each bundle file.  The map jobs
run just fine, but when it gets into the reduce, the TaskTrackers all
get lost to the JobTracker.  I can't see why, because the TaskTrackers
are all still running on the slaves.  Also, the jobdetails URL starts
returning an HTTP 500 error, although other links from that page still
work.

I've tried going onto the slaves and manually restarting the
tasktrackers with hadoop-daemon.sh, and also turning on job restarting
in the site conf and then running stop-mapred/start-mapred.  The
trackers start up and try to clean up and get going again, but they then
just get lost again.

Here's some error output from the master jobtracker:

2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed 
completed task 'attempt_200902021252_0002_r_05_1' from 
'tracker_darling:localhost.localdomain/127.0.0.1:58336'
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: 
attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching 
task attempt_200902021252_0002_m_004592_1 timed out.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: 
attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching 
task attempt_200902021252_0002_m_004582_1 timed out.
2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 
'duplicate' heartbeat from 
'tracker_cheyenne:localhost.localdomain/127.0.0.1:52769'; resending the 
previous 'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 
'duplicate' heartbeat from 
'tracker_tigris:localhost.localdomain/127.0.0.1:52808'; resending the previous 
'lost' response
2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 
'duplicate' heartbeat from 
'tracker_monocacy:localhost.localdomain/127.0.0.1:54464'; Resending the 
previous 'lost' response
2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 
'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744'; 
resending the previous 'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 
'duplicate' heartbeat from 
'tracker_rhone:localhost.localdomain/127.0.0.1:45749'; resending the previous 
'lost' response
2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9
on 54311 caught: java.lang.NullPointerException
at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java
:48)
at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.ja
va:101)
at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:1
59)
at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)

2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status from 
unknown Tracker : tracker_monocacy:localhost.localdomain/127.0.0.1:54464

And from a slave:

2009-02-02 13:26:39,440 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: 
src: 129.6.101.18:50060, dest: 129.6.101.12:37304, bytes: 6, op: 
MAPRED_SHUFFLE, cliID: attempt_200902021252_0002_m_000111_0
2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught 
exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on 
local exception: null
at org.apache.hadoop.ipc.Client.call(Client.java:699)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
at 
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
at 
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1678)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2698)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
at 

Re: HDFS issues in 0.17.2.1 and 0.19.0 versions

2009-02-02 Thread Konstantin Shvachko

Are you sure you were using 0.19 not 0.20 ?

For 0.17, please check that the configuration file hadoop-site.xml exists
in your configuration directory, is not empty, and points to HDFS rather
than the local file system, which is what it points to by default.
In 0.17 all config variables were in a common file, and 0.19 was the same.
0.20 changed that, so now we have hdfs-site.xml, core-site.xml, and mapred-site.xml.
See
https://issues.apache.org/jira/browse/HADOOP-4631

Hope this helps.
--Konstantin

Shyam Sarkar wrote:

Hello,

I am trying to understand the clustering inside 0.17.2.1 as opposed to
0.19.0 versions. I am trying to
create a directory inside 0.17.2.1  HDFS but it creates in Linux FS.
However, I can do that in 0.19.0
without any problem.

Can someone suggest what should I do for 0.17.2.1 so that I can create
directory in HDFS?

Thanks,
shyam.s.sar...@gmail.com



Book: Hadoop-The Definitive Guide

2009-02-02 Thread Mark Kerzner
Hi,

I am going through the examples in this book (which I obtained as an early
draft from Safari), and they all work, with occasional fixes. However, the
SequenceFileWriteDemo, even though it runs without an error, does not show
the created file when I use this command

hadoop fs -ls /

I remember reading somewhere that the file needs to be at least 64 M to be
seen, or something to that effect. How can I see the created file?

If you want the code (with a few minor changes)

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {

    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                    key.getClass(), value.getClass());

            int n = 1;
            for (int i = 0; i < n; i++) {
                key.set(n - i);
                value.set(DATA[i % DATA.length]);
                if (i % 100 == 0) {
                    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(),
                            key, value);
                }
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}


Thank you,
Mark


Re: Hadoop Streaming Semantics

2009-02-02 Thread Amareshwari Sriramadasu

S D wrote:

Thanks for your response. I'm using version 0.19.0 of Hadoop.
I tried your suggestion. Here is the line I use to invoke Hadoop

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \\
   -input /user/hadoop/hadoop-input/inputFile.txt \\
   -output /user/hadoop/hadoop-output \\
   -mapper map-script.sh \\
   -file map-script.sh \\
   -file additional-script.rb \\ # Called by map-script.sh
   -file utils.rb \\
   -file env.sh \\
   -file aws-s3-credentials-file \\# For permissions to use AWS::S3
   -jobconf mapred.reduce.tasks=0 \\
   -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

Everything works fine if the -inputformat switch is not included but when I
include it I get the following message:
   ERROR streaming.StreamJob: Job not Successful!
and a Runtime exception shows up in the jobtracker log:
   PipeMapRed.waitOutputThreads(): subprocess failed with code 1

My map functions read each line of the input file and create a directory
(one for each line) on Hadoop (in our case S3 Native) in which corresponding
data is produced and stored. The name of the created directories are based
on the contents of the corresponding line. When I include the -inputformat
line above I've noticed that instead of the directories I'm expecting (named
after the data found in the input file), the directories are given seemingly
arbitrary numeric names; e.g., when the input file contained four lines of
data, the directories were named: 0, 273, 546 and 819.

  

LineRecordReader reads the line as the VALUE, and the KEY is the line's byte offset in the file.
It looks like your directories are getting named with the KEY. But I don't see
any reason for that, because it works fine with TextInputFormat
(both TextInputFormat and NLineInputFormat use LineRecordReader).


-Amareshwari
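
As a small illustration (plain Java, no Hadoop required; it assumes '\n' line
endings and single-byte characters), the numbers 0, 273, 546 and 819 are the byte
offsets at which each input line starts, which is exactly what LineRecordReader
hands out as the key for every line:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PrintLineOffsets {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            long offset = 0;
            String line;
            while ((line = in.readLine()) != null) {
                // Print what LineRecordReader would emit: (byte offset, line).
                System.out.println(offset + "\t" + line);
                offset += line.length() + 1;   // +1 for the trailing '\n'
            }
        } finally {
            in.close();
        }
    }
}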

Any thoughts?

John

On Sun, Feb 1, 2009 at 11:00 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:

  

Which version of hadoop are you using?

You can directly use -inputformat
org.apache.hadoop.mapred.lib.NLineInputFormat for your streaming job. You
need not include it in your streaming jar.
-Amareshwari


S D wrote:



Thanks for your response Amereshwari. I'm unclear on how to take advantage
of NLineInputFormat with Hadoop Streaming. Is the idea that I modify the
streaming jar file (contrib/streaming/hadoop-version-streaming.jar) to
include the NLineInputFormat class and then pass a command line
configuration param to indicate that NLineInputFormat should be used? If
this is the proper approach, can you point me to an example of what kind
of
param should be specified? I appreciate your help.

Thanks,
SD

On Thu, Jan 29, 2009 at 10:49 PM, Amareshwari Sriramadasu 
amar...@yahoo-inc.com wrote:



  

You can use NLineInputFormat for this, which splits one line (N=1, by
default) as one split.
So, each map task processes one line.
See

http://hadoop.apache.org/core/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html

-Amareshwari

S D wrote:





Hello,

I have a clarifying question about Hadoop streaming. I'm new to the list
and
didn't see anything posted that covers my questions - my apologies if I
overlooked a relevant post.

I have an input file consisting of a list of files (one per line) that
need
to be processed independently of each other. The duration for processing
each file is significant - perhaps an hour each. I'm using Hadoop
streaming
without a reduce function to process each file and save the results
(back
to
S3 native in my case). To handle to long processing time of each file
I've
set mapred.task.timeout=0 and I have a pretty straight forward Ruby
script
reading from STDIN:

STDIN.each_line do |line|
 # Get file from contents of line
 # Process file (long running)
end

Currently I'm using a cluster of 3 workers in which each worker can have
up
to 2 tasks running simultaneously. I've noticed that if I have a single
input file with many lines (more than 6 given my cluster), then not all
workers will be allocated tasks; I've noticed two workers being
allocated
one task each and the other worker sitting idly. If I split my input
file
into multiple files (at least 6) then all workers will be immediately
allocated the maximum number of tasks that they can handle.

My interpretation on this is fuzzy. It seems that Hadoop streaming will
take
separate input files and allocate a new task per file (up to the maximum
constraint) but if given a single input file it is unclear as to whether
a
new task is allocated per file or line. My understanding of Hadoop Java
is
that (unlike Hadoop streaming) when given a single input file, the file
will
be broken up into separate lines and the maximum number of map tasks
will
automagically be allocated to handle the lines of the file (assuming the
use
of TextInputFormat).

Can someone clarify this?

Thanks,
SD





  



  



  




Re: Hadoop's reduce tasks are freezes at 0%.

2009-02-02 Thread jason hadoop
A reduce stall at 0% implies that the map tasks are not outputting any
records via the output collector.
You need to go look at the task tracker and the task logs on all of your
slave machines, to see if anything that seems odd appears in the logs.
On the tasktracker web interface detail screen for your job:
Are all of the map tasks finished?
Are any of the map tasks started?
Are there any Tasktracker nodes to service your job?

On Sun, Feb 1, 2009 at 11:41 PM, Kwang-Min Choi kmbest.c...@samsung.comwrote:

 I'm newbie in Hadoop.
 and i'm trying to follow Hadoop Quick Guide at hadoop homepage.
 but, there are some problems...
 Downloading, unzipping hadoop is done.
 and ssh successfully operate without password phrase.

 once... I execute grep example attached to Hadoop...
 map task is ok. it reaches 100%.
 but reduce task freezes at 0% without any error message.
 I've waited it for more than 1 hour, but it still freezes...
 same job in standalone mode is well done...

 i tried it with version 0.18.3 and 0.17.2.1.
 all of them had same problem.

 could help me to solve this problem?

 Additionally...
 I'm working on cloud-infra of GoGrid(Redhat).
 So, disk's space  health is OK.
 and, i've installed JDK 1.6.11 for linux successfully.


 - KKwams



Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-02 Thread jason hadoop
If you have to do a time based solution, for now, simply close the file and
stage it, then open a new file.
Your reads will have to deal with the fact the file is in multiple parts.
Warning: Datanodes get pokey if they have large numbers of blocks, and the
quickest way to do this is to create a lot of small files.
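
A minimal sketch (not from the thread; class and path names are made up) of that
close-and-roll approach: every interval the current SequenceFile is closed, so its
contents become visible and usable, and a new part file is started. Readers then
treat the directory of parts as the dataset.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceFileWriter {
    private final FileSystem fs;
    private final Configuration conf;
    private final Path dir;
    private final long rollIntervalMs;

    private SequenceFile.Writer writer;
    private long lastRoll;
    private int part;

    public RollingSequenceFileWriter(FileSystem fs, Configuration conf,
                                     Path dir, long rollIntervalMs) throws IOException {
        this.fs = fs;
        this.conf = conf;
        this.dir = dir;
        this.rollIntervalMs = rollIntervalMs;
        roll();                               // open the first part file
    }

    public synchronized void append(Text key, Text value) throws IOException {
        if (System.currentTimeMillis() - lastRoll >= rollIntervalMs) {
            roll();                           // close the old part, open a new one
        }
        writer.append(key, value);
    }

    private void roll() throws IOException {
        if (writer != null) {
            writer.close();                   // data becomes readable once the file is closed
        }
        Path path = new Path(dir, "part-" + (part++));
        writer = SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        lastRoll = System.currentTimeMillis();
    }

    public synchronized void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }
}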

On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote:

 Let me rephrase this problem... as stated below, when I start writing to a
 SequenceFile from an HDFS client, nothing is visible in HDFS until I've
 written 64M of data. This presents three problems: fsck reports the file
 system as corrupt until the first block is finally written out, the
 presence
 of the file (without any data) seems to blow up my mapred jobs that try to
 make use of it under my input path, and finally, I want to basically flush
 every 15 minutes or so so I can mapred the latest data.
 I don't see any programmatic way to force the file to flush in 17.2.
 Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that
 not do what I think it does? What controls the 64M limit, anyway? Is it
 dfs.checkpoint.size or dfs.block.size? Is the buffering happening on
 the
 client, or on data nodes? Or in the namenode?

 It seems really bad that a SequenceFile, upon creation, is in an unusable
 state from the perspective of a mapred job, and also leaves fsck in a
 corrupt state. Surely I must be doing something wrong... but what? How can
 I
 ensure that a SequenceFile is immediately usable (but empty) on creation,
 and how can I make things flush on some regular time interval?

 Thanks,
 Brian


 On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote:

  I have a SequenceFile.Writer that I obtained via
 SequenceFile.createWriter
  and write to using append(key, value). Because the writer volume is low,
  it's not uncommon for it to take over a day for my appends to finally be
  flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
  Because I am running map/reduce tasks on this data multiple times a day,
 I
  want to flush the sequence file so the mapred jobs can pick it up when
  they run.
  What's the right way to do this? I'm assuming it's a fairly common use
  case. Also -- are writes to the sequence files atomic? (e.g. if I am
  actively appending to a sequence file, is it always safe to read from
 that
  same file in a mapred job?)
 
  To be clear, I want the flushing to be time based (controlled explicitly
 by
  the app), not size based. Will this create waste in HDFS somehow?
 
  Thanks,
  Brian
 
 



Re: problem with completion notification from block movement

2009-02-02 Thread jason hadoop
This can be made significantly worse by your underlying host file system and
the disks that support it.
Disabling atime updates via noatime should buy you an immediate 10% gain on
the block report time.
Not using a RAID 5 controller under the covers should buy you a chunk too.

I haven't tried it but it may be that ext3 with file_type and dir_index
might be faster.
Disable journals on your file systems, or put the journal on a different
device.

The goal of the above set of operations is to try to make it faster for an
operation to get file system meta data for each file in a large directory.

Using the deadline IO scheduler might help, or otherwise tuning the OS level
access to prioritize small reads over large writes.

noatime reduces the number of writes being generated by the scan
no raid 5 reduces the number of reads needed for ancillary write operations.

In general the collected wisdom says to use multiple individual drives for
the block storage with a comma separated list for the dfs.data.dir
parameter, were each entry on the list is on a separate drive, that
preferably only does Datanode service.

<property>
  <name>dfs.data.dir</name>
  <value>/drive1,/drive2,/drive3,/drive4</value>
</property>

Before I left Attributor, there was a thought of running a continuous find on
the dfs.data.dir to try to force the kernel to keep the inodes in memory,
but I think they abandoned that strategy.



On Mon, Feb 2, 2009 at 10:23 AM, Karl Kleinpaste k...@conviva.com wrote:

 On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote:
  The Datanode's use multiple threads with locking and one of the
  assumptions is that the block report (1ce per hour by default) takes
  little time. The datanode will pause while the block report is running
  and if it happens to take a while weird things start to happen.

 Thank you for responding, this is very informative for us.

 Having looked through the source code with a co-worker regarding
 periodic scan and then checking the logs once again, we find that we are
 finding reports of this sort:

 BlockReport of 1158499 blocks got processed in 308860 msecs
 BlockReport of 1159840 blocks got processed in 237925 msecs
 BlockReport of 1161274 blocks got processed in 177853 msecs
 BlockReport of 1162408 blocks got processed in 285094 msecs
 BlockReport of 1164194 blocks got processed in 184478 msecs
 BlockReport of 1165673 blocks got processed in 226401 msecs

 The 3rd of these exactly straddles the particular example timeline I
 discussed in my original email about this question.  I suspect I'll find
 more of the same as I look through other related errors.

 --karl




Re: hadoop to ftp files into hdfs

2009-02-02 Thread jason hadoop
If you have a large number of ftp urls spread across many sites, simply set
that file to be your hadoop job input, and force the input split to be a
size that gives you good distribution across your cluster.
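
A minimal sketch (not from the thread; the destination path and class name are
made up) of that idea with the old mapred API: the job input is the file listing
the FTP URLs, NLineInputFormat (or a small split size) spreads the lines across
map tasks, and each map task fetches its URLs and copies them straight into HDFS.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FtpFetchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private JobConf conf;

    @Override
    public void configure(JobConf conf) {
        this.conf = conf;
    }

    // key = byte offset of the line, value = one FTP URL per line
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String url = value.toString().trim();
        if (url.length() == 0) {
            return;
        }
        // Name the HDFS copy after the last path component of the URL.
        Path dest = new Path("/data/ftp-mirror",
                new Path(new URL(url).getPath()).getName());
        FileSystem fs = dest.getFileSystem(conf);

        InputStream in = new URL(url).openStream();   // JDK's built-in FTP handler
        FSDataOutputStream out = fs.create(dest);
        IOUtils.copyBytes(in, out, conf, true);       // closes both streams

        output.collect(new Text(url), new Text(dest.toString()));
        reporter.progress();
    }
}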


On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote:

 Does any one have a good suggestion on how to submit a hadoop job that
 will split the ftp retrieval of a number of files for insertion into
 hdfs?  I have been searching google for suggestions on this matter.
 Steve



Re: My tasktrackers keep getting lost...

2009-02-02 Thread jason hadoop
When I was at Attributor we experienced periodic odd XFS hangs that would
freeze up the Hadoop Server processes resulting in them going away.
Sometimes XFS would deadlock all writes to the log file and the server would
freeze trying to log a message. Can't even JSTACK the jvm.
We never got any traction on resolving the XFS deadlocks and simply rebooted
the machines when the problem occurred.

On Mon, Feb 2, 2009 at 7:09 PM, Ian Soboroff ian.sobor...@nist.gov wrote:


 I hope someone can help me out.  I'm getting started with Hadoop,
 have written the firt part of my project (a custom InputFormat), and am
 now using that to test out my cluster setup.

 I'm running 0.19.0.  I have five dual-core Linux workstations with most
 of a 250GB disk available for playing, and am controlling things from my
 Mac Pro.  (This is not the production cluster, that hasn't been
 assembled yet.  This is just to get the code working and figure out the
 bumps.)

 My test data is about 18GB of web pages, and the test app at the moment
 just counts the number of web pages in each bundle file.  The map jobs
 run just fine, but when it gets into the reduce, the TaskTrackers all
 get lost to the JobTracker.  I can't see why, because the TaskTrackers
 are all still running on the slaves.  Also, the jobdetails URL starts
 returning an HTTP 500 error, although other links from that page still
 work.

 I've tried going onto the slaves and manually restarting the
 tasktrackers with hadoop-daemon.sh, and also turning on job restarting
 in the site conf and then running stop-mapred/start-mapred.  The
 trackers start up and try to clean up and get going again, but they then
 just get lost again.

 Here's some error output from the master jobtracker:

 2009-02-02 13:39:40,904 INFO org.apache.hadoop.mapred.JobTracker: Removed
 completed task 'attempt_200902021252_0002_r_05_1' from
 'tracker_darling:localhost.localdomain/127.0.0.1:58336'
 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_200902021252_0002_m_004592_1 is 796370 ms debug.
 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
 task attempt_200902021252_0002_m_004592_1 timed out.
 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_200902021252_0002_m_004582_1 is 794199 ms debug.
 2009-02-02 13:39:40,905 INFO org.apache.hadoop.mapred.JobTracker: Launching
 task attempt_200902021252_0002_m_004582_1 timed out.
 2009-02-02 13:41:22,271 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
 'duplicate' heartbeat from 'tracker_cheyenne:localhost.localdomain/
 127.0.0.1:52769'; resending the previous 'lost' response
 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
 'duplicate' heartbeat from 'tracker_tigris:localhost.localdomain/
 127.0.0.1:52808'; resending the previous 'lost' response
 2009-02-02 13:41:22,272 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
 'duplicate' heartbeat from 'tracker_monocacy:localhost.localdomain/
 127.0.0.1:54464'; Resending the previous 'lost' response
 2009-02-02 13:41:22,298 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
 'duplicate' heartbeat from 'tracker_129.6.101.41:127.0.0.1/127.0.0.1:58744';
 resending the previous 'lost' response
 2009-02-02 13:41:22,421 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
 'duplicate' heartbeat from 'tracker_rhone:localhost.localdomain/
 127.0.0.1:45749'; resending the previous 'lost' response
 2009-02-02 13:41:22,421 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9
 on 54311 caught: java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask.write(MapTask.java:123)
    at org.apache.hadoop.mapred.LaunchTaskAction.write(LaunchTaskAction.java:48)
    at org.apache.hadoop.mapred.HeartbeatResponse.write(HeartbeatResponse.java:101)
    at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:159)
    at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:70)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:907)

 2009-02-02 13:41:27,275 WARN org.apache.hadoop.mapred.JobTracker: Status
 from unknown Tracker : tracker_monocacy:localhost.localdomain/
 127.0.0.1:54464

 And from a slave:

 2009-02-02 13:26:39,440 INFO
 org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 129.6.101.18:50060,
 dest: 129.6.101.12:37304, bytes: 6, op: MAPRED_SHUFFLE, cliID:
 attempt_200902021252_0002_m_000111_0
 2009-02-02 13:41:40,165 ERROR org.apache.hadoop.mapred.TaskTracker: Caught
 exception: java.io.IOException: Call to rogue/129.6.101.41:54311 failed on
 local exception: null
    at org.apache.hadoop.ipc.Client.call(Client.java:699)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
    at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1164)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:997)

Re: My tasktrackers keep getting lost...

2009-02-02 Thread Sagar Naik

Can you post the output from
hadoop-argus-hostname-jobtracker.out?

-Sagar


reading data from multiple output files into a single Map method.

2009-02-02 Thread some speed
Hi,


I am implementing a chained M-R job in Java. If I use multiple reducers,
the output ends up spread across several files on the DFS.
How can I read these files into the Map method of the next job?

Another doubt I have: is it possible to keep appending to the same
output file while implementing an iterative M-R job, i.e., every M-R job
appends its result to the same output file?


Thanks,

Ketan


Re: reading data from multiple output files into a single Map method.

2009-02-02 Thread jason hadoop
Do you really want to have a single task process all of the reduce outputs?

If you want all of your output processed by a set of map tasks, you can set
the output directory of your previous job to be the input directory of your
next job, making sure the framework knows how to read the key/value pairs
from your reduce output.

You could set the number of reduces to 1 in your original job, and you will
get a single output file from the single reduce task that is run.
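
A minimal sketch of that chaining (old mapred API; IdentityMapper and
IdentityReducer stand in for the real map and reduce logic, and the three
paths come from the command line): job 1 writes SequenceFile output, and
job 2 points its input format and input path at job 1's output directory.
The commented-out setNumReduceTasks(1) line is the single-output-file variant.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // Job 1: write its reduce output as SequenceFiles (one part file per reducer).
    JobConf job1 = new JobConf(ChainedJobs.class);
    job1.setJobName("pass-1");
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    job1.setMapperClass(IdentityMapper.class);     // stand-in for the real map logic
    job1.setReducerClass(IdentityReducer.class);   // stand-in for the real reduce logic
    job1.setOutputFormat(SequenceFileOutputFormat.class);
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    // job1.setNumReduceTasks(1);                  // variant: force a single output file
    JobClient.runJob(job1);

    // Job 2: read every part file that job 1 produced, straight from its output dir.
    JobConf job2 = new JobConf(ChainedJobs.class);
    job2.setJobName("pass-2");
    job2.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    JobClient.runJob(job2);
  }
}

Because job 2's input path is job 1's whole output directory, it does not
matter how many reducers job 1 used; every part file becomes map input.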

On Mon, Feb 2, 2009 at 9:34 PM, some speed speed.s...@gmail.com wrote:

 Hi,


 I am implementing a chained M-R job in Java. If I use multiple reducers,
 the output ends up spread across several files on the DFS.
 How can I read these files into the Map method of the next job?

 Another doubt I have: is it possible to keep appending to the same
 output file while implementing an iterative M-R job, i.e., every M-R job
 appends its result to the same output file?


 Thanks,

 Ketan



Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
Thanks, Tom
The reason the content looked different was that I converted one sample to
Base64 byte-by-byte, and converted the other from the whole byte array at once
(strange that they produce different outputs).
Thanks for the good points.

Rasit
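
For reference, a minimal sketch of this kind of conversion (the two paths come
from the command line, and it reads the whole file into memory, so it is for
small files only): one local binary file written as a single (NullWritable,
BytesWritable) record. As Tom notes below, BytesWritable writes its length
before the bytes, and the SequenceFile adds its own header and record framing,
so a dump of the result will not match a dump of the original byte for byte.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

// Writes a single local binary file into a SequenceFile on HDFS
// as one (NullWritable, BytesWritable) record.
public class BinaryToSequenceFile {
  public static void main(String[] args) throws Exception {
    File local = new File(args[0]);      // local binary file
    Path seqPath = new Path(args[1]);    // target SequenceFile in HDFS

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // readFully avoids the partial read you can get from a bare read(byte[]).
    byte[] bytes = new byte[(int) local.length()];
    DataInputStream in = new DataInputStream(new FileInputStream(local));
    in.readFully(bytes);
    in.close();

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, seqPath, NullWritable.class, BytesWritable.class);
    try {
      writer.append(NullWritable.get(), new BytesWritable(bytes));
    } finally {
      writer.close();
    }
  }
}

So a Base64 dump of the original file and one of the SequenceFile are expected
to differ even when the payload bytes inside the record are intact.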

2009/2/2 Tom White t...@cloudera.com

 The SequenceFile format is described here:

 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html
 .
 The format of the keys and values depends on the serialization classes
 used. For example, BytesWritable writes out the length of its byte
 array followed by the actual bytes in the array (see the write()
 method in BytesWritable).

 Hope this helps.
 Tom

 On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote:
  I tried to use SequenceFile.Writer to convert my binaries into SequenceFiles.
  I read the binary data with FileInputStream, getting all the bytes with
  reader.read(byte[]), and wrote them to a file with SequenceFile.Writer, using
  NullWritable as the key and BytesWritable as the value. But the content
  changes (I can see that by converting to Base64):
 
  Binary File:
  73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65
  103 54 81 65 65 97 81 65 65 65 81 ...

  Sequence File:
  73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67
  69 77 65 52 80 86 67 65 73 68 114 ...
 
  Thanks for any points..
  Rasit
 
  2009/2/2 Rasit OZDAS rasitoz...@gmail.com
 
  Hi,
  I tried to use SequenceFileInputFormat, for this I appended SEQ as
 first
  bytes of my binary files (with hex editor).
  but I get this exception:
 
  A record version mismatch occured. Expecting v6, found v32
  at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
  at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
  at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
  at
  org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
  at
 
 org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
  at
 
 org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
  at org.apache.hadoop.mapred.Child.main(Child.java:155)
 
  What could it be? Is it not enough just to add SEQ to binary files?
  I use Hadoop v.0.19.0 .
 
  Thanks in advance..
  Rasit
 
 
  different *version* of *Hadoop* between your server and your client.
 
  --
  M. Raşit ÖZDAŞ
 
 
 
 
  --
  M. Raşit ÖZDAŞ
 




-- 
M. Raşit ÖZDAŞ