how to preserve original line order?

2009-03-13 Thread Roldano Cattoni
The task should be simple, I want to put in uppercase all the words of a
(large) file.

I tried the following:
 - streaming mode
 - the mapper is a Perl script that puts each line in uppercase (number of
   mappers > 1)
 - no reducer (number of reducers set to zero)

It works fine except for line order which is not preserved.

How to preserve the original line order?

I would appreciate any suggestion.

  Roldano



Re: how to preserve original line order?

2009-03-13 Thread Miles Osborne
associate with each line an identifier (eg line number) and afterwards
resort the data by that
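
(A minimal sketch of that approach, not from the original thread: TextInputFormat
already hands the mapper each line's byte offset as a LongWritable key, so a
Java mapper can keep that key, uppercase the line, and let the shuffle sort by
offset. Run it with IdentityReducer and a single reducer, or concatenate the
part files in partition order, and the output comes back in the original order;
the offset key can be stripped afterwards. The class name is hypothetical.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UppercaseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  // Key = byte offset of the line in the input file, value = uppercased line.
  // Sorting on this key during the shuffle restores the original line order.
  public void map(LongWritable offset, Text line,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(offset, new Text(line.toString().toUpperCase()));
  }
}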

Miles

2009/3/13 Roldano Cattoni catt...@fbk.eu:
 The task should be simple, I want to put in uppercase all the words of a
 (large) file.

 I tried the following:
  - streaming mode
  - the mapper is a Perl script that puts each line in uppercase (number of
   mappers > 1)
  - no reducer (number of reducers set to zero)

 It works fine except for line order which is not preserved.

 How to preserve the original line order?

 I would appreciate any suggestion.

  Roldano





-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


csv input format handling and mapping

2009-03-13 Thread Stefan Podkowinski
Hi

Can anyone share his experience or solution for the following problem?
I'm having to deal with a lot of different file formats, most of them csv.
Each of them shares similar semantics, i.e. fields in file A exist in
file B as well.
What I'm not sure of is the exact index of the field in the csv file.
Fields in file A may also have different names for the same thing as in file B.

Simplified example:

affiliateA.csv:
Date; Clicks; Views; Orders
2009-03-10; 10; 20; 4

affiliateB.csv
Date; Orders; Impressions; Clicks
13/03/09; 40; 2000; 1000


Possible mapping file:

<field-mapping>
  <field id="date" type="java.util.Date"/>
  <field id="clicks" type="java.lang.Integer"/>
  <field id="views" type="java.lang.Integer"/>

  <file>
     <path pattern="/affiliateA/*.csv"/>
     <format type="csv">
        <seperator>\t</seperator>
        <quotes>&quot;</quotes>
     </format>
     <columns>
        <column name="Date" alias="date">
           <format>yyyy-MM-dd</format>
        </column>
        <column name="Clicks" alias="clicks"/>
     </columns>
  </file>

  <file>
     <path pattern="/affiliateB/*.csv"/>
     <format type="csv">
        <seperator>;</seperator>
     </format>
     <columns>
        <column index="1" alias="date">
           <format>dd/MM/yyy</format>
        </column>
        <column index="2" alias="clicks"/>
     </columns>
  </file>
</field-mapping>


What I'd like to be able is to use this external descriptor for each
file with a custom hadoop InputFormat.
Instead of a line of text, my MR values would be a Map containing the
parsed values mapped to the field IDs.

map(key, fields) {
  Date date = fields.get('date');
  Integer clicks = fields.get('clicks');
}

This would allow me to uncouple my MR job from the actual file format
and would also move all the csv handling code out of my mappers.
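
(A rough sketch of one way to do this; as far as I know it is not an existing
Hadoop contrib. It wraps the stock line-based reader and emits a MapWritable
keyed by the field aliases. The alias-to-column mapping is hard-coded into the
constructor here, where the real thing would be parsed from the XML descriptor
above; the class name is hypothetical.)

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class CsvFieldRecordReader implements RecordReader<LongWritable, MapWritable> {
  private final RecordReader<LongWritable, Text> lines;  // e.g. a LineRecordReader
  private final Text rawLine = new Text();
  private final String separator;                         // regex, e.g. ";" or "\t"
  private final Map<Integer, String> aliasByIndex;        // column index -> field alias

  public CsvFieldRecordReader(RecordReader<LongWritable, Text> lines,
                              String separator, Map<Integer, String> aliasByIndex) {
    this.lines = lines;
    this.separator = separator;
    this.aliasByIndex = aliasByIndex;
  }

  public boolean next(LongWritable key, MapWritable fields) throws IOException {
    if (!lines.next(key, rawLine)) {
      return false;
    }
    fields.clear();
    String[] cols = rawLine.toString().split(separator);
    for (Map.Entry<Integer, String> e : aliasByIndex.entrySet()) {
      int i = e.getKey();
      if (i < cols.length) {
        // Values are kept as Text here; type conversion per the descriptor
        // (Date, Integer, ...) would happen in a fuller implementation.
        fields.put(new Text(e.getValue()), new Text(cols[i].trim()));
      }
    }
    return true;
  }

  public LongWritable createKey() { return lines.createKey(); }
  public MapWritable createValue() { return new MapWritable(); }
  public long getPos() throws IOException { return lines.getPos(); }
  public float getProgress() throws IOException { return lines.getProgress(); }
  public void close() throws IOException { lines.close(); }
}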

Does anyone know if such a solution already exists for hadoop? Any thoughts?

Stefan


Reduce task going away for 10 seconds at a time

2009-03-13 Thread Doug Cook

Hi folks,

I've been debugging a severe performance problems with a Hadoop-based
application (a highly modified version of Nutch). I've recently upgraded to
Hadoop 0.19.1 from a much, much older version, and a reduce that used to
work just fine is now running orders of magnitude more slowly. 

From the logs I can see that progress of my reduce stops for periods that
average almost exactly 10 seconds (with a very narrow distribution around 10
seconds), and it does so in various places in my code, but more or less in
proportion to how much time I'd expect the task would normally spend in that
particular place in the code, i.e. the behavior seems like my code is
randomly being interrupted for 10 seconds at a time. 

I'm planning to keep digging, but thought that these symptoms might sound
familiar to someone on this list. Ring any bells? Your help much
appreciated. 

Thanks!

Doug Cook



Re: how to upload files by web page

2009-03-13 Thread nitesh bhatia
Hi
Even I was looking for a solution to the same problem. I haven't tested
but I think we can use Globus Toolkit's GSI-FTP feature for this work.
 In the RSL config file one can write the hdfs copy command to copy
the file to hdfs. I've used this feature to upload and process file
from Globus to Sun N1 Grid Engine.
--nitesh
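
(For the "save files to HDFS" step mentioned in the quoted replies below, here
is a minimal sketch using the FileSystem API, which is what FsShell uses
internally: copy an arbitrary InputStream, e.g. an uploaded file or a URL
opened server-side, into HDFS. The class name, destination path and the
assumption that fs.default.name is set in the loaded configuration are all
hypothetical.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUploader {
  public static void copyToHdfs(InputStream in, String dst) throws IOException {
    Configuration conf = new Configuration();   // picks up hadoop-site.xml / fs.default.name
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path(dst));
    try {
      IOUtils.copyBytes(in, out, 4096, false);  // stream in chunks, don't buffer the whole file
    } finally {
      out.close();
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. fetch a URL submitted through a web form and store it in HDFS
    InputStream in = new URL(args[0]).openStream();
    copyToHdfs(in, args[1]);
  }
}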


2009/3/10 Yang Zhou yangzhou.e...@gmail.com:
 :-) I am afraid you have to solve both of your questions yourself.
 1. submit the urls to your own servlet.
 2. develop your own codes to read input bytes from those urls and save them
 to HDFS.
 There is no ready-made tool.

 Good Luck.

 2009/3/10 李睿 lrvb...@gmail.com

 Thanks:)



 Could you tell more detail about your solution?

 I have some questions below:

 1,where can  I submit the urls to ?

 2,what is the backend service? Does it belong to HDFS?






 2009/3/10 Yang Zhou yangzhou.e...@gmail.com

  Hi,
 
  I have done that before.
 
  My solution is :
  1. submit some FTP/SFTP/GridFTP urls of what you want to upload
  2. backend service will fetch those files/directories from FTP to HDFS
  directly.
 
  Of course you can upload those files to the web server machine and then
  move
  them to HDFS. But since Hadoop is designed to process vast amounts of
 data,
  I do think my solution is more efficient. :-)
 
  You can find how to make directory and save files to HDFS in the source
  code
  of org.apache.hadoop.fs.FsShell.
  2009/3/9 lrvb...@gmail.com
 
  
  
   Hi, all,
  
 I'm new to HDFS and want to upload files by JSP.
  
    Are there some APIs I can use?  Are there some demos?
  
 Thanks for your help:)
  
  
 





-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: tuning performance

2009-03-13 Thread Allen Wittenauer



On 3/13/09 11:25 AM, Vadim Zaliva kroko...@gmail.com wrote:

    When you stripe you automatically make every disk in the system have the
 same speed as the slowest disk.  In our experience, systems are more likely
 to have a 'slow' disk than a dead one, and detecting that is really,
 really hard.  In a distributed system, that multiplier effect can have
 significant consequences on the whole grid's performance.
 
 All disks are the same, so there is no speed difference.

There will be when they start to fail. :)




Re: Creating Lucene index in Hadoop

2009-03-13 Thread Ning Li
Or you can check out the index contrib. The difference between the two is that:
  - In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process does
not use map/reduce.
  - In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards.

Cheers,
Ning


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
 you can see the nutch code.

 2009/3/13 Mark Kerzner markkerz...@gmail.com

 Hi,

 How do I allow multiple nodes to write to the same index file in HDFS?

 Thank you,
 Mark




Hadoop Upgrade Wiki

2009-03-13 Thread Mayuran Yogarajah
Step 8 of the upgrade process mentions copying the 'edits' and 'fsimage'
files to a backup directory.  After step 19 it says:

'In case of failure the administrator should have the checkpoint files
in order to be able to repeat the procedure from the appropriate point
or to restart the old version of Hadoop.'


Is this different from running 'start-dfs.sh -rollback'?
I'm not sure if the Wiki is outdated or not.  If it's the same, then step
8 can be skipped altogether, I'm guessing.

thanks


Cloudera Hadoop and Hive training now free online

2009-03-13 Thread Christophe Bisciglia
Hey there, today we released our basic Hadoop and Hive training
online. Access is free, and we address questions through Get
Satisfaction.

Many on this list are surely pros, but when you have friends trying to
get up to speed, feel free to send this along. We provide a VM so new
users can start doing the exercises right away.

http://www.cloudera.com/hadoop-training-basic

Cheers,
Christophe


Re: Building Release 0.19.1

2009-03-13 Thread Kevin Peterson
There may be a separate issue with Windows, but the error relating to:

[javac] import
org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;

is the eclipse 3.4 issue that is addressed by the patch in
https://issues.apache.org/jira/browse/HADOOP-3744


null value output from map...

2009-03-13 Thread Andy Sautins
 

   In writing a Map/Reduce job I ran across something I found a little
strange.  I have a situation where I don't need a value output from map.
If I set the value of the OutputCollector<Text, IntWritable> to
null I get the following exception:

 

java.lang.NullPointerException
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)

 

Looking at the code in MapTask.java (Hadoop 0.19.1) it makes sense
why it would throw the exception:

 

  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }

 

  I guess my question is as follows: is it a bad idea/not normal to
collect a null value in map?  Outputting from reduce through
TextOutputFormat with a null value works as I expect.  If the value is null,
only the key and a newline are output.

 

   Any thoughts would be appreciated.

  

 

   



Re: Cloudera Hadoop and Hive training now free online

2009-03-13 Thread Lukáš Vlček
Hi,
This is excellent!

Does any of these presentations deal specifically with processing tree and
graph data structures? I know that some basics can be found in the fifth
MapReduce lecture here (http://www.youtube.com/watch?v=BT-piFBP4fE)
presented by Aaron Kimball or here (
http://video.google.com/videoplay?docid=741403180270990805) by Barry Brumit
but something more detailed and comparing different approaches would be
really helpful.

Trees are often used in many algorithms (not only can they express hierarchy, but
they can be used to compress data and many other fancy things...). I think there
should be some knowledge about what works well and what does not in
connection with MapReduce and trees (or graphs). I am looking for this
information.

Regards,
Lukas

On Fri, Mar 13, 2009 at 9:42 PM, Christophe Bisciglia 
christo...@cloudera.com wrote:

 Hey there, today we released our basic Hadoop and Hive training
 online. Access is free, and we address questions through Get
 Satisfaction.

 Many on this list are surely pros, but when you have friends trying to
 get up to speed, feel free to send this along. We provide a VM so new
 users can start doing the exercises right away.

 http://www.cloudera.com/hadoop-training-basic

 Cheers,
 Christophe



Re: null value output from map...

2009-03-13 Thread Richa Khandelwal
You can initialize IntWritable with an empty constructor.
IntWritable i=new IntWritable();

On Fri, Mar 13, 2009 at 2:21 PM, Andy Sautins
andy.saut...@returnpath.netwrote:



    In writing a Map/Reduce job I ran across something I found a little
 strange.  I have a situation where I don't need a value output from map.
 If I set the value of the OutputCollector<Text, IntWritable> to
 null I get the following exception:

 java.lang.NullPointerException
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)

 Looking at the code in MapTask.java (Hadoop 0.19.1) it makes sense
 why it would throw the exception:

   if (value.getClass() != valClass) {
     throw new IOException("Type mismatch in value from map: expected "
                           + valClass.getName() + ", recieved "
                           + value.getClass().getName());
   }

  I guess my question is as follows: is it a bad idea/not normal to
  collect a null value in map?  Outputting from reduce through
  TextOutputFormat with a null value works as I expect.  If the value is null,
  only the key and a newline are output.



   Any thoughts would be appreciated.










-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763


Re: null value output from map...

2009-03-13 Thread Owen O'Malley


On Mar 13, 2009, at 3:56 PM, Richa Khandelwal wrote:


You can initialize IntWritable with an empty constructor.
IntWritable i=new IntWritable();


NullWritable is better for that application than IntWritable. It  
doesn't consume any space when serialized. *smile*
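
(A small sketch of that suggestion, assuming the 0.19-era mapred API; the class
name is hypothetical. The map output value type is declared as NullWritable and
NullWritable.get() is emitted instead of null, so no value bytes are written.)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KeyOnlyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    // NullWritable.get() is a singleton that serializes to zero bytes.
    output.collect(line, NullWritable.get());
  }
}

// In the JobConf: conf.setMapOutputValueClass(NullWritable.class);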


-- Owen


Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-13 Thread Chris K Wensel

fwiw, we have released a workaround for this issue in Cascading 1.0.5.

http://www.cascading.org/
http://cascading.googlecode.com/files/cascading-1.0.5.tgz

In short, Hadoop 0.19.0 and .1 instantiate the user's Reducer class and
subsequently call configure() when there is no intention to use the
class (during job/task cleanup tasks).


This clearly can cause havoc for users who use configure() to  
initialize resources used by the reduce() method.


Testing whether jobConf.getNumReduceTasks() is 0 inside the configure()
method seems to work out well.
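
(A minimal sketch of that workaround, with a hypothetical class name: bail out
of configure() when the job was submitted with zero reduce tasks, since
0.19.0/0.19.1 may instantiate and configure() the class for job/task cleanup
tasks. Only the configure() guard is shown; reduce() itself is unchanged.)

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public abstract class GuardedReducerBase extends MapReduceBase {
  protected boolean reallyReducing = false;

  @Override
  public void configure(JobConf job) {
    if (job.getNumReduceTasks() == 0) {
      // Instantiated only for job/task cleanup; don't open connections,
      // files or other resources that reduce() would normally need.
      return;
    }
    reallyReducing = true;
    // ... initialize resources used by reduce() here ...
  }
}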


branch-0.19 looks like it won't instantiate the Reducer class during
job/task cleanup tasks, so I expect the fix will make it into future releases.


cheers,

ckw

On Mar 12, 2009, at 8:20 PM, Amareshwari Sriramadasu wrote:

Are you seeing reducers getting spawned from the web UI? Then it is a
bug.
If not, there won't be reducers spawned; it could be a job-setup/job-cleanup
task that is running on a reduce slot. See HADOOP-3150 and
HADOOP-4261.

-Amareshwari
Chris K Wensel wrote:


May have found the answer, waiting on confirmation from users.

Turns out 0.19.0 and .1 instantiate the reducer class when the task  
is actually intended for job/task cleanup.


branch-0.19 looks like it resolves this issue by not instantiating  
the reducer class in this case.


I've got a workaround in the next maint release:
http://github.com/cwensel/cascading/tree/wip-1.0.5

ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:


Hey all

Have some users reporting intermittent spawning of Reducers when  
the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly  
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default  
reducer is IdentityReducer.


But since it should be ignored in the Mapper-only case, we don't
bother not setting the value, and it subsequently comes to one's
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is  
experiencing this issue.


note the issue seems to manifest with or without spec exec.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/





--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Re: Cloudera Hadoop and Hive training now free online

2009-03-13 Thread Christophe Bisciglia
Hey Lukas, we love hearing about what you'd like to see in training.
If you make a note on Get Satisfaction, we'll track it and keep you
apprised of updates:
http://getsatisfaction.com/cloudera/products/cloudera_hadoop_training

Christophe

On Fri, Mar 13, 2009 at 2:27 PM, Lukáš Vlček lukas.vl...@gmail.com wrote:
 Hi,
 This is excellent!

 Does any of these presentations deal specifically with processing tree and
 graph data structures? I know that some basics can be found in the fifth
 MapReduce lecture here (http://www.youtube.com/watch?v=BT-piFBP4fE)
 presented by Aaron Kimball or here (
 http://video.google.com/videoplay?docid=741403180270990805) by Barry Brumit
 but something more detailed and comparing different approaches would be
 really helpful.

 Trees are often used in many algorithms (not only can they express hierarchy, but
 they can be used to compress data and many other fancy things...). I think there
 should be some knowledge about what works well and what does not in
 connection with MapReduce and trees (or graphs). I am looking for this
 information.

 Regards,
 Lukas

 On Fri, Mar 13, 2009 at 9:42 PM, Christophe Bisciglia 
 christo...@cloudera.com wrote:

 Hey there, today we released our basic Hadoop and Hive training
 online. Access is free, and we address questions through Get
 Satisfaction.

 Many on this list are surely pros, but when you have friends trying to
 get up to speed, feel free to send this along. We provide a VM so new
 users can start doing the exercises right away.

 http://www.cloudera.com/hadoop-training-basic

 Cheers,
 Christophe




HTTP addressable files from HDFS?

2009-03-13 Thread David Michael

Hello

I realize that using HTTP, you can have a file in HDFS streamed - that
is, the servlet responds to the following request with
Content-Disposition: attachment, and a download is forced (at least from a
browser's perspective) like so:


http://localhost:50075/streamFile?filename=/somewhere/image.jpg

Is there another way to get at this file more directly from HTTP 'out  
of the box'?


I'm imagining something like:

http://localhost:50075/somewhere/image.jpg

Is this sort of exposure of the HDFS namespace something I need to  
write into a server myself?


Thanks in advance
David

On Mar 13, 2009, at 10:12 PM, S D wrote:

I've used wget with Hadoop Streaming without any problems. Based on the
error code you're getting, I suggest you make sure that you have the proper
write permissions for the directory in which Hadoop will process (e.g.,
download, convert, ...) on each of the task tracker machines. The location
where data is processed on each machine is controlled by the hadoop.tmp.dir
variable. The default value set in $HADOOP_HOME/conf/hadoop-default.xml is
/tmp/hadoop-${user.name}. Make sure that the user running hadoop has
permission to write to whatever directory you're using.

John

On Thu, Mar 12, 2009 at 10:02 PM, Nick Cen cenyo...@gmail.com wrote:


Hi All,

I am trying to use hadoop streaming with wget to simulate a
distributed downloader.
The command line i use is

./bin/hadoop jar -D mapred.reduce.tasks=0
contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output  
urlo

-mapper /usr/bin/wget -outputformat
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

But it thrown an exception

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
   at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295)
   at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519)
   at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
   at org.apache.hadoop.mapred.Child.main(Child.java:155)

can somebody point me to a reason why this happened? Thanks.



--
http://daily.appspot.com/food/





Re: HTTP addressable files from HDFS?

2009-03-13 Thread jason hadoop
wget http://namenode:port/data/filename
will return the file.

The namenode will redirect the http request to a datanode that has at least
some of the blocks in local storage to serve the actual request.
The key piece of course is the /data prefix on the file name.
port is the port that the webgui is running on, NOT the HDFS port;
commonly the port is 50070.
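
(A tiny sketch of consuming that URL from Java instead of wget; host, port and
path are placeholders for your cluster. The standard HttpURLConnection redirect
handling follows the namenode's redirect to a datanode.)

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class HttpHdfsRead {
  public static void main(String[] args) throws IOException {
    // The namenode web UI port (commonly 50070) serves /data/<path in HDFS>
    // and redirects to a datanode that holds blocks of the file.
    URL url = new URL("http://namenode:50070/data/somewhere/image.jpg");
    InputStream in = url.openStream();
    byte[] buf = new byte[8192];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      total += n;              // consume the stream; write it wherever it is needed
    }
    in.close();
    System.out.println("read " + total + " bytes");
  }
}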

On Fri, Mar 13, 2009 at 7:54 PM, David Michael david.mich...@gmail.comwrote:

 Hello

 I realize that using HTTP, you can have a file in HDFS streamed - that is,
 the servlet responds to the following request with Content-Disposition:
 attachment, and a download is forced (at least from a browser's perspective)
 like so:

 http://localhost:50075/streamFile?filename=/somewhere/image.jpg

 Is there another way to get at this file more directly from HTTP 'out of
 the box'?

 I'm imagining something like:

 http://localhost:50075/somewhere/image.jpg

 Is this sort of exposure of the HDFS namespace something I need to write
 into a server myself?

 Thanks in advance
 David

 On Mar 13, 2009, at 10:12 PM, S D wrote:

  I've used wget with Hadoop Streaming without any problems. Based on the
 error code you're getting, I suggest you make sure that you have the
 proper
 write permissions for the directory in which Hadoop will process (e.g.,
 download, convert, ...) on each of the task tracker machines. The location
 where data is processed on each machine is controlled by the hadoop.tmp.dir
 variable. The default value set in $HADOOP_HOME/conf/hadoop-default.xml is
 /tmp/hadoop-${user.name}. Make sure that the user running hadoop has
 permission to write to whatever directory you're using.

 John

 On Thu, Mar 12, 2009 at 10:02 PM, Nick Cen cenyo...@gmail.com wrote:

  Hi All,

 I am trying to use hadoop streaming with wget to simulate a
 distributed downloader.
 The command line i use is

 ./bin/hadoop jar -D mapred.reduce.tasks=0
 contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output urlo
 -mapper /usr/bin/wget -outputformat
 org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

 But it thrown an exception

 java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
 failed with code 1
  at

 org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295)
  at

 org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519)
  at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
  at org.apache.hadoop.mapred.Child.main(Child.java:155)

 can somebody point me to a reason why this happened? Thanks.



 --
 http://daily.appspot.com/food/





-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422