how to preserve original line order?
The task should be simple: I want to put all the words of a (large) file in uppercase. I tried the following:
- streaming mode
- the mapper is a perl script that puts each line in uppercase (number of mappers: 1)
- no reducer (number of reducers set to zero)
It works fine except that line order is not preserved. How can I preserve the original line order? I would appreciate any suggestion. Roldano
Re: how to preserve original line order?
Associate an identifier with each line (e.g. the line number) and afterwards re-sort the data by that identifier. Miles

2009/3/13 Roldano Cattoni catt...@fbk.eu:
> The task should be simple, I want to put in uppercase all the words of a (large) file. I tried the following: - streaming mode - the mapper is a perl script that put each line in uppercase (number of mappers 1) - no reducer (number of reducers set to zero) It works fine except for line order which is not preserved. How to preserve the original line order? I would appreciate any suggestion. Roldano
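A minimal sketch of that idea in plain Java MapReduce, for reference (the class names are illustrative, not from the thread): the mapper keys each uppercased line by the byte offset that TextInputFormat already supplies as the input key, and a reducer writes the lines back out in key order.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical example classes, not from the original thread.
    public class UppercaseKeepOrder {

      // Key each uppercased line by its byte offset in the input file.
      public static class UpperMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<LongWritable, Text> out, Reporter reporter)
            throws IOException {
          out.collect(offset, new Text(line.toString().toUpperCase()));
        }
      }

      // With a single reducer, keys arrive sorted by offset, so the lines
      // come back out in their original order.
      public static class OrderReducer extends MapReduceBase
          implements Reducer<LongWritable, Text, Text, NullWritable> {
        public void reduce(LongWritable offset, Iterator<Text> lines,
                           OutputCollector<Text, NullWritable> out, Reporter reporter)
            throws IOException {
          while (lines.hasNext()) {
            out.collect(lines.next(), NullWritable.get());
          }
        }
      }
    }

Run with a single reduce task (conf.setNumReduceTasks(1)) so the sort by offset is global; with several reducers each output part is ordered only within itself. The streaming equivalent is to have the perl mapper print the line number as a key field and strip it off again after the sort.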
csv input format handling and mapping
Hi, can anyone share his experience or a solution for the following problem? I have to deal with a lot of different file formats, most of them csv. Each of them shares similar semantics, i.e. fields in file A exist in file B as well. What I'm not sure of is the exact index of a field in the csv file. Fields in file A may also have different names for the same thing as in file B. Simplified example:

affiliateA.csv:
Date; Clicks; Views; Orders
2009-03-10; 10; 20; 4

affiliateB.csv:
Date; Orders; Impressions; Clicks
13/03/09; 40; 2000; 1000

Possible mapping file:

<field-mapping>
  <field id="date" type="java.util.Date"/>
  <field id="clicks" type="java.lang.Integer"/>
  <field id="views" type="java.lang.Integer"/>
  <file path-pattern="/affiliateA/*.csv">
    <format type="csv">
      <seperator>\t</seperator>
      <quotes>&quot;</quotes>
    </format>
    <columns>
      <column name="Date" alias="date">
        <format>yyyy-MM-DD</format>
      </column>
      <column name="Clicks" alias="clicks"/>
    </columns>
  </file>
  <file path-pattern="/affiliateB/*.csv">
    <format type="csv">
      <seperator>;</seperator>
    </format>
    <columns>
      <column index="1" alias="date">
        <format>dd/MM/yyy</format>
      </column>
      <column index="2" alias="clicks"/>
    </columns>
  </file>
</field-mapping>

What I'd like to be able to do is use this external descriptor for each file with a custom hadoop InputFormat. Instead of a line of text, my MR values would be a Map containing the parsed values keyed by the field IDs:

map(key, fields) {
  Date date = fields.get('date');
  Integer clicks = fields.get('clicks');
}

This would allow me to decouple my MR job from the actual file format and also move all csv handling code out of my mappers. Does anyone know if such a solution already exists for hadoop? Any thoughts? Stefan
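As far as I know nothing like this ships with Hadoop, but here is a minimal sketch of the mapping idea in Java (all class and field names are hypothetical): a small helper translates one CSV line into a Map keyed by the canonical field ids, driven by a per-file column mapping. The same logic could be pushed down into a custom InputFormat/RecordReader so the mapper only ever sees the field map.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper, not an existing Hadoop class: maps raw CSV columns
    // to canonical field ids so the mapper never sees the file-specific layout.
    public class CsvFieldParser {

      private final String separator;
      private final Map<Integer, String> indexToFieldId;  // column index -> field id

      public CsvFieldParser(String separator, Map<Integer, String> indexToFieldId) {
        this.separator = separator;
        this.indexToFieldId = indexToFieldId;
      }

      // Parse one CSV line into a map keyed by canonical field id.
      public Map<String, String> parse(String line) {
        String[] columns = line.split(separator, -1);
        Map<String, String> fields = new HashMap<String, String>();
        for (Map.Entry<Integer, String> e : indexToFieldId.entrySet()) {
          int index = e.getKey();
          if (index < columns.length) {
            fields.put(e.getValue(), columns[index].trim());
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        // affiliateB.csv layout (0-based indexes): Date; Orders; Impressions; Clicks
        Map<Integer, String> mapping = new HashMap<Integer, String>();
        mapping.put(0, "date");
        mapping.put(3, "clicks");
        CsvFieldParser parser = new CsvFieldParser(";", mapping);
        Map<String, String> fields = parser.parse("13/03/09; 40; 2000; 1000");
        System.out.println(fields.get("date") + " -> " + fields.get("clicks"));
      }
    }

In a real job the column mapping and date formats would be read from the XML descriptor (e.g. distributed via the DistributedCache) and selected per input file path, and a custom RecordReader could then hand the resulting map to map() as the value.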
Reduce task going away for 10 seconds at a time
Hi folks, I've been debugging a severe performance problem with a Hadoop-based application (a highly modified version of Nutch). I recently upgraded to Hadoop 0.19.1 from a much, much older version, and a reduce that used to work just fine is now running orders of magnitude more slowly. From the logs I can see that progress of my reduce stops for periods that average almost exactly 10 seconds (with a very narrow distribution around 10 seconds), and it does so in various places in my code, but more or less in proportion to how much time I'd expect the task to normally spend in that particular place in the code; i.e. the behavior looks as if my code is randomly being interrupted for 10 seconds at a time. I'm planning to keep digging, but thought that these symptoms might sound familiar to someone on this list. Ring any bells? Your help is much appreciated. Thanks! Doug Cook
Re: how to upload files by web page
Hi, even I was looking for a solution to the same problem. I haven't tested it, but I think we can use the Globus Toolkit's GSI-FTP feature for this. In the RSL config file one can write the hdfs copy command to copy the file to hdfs. I've used this feature to upload and process files from Globus on the Sun N1 Grid Engine. --nitesh

2009/3/10 Yang Zhou yangzhou.e...@gmail.com:
:-) I am afraid you have to solve both of your questions yourself.
1. submit the urls to your own servlet.
2. develop your own code to read input bytes from those urls and save them to HDFS.
There is no ready-made tool. Good Luck.

2009/3/10 李睿 lrvb...@gmail.com
Thanks :) Could you tell me more about your solution? I have some questions:
1. where can I submit the urls to?
2. what is the backend service? Does it belong to HDFS?

2009/3/10 Yang Zhou yangzhou.e...@gmail.com
Hi, I have done that before. My solution is:
1. submit some FTP/SFTP/GridFTP urls of what you want to upload
2. a backend service will fetch those files/directories from FTP to HDFS directly.
Of course you can upload those files to the web server machine and then move them to HDFS. But since Hadoop is designed to process vast amounts of data, I do think my solution is more efficient. :-) You can find how to make directories and save files to HDFS in the source code of org.apache.hadoop.fs.FsShell.

2009/3/9 lrvb...@gmail.com
Hi all, I'm new to HDFS and want to upload files via JSP. Are there APIs I can use? Is there a demo? Thanks for your help :)

-- Nitesh Bhatia, Dhirubhai Ambani Institute of Information Communication Technology, Gandhinagar, Gujarat
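For the servlet route, the backend is essentially the FileSystem API that FsShell itself is built on. A minimal sketch (class name and paths are illustrative) that copies an uploaded input stream into HDFS might look like this:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Hypothetical helper: copy an uploaded stream (e.g. from a JSP/servlet
    // multipart request) into HDFS using the same API FsShell uses.
    public class HdfsUploader {

      public static void save(InputStream upload, String hdfsPath, Configuration conf)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);            // uses fs.default.name from the config
        Path dest = new Path(hdfsPath);
        fs.mkdirs(dest.getParent());                     // create the target directory
        FSDataOutputStream out = fs.create(dest, true);  // overwrite if it already exists
        try {
          IOUtils.copyBytes(upload, out, conf, false);   // stream the bytes into HDFS
        } finally {
          out.close();
          upload.close();
        }
      }
    }

A servlet or JSP would call HdfsUploader.save() with the uploaded part's input stream and a target path such as "/uploads/" + fileName; the FTP/GridFTP variants differ only in where the input stream comes from.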
Re: tuning performance
On 3/13/09 11:25 AM, Vadim Zaliva kroko...@gmail.com wrote:
> > When you stripe, you automatically make every disk in the system have the same speed as the slowest disk. In our experience, systems are more likely to have a 'slow' disk than a dead one, and detecting that is really, really hard. In a distributed system, that multiplier effect can have significant consequences on the whole grid's performance.
> All disks are the same, so there is no speed difference.

There will be when they start to fail. :)
Re: Creating Lucene index in Hadoop
Or you can check out the index contrib. The difference between the two is:
- In Nutch's indexing map/reduce job, indexes are built in the reduce phase. Afterwards, they are merged into a smaller number of shards if necessary. The last time I checked, the merge process does not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They are merged into the desired number of shards in the reduce phase. In addition, they can be merged into existing shards.
Cheers, Ning

On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 imcap...@126.com wrote:
you can see the nutch code.

2009/3/13 Mark Kerzner markkerz...@gmail.com
Hi, how do I allow multiple nodes to write to the same index file in HDFS? Thank you, Mark
Hadoop Upgrade Wiki
Step 8 of the upgrade process mentions copying the 'edits' and 'fsimage' files to a backup directory. After step 19 it says: 'In case of failure the administrator should have the checkpoint files in order to be able to repeat the procedure from the appropriate point or to restart the old version of Hadoop.' Is this different from running 'start-dfs.sh -rollback'? I'm not sure whether the Wiki is outdated or not. If it's the same, then step 8 can be skipped altogether, I'm guessing. Thanks
Cloudera Hadoop and Hive training now free online
Hey there, today we released our basic Hadoop and Hive training online. Access is free, and we address questions through Get Satisfaction. Many on this list are surely pros, but when you have friends trying to get up to speed, feel free to send this along. We provide a VM so new users can start doing the exercises right away. http://www.cloudera.com/hadoop-training-basic Cheers, Christophe
Re: Building Release 0.19.1
There may be a separate issue with Windows, but the error related to:
  [javac] import org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;
is the Eclipse 3.4 issue that is addressed by the patch in https://issues.apache.org/jira/browse/HADOOP-3744
null value output from map...
In writing a Map/Reduce job I ran across something I found a little strange. I have a situation where I don't need a value output from map. If I set the value of the OutputCollector<Text, IntWritable> to null I get the following exception:

java.lang.NullPointerException
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)

Looking at the code in MapTask.java (Hadoop 0.19.1) it makes sense why it would throw the exception:

  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }

I guess my question is as follows: is it a bad idea/not normal to collect a null value in map? Outputting from reduce through TextOutputFormat with a null value works as I expect: if the value is null, only the key and a newline are output. Any thoughts would be appreciated.
Re: Cloudera Hadoop and Hive training now free online
Hi, this is excellent! Do any of these presentations deal specifically with processing tree and graph data structures? I know that some basics can be found in the fifth MapReduce lecture here (http://www.youtube.com/watch?v=BT-piFBP4fE) presented by Aaron Kimball, or here (http://video.google.com/videoplay?docid=741403180270990805) by Barry Brumit, but something more detailed, comparing different approaches, would be really helpful. Trees are used in many algorithms (not only can they express hierarchy, they can also be used to compress data and many other fancy things...). I think there should be some knowledge about what works well and what does not when it comes to MapReduce and trees (or graphs). I am looking for this information. Regards, Lukas

On Fri, Mar 13, 2009 at 9:42 PM, Christophe Bisciglia christo...@cloudera.com wrote:
Hey there, today we released our basic Hadoop and Hive training online. Access is free, and we address questions through Get Satisfaction. Many on this list are surely pros, but when you have friends trying to get up to speed, feel free to send this along. We provide a VM so new users can start doing the exercises right away. http://www.cloudera.com/hadoop-training-basic Cheers, Christophe
Re: null value output from map...
You can initialize IntWritable with an empty constructor: IntWritable i = new IntWritable();

On Fri, Mar 13, 2009 at 2:21 PM, Andy Sautins andy.saut...@returnpath.net wrote:
In writing a Map/Reduce job I ran across something I found a little strange. I have a situation where I don't need a value output from map. If I set the value of the OutputCollector<Text, IntWritable> to null I get the following exception:
java.lang.NullPointerException
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:562)
Looking at the code in MapTask.java (Hadoop 0.19.1) it makes sense why it would throw the exception:
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
                          + valClass.getName() + ", recieved "
                          + value.getClass().getName());
  }
I guess my question is as follows: is it a bad idea/not normal to collect a null value in map? Outputting from reduce through TextOutputFormat with a null value works as I expect: if the value is null, only the key and a newline are output. Any thoughts would be appreciated.

-- Richa Khandelwal, University Of California, Santa Cruz. Ph: 425-241-7763
Re: null value output from map...
On Mar 13, 2009, at 3:56 PM, Richa Khandelwal wrote:
You can initialize IntWritable with an empty constructor: IntWritable i = new IntWritable();

NullWritable is better for that application than IntWritable. It doesn't consume any space when serialized. *smile* -- Owen
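A minimal sketch of that suggestion (the mapper here is hypothetical): declare the map output value type as NullWritable and collect NullWritable.get() instead of null, which avoids the NullPointerException and adds nothing to the serialized map output.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper that only needs keys: NullWritable.get() serializes
    // to zero bytes, so no space is wasted on the unused value.
    public class KeyOnlyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, NullWritable> out, Reporter reporter)
          throws IOException {
        out.collect(line, NullWritable.get());
      }
    }

The job also needs conf.setMapOutputValueClass(NullWritable.class) so the type check in MapOutputBuffer.collect() passes.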
Re: Reducers spawned when mapred.reduce.tasks=0
FWIW, we have released a workaround for this issue in Cascading 1.0.5.
http://www.cascading.org/
http://cascading.googlecode.com/files/cascading-1.0.5.tgz

In short, Hadoop 0.19.0 and 0.19.1 instantiate the user's Reducer class and subsequently call configure() when there is no intention of using the class (during job/task cleanup tasks). This can clearly cause havoc for users who use configure() to initialize resources used by the reduce() method. Testing whether jobConf.getNumReduceTasks() is 0 inside the configure() method seems to work out well. branch-0.19 looks like it won't instantiate the Reducer class during job/task cleanup tasks, so I expect the fix will make it into future releases. cheers, ckw

On Mar 12, 2009, at 8:20 PM, Amareshwari Sriramadasu wrote:
Are you seeing reducers getting spawned from the web UI? Then it is a bug. If not, there won't be reducers spawned; it could be the job-setup/job-cleanup task that is running in a reduce slot. See HADOOP-3150 and HADOOP-4261. -Amareshwari

Chris K Wensel wrote:
May have found the answer, waiting on confirmation from users. It turns out 0.19.0 and 0.19.1 instantiate the reducer class when the task is actually intended for job/task cleanup. branch-0.19 looks like it resolves this issue by not instantiating the reducer class in this case. I've got a workaround in the next maintenance release: http://github.com/cwensel/cascading/tree/wip-1.0.5 ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:
Hey all, I have some users reporting intermittent spawning of Reducers when the job.xml shows mapred.reduce.tasks=0 in 0.19.0 and 0.19.1. This is also confirmed when jobConf is queried in the (supposedly ignored) Reducer implementation. In general this issue would likely go unnoticed since the default reducer is IdentityReducer, but since it should be ignored in the Mapper-only case, we don't bother leaving the value unset, and it subsequently comes to one's attention rather abruptly. I am happy to open a JIRA, but wanted to see if anyone else is experiencing this issue. Note the issue seems to manifest with or without speculative execution. ckw

-- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
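A minimal sketch of the workaround described above (the reducer and its resource are hypothetical): guard the expensive initialization in configure() with the getNumReduceTasks() check, so nothing is set up when the class is only being instantiated for job/task cleanup.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // Hypothetical reducer base class showing the guard; reduce() omitted for brevity.
    public abstract class GuardedReducer extends MapReduceBase {

      protected Object expensiveResource;  // e.g. a connection, cache, or file handle

      @Override
      public void configure(JobConf job) {
        // On 0.19.0/0.19.1 the Reducer may be instantiated and configured for
        // job/task cleanup even when mapred.reduce.tasks=0; skip setup in that case.
        if (job.getNumReduceTasks() == 0) {
          return;
        }
        expensiveResource = openResource(job);
      }

      protected abstract Object openResource(JobConf job);
    }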
Re: Cloudera Hadoop and Hive training now free online
Hey Lukas, we love hearing about what you'd like to see in training. If you make a note on Get Satisfaction, we'll track it and keep you apprised of updates: http://getsatisfaction.com/cloudera/products/cloudera_hadoop_training Christophe

On Fri, Mar 13, 2009 at 2:27 PM, Lukáš Vlček lukas.vl...@gmail.com wrote:
Hi, this is excellent! Do any of these presentations deal specifically with processing tree and graph data structures? I know that some basics can be found in the fifth MapReduce lecture here (http://www.youtube.com/watch?v=BT-piFBP4fE) presented by Aaron Kimball, or here (http://video.google.com/videoplay?docid=741403180270990805) by Barry Brumit, but something more detailed, comparing different approaches, would be really helpful. Trees are used in many algorithms (not only can they express hierarchy, they can also be used to compress data and many other fancy things...). I think there should be some knowledge about what works well and what does not when it comes to MapReduce and trees (or graphs). I am looking for this information. Regards, Lukas

On Fri, Mar 13, 2009 at 9:42 PM, Christophe Bisciglia christo...@cloudera.com wrote:
Hey there, today we released our basic Hadoop and Hive training online. Access is free, and we address questions through Get Satisfaction. Many on this list are surely pros, but when you have friends trying to get up to speed, feel free to send this along. We provide a VM so new users can start doing the exercises right away. http://www.cloudera.com/hadoop-training-basic Cheers, Christophe
HTTP addressable files from HDFS?
Hello, I realize that using HTTP you can have a file in HDFS streamed - that is, the servlet responds to the following request with Content-Disposition: attachment, and a download is forced (at least from a browser's perspective), like so:

http://localhost:50075/streamFile?filename=/somewhere/image.jpg

Is there another way to get at this file more directly over HTTP 'out of the box'? I'm imagining something like:

http://localhost:50075/somewhere/image.jpg

Is this sort of exposure of the HDFS namespace something I need to write into a server myself? Thanks in advance, David

On Mar 13, 2009, at 10:12 PM, S D wrote:
I've used wget with Hadoop Streaming without any problems. Based on the error code you're getting, I suggest you make sure that you have the proper write permissions for the directory in which Hadoop will process (e.g., download, convert, ...) on each of the task tracker machines. The location where data is processed on each machine is controlled by the hadoop.tmp.dir variable. The default value set in $HADOOP_HOME/conf/hadoop-default.xml is /tmp/hadoop-${user.name}. Make sure that the user running hadoop has permission to write to whatever directory you're using. John

On Thu, Mar 12, 2009 at 10:02 PM, Nick Cen cenyo...@gmail.com wrote:
Hi all, I am trying to use hadoop streaming with wget to simulate a distributed downloader. The command line I use is:
./bin/hadoop jar -D mapred.reduce.tasks=0 contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output urlo -mapper /usr/bin/wget -outputformat org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
But it throws an exception:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
  at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295)
  at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519)
  at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
  at org.apache.hadoop.mapred.Child.main(Child.java:155)
Can somebody point me to why this happened? Thanks. -- http://daily.appspot.com/food/
Re: HTTP addressable files from HDFS?
wget http://namenode:port/data/filename will return the file. The namenode will redirect the HTTP request to a datanode that has at least some of the blocks in local storage to serve the actual request. The key piece, of course, is the /data prefix on the file name. The port is the port the web GUI is running on, NOT the HDFS port; commonly the port is 50070.

On Fri, Mar 13, 2009 at 7:54 PM, David Michael david.mich...@gmail.com wrote:
Hello, I realize that using HTTP you can have a file in HDFS streamed - that is, the servlet responds to the following request with Content-Disposition: attachment, and a download is forced (at least from a browser's perspective), like so:
http://localhost:50075/streamFile?filename=/somewhere/image.jpg
Is there another way to get at this file more directly over HTTP 'out of the box'? I'm imagining something like:
http://localhost:50075/somewhere/image.jpg
Is this sort of exposure of the HDFS namespace something I need to write into a server myself? Thanks in advance, David

On Mar 13, 2009, at 10:12 PM, S D wrote:
I've used wget with Hadoop Streaming without any problems. Based on the error code you're getting, I suggest you make sure that you have the proper write permissions for the directory in which Hadoop will process (e.g., download, convert, ...) on each of the task tracker machines. The location where data is processed on each machine is controlled by the hadoop.tmp.dir variable. The default value set in $HADOOP_HOME/conf/hadoop-default.xml is /tmp/hadoop-${user.name}. Make sure that the user running hadoop has permission to write to whatever directory you're using. John

On Thu, Mar 12, 2009 at 10:02 PM, Nick Cen cenyo...@gmail.com wrote:
Hi all, I am trying to use hadoop streaming with wget to simulate a distributed downloader. The command line I use is:
./bin/hadoop jar -D mapred.reduce.tasks=0 contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output urlo -mapper /usr/bin/wget -outputformat org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
But it throws an exception:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
  at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295)
  at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519)
  at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
  at org.apache.hadoop.mapred.Child.main(Child.java:155)
Can somebody point me to why this happened? Thanks. -- http://daily.appspot.com/food/

-- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
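For completeness, a minimal sketch of using that HTTP interface from Java (host, port and path are illustrative, and the example assumes a text file): request /data/<hdfs-path> on the namenode's web UI port and simply follow the redirect to a datanode.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hypothetical client: the /data prefix is prepended to the HDFS path,
    // and the namenode redirects the request to a datanode holding the blocks.
    public class HdfsHttpFetch {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode:50070/data/somewhere/part-00000");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);  // follow the namenode -> datanode redirect
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }

For binary content like the image.jpg example you would read raw bytes from the connection's input stream instead of lines.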