Re: Hadoop overhead

2008-01-16 Thread Johan Oskarsson
I simply followed the wiki: "The right level of parallelism for maps seems to be around 10-100 maps/node" (http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces). We have 8 cores in each machine, so perhaps 100 mappers ought to be right; it's set to 157 in the config, but Hadoop used ~200 for
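A minimal sketch of how this per-job hint is usually set with the old org.apache.hadoop.mapred JobConf API; 157 is just the value quoted above, and the class name is a placeholder. mapred.map.tasks is only a hint, which is why more maps can end up running:

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountHint {
        public static void main(String[] args) {
            JobConf conf = new JobConf(MapCountHint.class);

            // mapred.map.tasks is only a hint to the framework: the actual
            // number of map tasks is driven by the input splits, which is
            // why ~200 maps can run even with this set to 157.
            conf.setNumMapTasks(157);
        }
    }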

Re: unable to figure out this exception from reduce task

2008-01-16 Thread Jim the Standing Bear
Thanks Runping. It seems the bug is still open. However, in my case, there was plenty of disk space available. On Jan 16, 2008 2:44 AM, Runping Qi [EMAIL PROTECTED] wrote: I encountered a similar case. Here is the Jira: https://issues.apache.org/jira/browse/HADOOP-2164 Runping

Re: how to deploy hadoop on many PCs quickly?

2008-01-16 Thread Bin YANG
I used Norton Ghost 8.0 to ghost a whole Ubuntu hard disk to an image, and restored another hard disk from the image, but the restored hard disk cannot start up Ubuntu successfully. GRUB reports error 22. Does somebody know how to fix the problem? Thanks. Bin YANG On Jan 16, 2008 4:54 AM,

Re: Hadoop overhead

2008-01-16 Thread Ted Dunning
There is some considerable and very understandable confusion about map tasks, mappers and input splits. It is true that for large inputs the input should ultimately be split into chunks so that each core that you have has to process 10-100 pieces of data. To do that, however, you only need one

Re: how to deploy hadoop on many PCs quickly?

2008-01-16 Thread Ted Dunning
This isn't really a question about Hadoop, but is about system administration basics. You are probably missing a master boot record (MBR) on the disk. Ask a local linux expert to help you or look at the Norton documentation. On 1/16/08 4:59 AM, Bin YANG [EMAIL PROTECTED] wrote: I use the

a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Hi, How do I make Hadoop split its output? The program I am writing crawls a catalog tree from a single URL, so initially the input contains only one entry. After a few iterations, it will have tens of thousands of URLs. But what I noticed is that the file is always in one block (part-0).

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
Parallelizing the processing of data occurs in two steps. The first is during the map phase, where the input data file is (hopefully) split across multiple tasks. This should happen transparently most of the time unless you have a perverse data format or use unsplittable compression on your
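A small sketch of the job setup Ted is describing, assuming the old org.apache.hadoop.mapred API: plain text input is split automatically, while a single gzipped file is not.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SplittableInputExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SplittableInputExample.class);

            // TextInputFormat breaks large plain-text files into many input
            // splits, so many map tasks can work on one file in parallel.
            conf.setInputFormat(TextInputFormat.class);

            // A single .gz input file, by contrast, cannot be split: the
            // whole file goes to one map task regardless of any settings.
        }
    }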

Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Thanks Ted. I just didn't ask it right. Here is a stupid 101 question; I am sure the answer lies in the documentation somewhere, I was just having some difficulty finding it... when I do an ls on the DFS, I see this: /user/bear/output/part-0 r 4 I probably got

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
The part nomenclature does not refer to splits. It refers to how many reduce processes were involved in actually writing the output file. Files are split at read-time as necessary. You will get more of them if you have more reducers. On 1/16/08 8:25 AM, Jim the Standing Bear [EMAIL
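A minimal sketch of that relationship, again assuming the old JobConf API; four reducers is just an illustrative number.

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCountExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ReduceCountExample.class);

            // Each reduce task writes exactly one output file, so this job
            // produces part-00000 through part-00003 in its output directory.
            conf.setNumReduceTasks(4);

            // With a single reducer you get a single part file, no matter
            // how many map tasks ran.
        }
    }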

Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
hmm.. interesting... these are supposed to be the output from mappers (and default reducers since I didn't specify any for those jobs)... but shouldn't the number of reducers match the number of mappers? If there was only one reducer, it would mean I only had one mapper task running?? That is

Re: writing output files in hadoop streaming

2008-01-16 Thread John Heidemann
On 1/15/08 12:54 PM, Miles Osborne [EMAIL PROTECTED] wrote: Surely the clean way (in a streaming environment) would be to define a representation of some kind which serialises the output. http://en.wikipedia.org/wiki/Serialization After your mappers and reducers have completed, you would

Re: a question on number of parallel tasks

2008-01-16 Thread Miles Osborne
The number of reduces should be a function of the amount of data needing reducing, not the number of mappers. For example, your mappers might delete 90% of the input data, in which case you should only need 1/10 as many reducers as mappers. Miles On 16/01/2008, Jim the Standing Bear

Platform reliability with Hadoop

2008-01-16 Thread Jeff Eastman
I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen machines in our CUBiT array for the last month. During this time I have experienced two major data corruption losses on relatively small amounts of data (50 GB) that make me wonder about the suitability of this platform for

Re: a question on number of parallel tasks

2008-01-16 Thread Jim the Standing Bear
Thanks, Miles. On Jan 16, 2008 11:51 AM, Miles Osborne [EMAIL PROTECTED] wrote: The number of reduces should be a function of the amount of data needing reducing, not the number of mappers. For example, your mappers might delete 90% of the input data, in which case you should only need

Hadoop summit / workshop at Yahoo!

2008-01-16 Thread Ajay Anand
Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like

Re: Platform reliability with Hadoop

2008-01-16 Thread lohit . vijayarenu
"The DFS is stored in /tmp on each box. The developers who own the machines occasionally reboot and reprofile them." Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could this be the reason you see data corruption? A good idea is to configure DFS to be any place other than /tmp
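The relevant properties normally live in conf/hadoop-site.xml on every node; the sketch below just names them through the Configuration API to show which ones matter, and the /data/hadoop paths are placeholders.

    import org.apache.hadoop.conf.Configuration;

    public class DfsDirectories {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Keep DFS state out of /tmp so a reboot or cleanup job cannot
            // silently delete namenode metadata or datanode blocks.
            conf.set("dfs.name.dir", "/data/hadoop/dfs/name"); // namenode metadata
            conf.set("dfs.data.dir", "/data/hadoop/dfs/data"); // datanode blocks
            conf.set("hadoop.tmp.dir", "/data/hadoop/tmp");    // base for other temp paths
        }
    }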

Re: Platform reliability with Hadoop

2008-01-16 Thread Jason Venner
The /tmp default has caught us once or twice too. Now we put the files elsewhere. [EMAIL PROTECTED] wrote: The DFS is stored in /tmp on each box. The developers who own the machines occasionally reboot and reprofile them. Won't you lose your blocks after a reboot, since /tmp gets cleaned up?

RE: Platform reliability with Hadoop

2008-01-16 Thread Jeff Eastman
Thanks, I will try a safer place for the DFS. Jeff -Original Message- From: Jason Venner [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 16, 2008 10:04 AM To: hadoop-user@lucene.apache.org Subject: Re: Platform reliability with Hadoop The /tmp default has caught us once or twice too.

[ANNOUNCEMENT] Hadoop is a TLP

2008-01-16 Thread Doug Cutting
Apache's board this morning voted to make Hadoop a top-level project (TLP). The initial project management committee (PMC) for Hadoop will be composed of the following Hadoop committers: * Andrzej Bialecki [EMAIL PROTECTED] * Doug Cutting [EMAIL PROTECTED] *

copyFromLocal bug?

2008-01-16 Thread Michael Di Domenico
I was trying to copy a bunch of data over to my Hadoop installation and got the error below: [EMAIL PROTECTED] ~]$ hadoop/bin/hadoop dfs -copyFromLocal /export/gpfs/en wikipedia copyFromLocal: file:/export/gpfs/en/a/g/e/Ageha100%25.html: No such file or directory However, there is no

RE: copyFromLocal bug?

2008-01-16 Thread edward yoon
It seems to be a bug in the RawLocalFileSystem class. I'll file an issue. B. Regards, Edward yoon @ NHN, corp. Date: Wed, 16 Jan 2008 23:03:03 -0500 From: [EMAIL PROTECTED] To: hadoop-user@lucene.apache.org Subject: copyFromLocal bug? I was trying to copy a bunch of data over to my Hadoop

RE: copyFromLocal bug?

2008-01-16 Thread edward yoon
I filed an issue for this problem. https://issues.apache.org/jira/browse/HADOOP-2635 Thanks. B. Regards, Edward yoon @ NHN, corp. From: [EMAIL PROTECTED] To: hadoop-user@lucene.apache.org Subject: RE: copyFromLocal bug? Date: Thu, 17 Jan 2008 04:55:20 + It seems to be a bug in

Re: Single output file per reduce key?

2008-01-16 Thread Amar Kamat
Hi, Why couldn't you just write this logic in your reducer class? The reduce [reduceClass.reduce()] method is invoked with a key and an iterator over the values associated with that key. You can simply dump the values into a file. Since the input to the reducer is sorted, you can simply dump the
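A rough sketch of the kind of reducer Amar describes, written against the classic org.apache.hadoop.mapred API (the generic form of these interfaces; exact signatures differed slightly across the 0.1x releases discussed in this thread). The output directory and the Text key/value types are assumptions.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PerKeyFileReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        private JobConf job;

        public void configure(JobConf job) {
            this.job = job;
        }

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Write this key's values into a file of its own on the DFS.
            // The directory name is made up; pick anything outside the
            // job's regular output path.
            FileSystem fs = FileSystem.get(job);
            Path file = new Path("/user/bear/per-key-output", key.toString());
            FSDataOutputStream out = fs.create(file);
            try {
                while (values.hasNext()) {
                    out.write(values.next().toString().getBytes("UTF-8"));
                    out.write('\n');
                }
            } finally {
                out.close();
            }
            // Nothing is emitted to the normal collector here, so the usual
            // part-* files stay empty; emit to 'output' as well if you also
            // want the regular job output.
        }
    }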

Re: Single output file per reduce key?

2008-01-16 Thread Myles Grant
I would like the values for a key to exist in a single file, and only the values for that key. Each reduced key/value would get its own file. If I understand correctly, all output of the reducers is written to a single file. -Myles On Jan 16, 2008, at 9:29 PM, Amar Kamat wrote: Hi,

about using HBase?

2008-01-16 Thread ma qiang
Dear colleagues, I now have to use HBase in my map and reduce functions, but I don't know how to use it. I have seen the examples in the FAQ and org.apache.hadoop.hbase, but I can't run them successfully. Can you give me some simple examples that show how to manipulate HBase using the Java API in my map

RE: about using HBase?

2008-01-16 Thread edward yoon
Please copy the hadoop-0.16.*-hbase.jar into the ${hadoop_home}/lib folder. And here's an example of hadoop-site.xml: hbase.master a51066.nhncorp.com:6 The port for the hbase master web UI. Set to -1 if you do not want the info server to run.