Re: Hadoop Wiki

2006-03-06 Thread Doug Cutting
Jeff Ritchie wrote: Hadoop wiki could use some configuration ;) It sure could! I requested that it be created back in January: https://issues.apache.org/jira/browse/INFRA-701 But for some reason no new Apache wikis have been created since then... Doug

Re: DFS vs GFS

2006-03-14 Thread Doug Cutting
Yonik Seeley wrote: The JavaDoc points out one: single-writer, stream only (no record append, no writing to specific spot in file, etc). Is that a different design decision, or simply something that hasn't been implemented yet? It's a simplification. We may add appends and multiple writers

Re: Name node redundancy

2006-03-29 Thread Doug Cutting
Aled Jones wrote: Anyhoo, I'm fairly new to hadoop and was wondering about the redundancy aspects of it. If I have a few servers running for nutch, one being a name and data node, the others just data nodes, what happens when the name node falls over? To get proper redundancy in a hadoop

Re: HADOOP-117 doubts (winxp without cygwin)

2006-04-05 Thread Doug Cutting
Raghavendra Prabhu wrote: Is pure Winxp operation supported now? (i.e. without Cygwin), as df is supported now (thanx to the group) No, cygwin is still required. Maybe this is an isolated case in winxp operation. But can someone check and confirm? It would be helpful I try to

Re: HADOOP-117 doubts (winxp without cygwin)

2006-04-05 Thread Doug Cutting
Raghavendra Prabhu wrote: I would like to contribute. I will try to write one in my spare time (the time spent in something other than comprehending the architecture) Great! Please read the contribution instructions on the wiki: http://wiki.apache.org/lucene-hadoop/HowToContribute Thanks,

Re: Purpose of Job.jar

2006-04-06 Thread Doug Cutting
Dennis Kubes wrote: I keep seeing references to job.jar files. Can someone explain what the job.jar files are and are they only used in distributed mode? They are only required for distributed operation. They permit a job to provide code that is not installed on all nodes. In general, user

Re: Confusion about the Hadoop conf/slaves file

2006-04-11 Thread Doug Cutting
Scott Simpson wrote: Excuse my ignorance on this issue. Say I have 5 machines in my Hadoop cluster and I only list two of them in the configuration file when I do a fetch or a generate. Won't this just store the data on the two nodes since that is all I've listed for my crawling machines? I'm

Re: Out of memory after Map tasks

2006-05-25 Thread Doug Cutting
Vijay Murthi wrote: Are you running the current trunk? My guess is that you are. If so, then this error is normal, things should keep running. I am using hadoop-0.2.0. I believe this is the current trunk. No, that's a release. The trunk is what's currently in Subversion. I used to think

Re: Treating large numbers of slaves with scheduled downtime

2006-07-24 Thread Doug Cutting
The easiest way would be to not use anything but your reliable machines as datanodes. Alternately, for better performance, you could run two DFS systems, one on all machines, and one on just the reliable machines, and back one up to the other before you shut down the unreliable nodes each

Re: Task type priorities during scheduling ?

2006-07-25 Thread Doug Cutting
Paul Sutter wrote: it should be possible to have lots of tasks in the shuffle phase (mostly, sitting around waiting for mappers to run), but only have about one actual reduce phase running per cpu (or whatever works for each of our apps) that gets enough memory for a sorter, does substantial

Re: Using Hadoop with NFS mounted file server

2006-08-14 Thread Doug Cutting
You don't want to use DFS on top of NFS. If you use DFS, keep its data on the local drives, not in NFS. If you want to use NFS for shared data, then simply don't use DFS: specify local as the filesystem and don't start datanodes or a namenode. I think you'll find DFS will perform better
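
For illustration, a minimal hadoop-site.xml sketch of the setup Doug describes, assuming the pre-0.15 convention of naming the local filesystem "local":

  <configuration>
    <!-- use the local (here NFS-mounted) filesystem instead of DFS;
         with this setting, no namenode or datanodes are needed -->
    <property>
      <name>fs.default.name</name>
      <value>local</value>
    </property>
  </configuration>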

Re: Some queries about stability and reliability

2006-08-14 Thread Doug Cutting
Konstantin Shvachko wrote: On the logging issue. I think we should change the default logging level, which is INFO at the moment. I think INFO is the appropriate default logging level. If there are things logged at the INFO level that are too verbose, then we should change these to DEBUG

Re: Compile error with trunk and java 1.4

2006-08-14 Thread Doug Cutting
Renaud Richardet wrote: Does Hadoop require java 5? Yes. We're not yet extensively using or encouraging Java 5 features, but it is now required. I get a compile error when building the trunk with java 1.4. This change below will make it build again. I think there are more changes

Re: Some queries about stability and reliability

2006-08-15 Thread Doug Cutting
Yoram Arnon wrote: User code data gets written to the tasktracker's log at the INFO level. We switched to WARNING level when a rogue user program produced a lot of output to stdout, and it filled the task trackers' logs with junk. Another approach might be to log warning and fatal messages to a

Re: question about Log4j Configuration

2006-08-15 Thread Doug Cutting
Hadoop uses Commons Logging: http://jakarta.apache.org/commons/logging/ One should be able to configure it to use other logging backends or a null logger: http://jakarta.apache.org/commons/logging/commons-logging-1.1/guide.html#Configuration Please tell us how this works. Doug Dilma

Amazon EC2

2006-08-25 Thread Doug Cutting
Has anyone tried running Hadoop on the Amazon Elastic Compute Cloud yet? http://www.amazon.com/gp/browse.html?node=201590011 One way to use Hadoop on this would be to: 1. Allocate a pool of machines. 2. Start Hadoop daemons. 3. Load the HDFS filesystem with input from Amazon S3. 4. Run a

Re: Number of Reduce Outputs

2006-08-29 Thread Doug Cutting
To generate a single output file, specify just a single reduce task. If your reducer isn't doing much computation, then it might be faster to do this in the original job, otherwise use a subsequent job. Doug Dennis Kubes wrote: This is probably a simple question but when I run my MR job I am
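
A minimal sketch of the single-reducer setup, using the old mapred API; the class name is a placeholder:

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class SingleOutputExample {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SingleOutputExample.class);
      conf.setNumReduceTasks(1);  // one reduce task => one output file (part-00000)
      // ... configure mapper, reducer, input and output paths as usual ...
      JobClient.runJob(conf);
    }
  }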

Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Doug Cutting
Frédéric Bertin wrote: This should run clientside, since it depends on the username, which is different on the server. then, what about passing the username as a parameter to the JobSubmissionProtocol.submitJob(...) ? This avoids loading the whole JobConf clientside just to set the username.

Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Doug Cutting
Eric Baldeschwieler wrote: Also the thread I started last week on using URLs in general for input arguments. Seems like we should just take a URL for the jar, which could be file: or hdfs: That would work. The jobclient could automatically copy file: urls to the jobtracker's native fs.

Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-09-01 Thread Doug Cutting
Frédéric Bertin wrote: Indeed, I would like to have a centralized jobs repository on the HDFS where all jobs will be stored. Something like:
  /jobs
    /job1
      job1.xml
      job1.jar
    /job2
      job2.xml
      job2.jar
  ...
Then, submitting a job would be as simple as

Re: Executing hadoop binded on localhost

2006-09-11 Thread Doug Cutting
Sylvain Wallez wrote: I don't know Hadoop's internals well, but it seems to me that an additional configuration could do the trick, e.g. String itfAddr = conf.getString("ipc.server.listen.address") address = (itfAddr == null) ? new InetSocketAddress(port) : new InetSocketAddress(itfAddr,

Re: java.io.IOException: wrong value class

2006-09-14 Thread Doug Cutting
This sounds like: http://issues.apache.org/jira/browse/HADOOP-534 A patch should be committed to trunk tomorrow, and a point release will be made shortly thereafter. In the meantime, you could experiment with the 0.5.0 release. Doug Aaron Wong wrote: Hi, I'm new to hadoop. I was going

Re: Why not use Serializable?

2006-09-25 Thread Doug Cutting
Curt Cox wrote: I'm curious why the new Writable interface was chosen rather than using Serializable. The Writable interface is subtly different than Serializable. Serializable does not assume the class of stored values is known. So each instance is tagged with its class.
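
To make the contrast concrete, a minimal Writable sketch (the class is a made-up example); note that, unlike Serializable, nothing but the raw field data goes into the stream, so the reader must already know the class:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  public class PointWritable implements Writable {
    private int x;
    private int y;

    public void write(DataOutput out) throws IOException {
      out.writeInt(x);   // raw fields only; no class tag, no stream metadata
      out.writeInt(y);
    }

    public void readFields(DataInput in) throws IOException {
      x = in.readInt();  // fields must be read in exactly the order written
      y = in.readInt();
    }
  }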

Re: Why not use Serializable?

2006-09-25 Thread Doug Cutting
Curt Cox wrote: Let me restate, so you can tell me if I'm wrong. Writable is used instead of Serializable, because it provides for more compact stream format and allows for easier random access. They have different semantics, but don't have a major impact on versioning. Serialization's

Re: Why not use Serializable?

2006-09-25 Thread Doug Cutting
Curt Cox wrote: In my experience, using Serialization instead of DataInput/DataOutput streams has a major impact on versioning. Serialization keeps a lot of metadata in the stream. This makes detecting format changes very easy, but can really complicate backward compatibility. FYI, Owen has

Re: Why not use Serializable?

2006-09-26 Thread Doug Cutting
Feng Jiang wrote: As for the IPC (it used to be RPC about one year ago) implementation, I think it has some performance problems. I don't know why the Listener has to read the data and prepare the Call instance then put the Call instance into a queue. The reading process may take a long time, and

Re: Why not use Serializable?

2006-09-27 Thread Doug Cutting
Feng Jiang wrote: In my implementation, I still permit out-of-order RPC calls in the same way. The only difference between my impl and your previous impl is: 1. I made use of a thread pool (JDK 1.5) to replace the Handler threads. I believe the JDK's impl should not be worse than our own, and

Re: Reducer and Keys

2006-10-02 Thread Doug Cutting
Owen O'Malley wrote: SequenceFile.Writer is more than willing to write unsorted files. However MapFile.Writer would complain, since it creates an index for random-access, and requires that the data is well sorted. So it depends on your output format: SequenceFileOutputFormat and

Re: Formatting the Namenode

2006-10-11 Thread Doug Cutting
This refers to formatting Hadoop's DFS filesystem, not formatting a linux volume. Hadoop's DFS filesystem is implemented on top of the local filesystems of your cluster. Hadoop does not require reformatting of linux filesystem volumes. Formatting a Hadoop DFS filesystem simply creates a few

Re: Combining MapReduce implementations

2006-10-11 Thread Doug Cutting
Trevor Strohman wrote: Yes, this sounds very interesting. Does it build on the Record IO classes or is it completely separate? I'm afraid it's completely separate, although it's not much code. The TypeBuilder is ~600 lines of code right now, plus maybe 500 lines of additional support

Re: Combining MapReduce implementations

2006-10-16 Thread Doug Cutting
any mapreduce tasks finish and moving chunks to another box. Lee On 10/11/06, Doug Cutting [EMAIL PROTECTED] wrote: Trevor Strohman wrote: Grid Engine: All the machines available to me run Sun's Grid Engine for job submission. Grid Engine is important for us, because it makes sure that all

Re: Advice wanted

2006-10-26 Thread Doug Cutting
Andrzej Bialecki wrote: Grant Ingersoll wrote: 2. This time, instead of tokens I have X number of whole documents that need to be translated from source to destination and the way the translation systems work, it is best to have the whole document together when getting a translation. My plan

Using Hadoop on Amazon EC2

2006-10-27 Thread Doug Cutting
I just added a new wiki page describing how I was able to use Hadoop on Amazon's EC2 computing infrastructure. If others test this, please help improve it. http://wiki.apache.org/lucene-hadoop/AmazonEC2 Thanks, Doug

Re: Help in setting Hadoop on multiple servers

2006-11-06 Thread Doug Cutting
howard chen wrote: but when I stop-all --config...it shows... no jobtracker to stop serverA: Login Success! serverB: Login Success! serverB: no tasktracker to stop It looks like the tasktracker crashed on startup. Login to ServerB and look in its logs to see what happened. Doug

Re: Help in setting Hadoop on multiple servers

2006-11-07 Thread Doug Cutting
howard chen wrote: 2006-11-07 21:53:35,492 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Bad mapred.job.tracker: local To run distributed, you must configure mapred.job.tracker and fs.default.name to be host:port pairs on all
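
For reference, a hadoop-site.xml sketch of the two settings Doug names; the host name and ports are placeholders:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>master.example.com:9000</value>   <!-- namenode host:port -->
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>   <!-- jobtracker host:port -->
    </property>
  </configuration>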

Re: why to force a single reduce task for Local Runner?

2006-11-07 Thread Doug Cutting
Feng Jiang wrote: look at the code: job.setNumReduceTasks(1); // force a single reduce task why? Is there any difficulty there to allow multiple reduce tasks? There is not a strong reason why a single reduce task is required. This code attempts to implement things as simply

Re: Mapredtest failure

2006-11-09 Thread Doug Cutting
Brendan Melville wrote: in hadoop-site.xml I had mapred.map.tasks and mapred.reduce.tasks set. Right, these parameters should be specified in mapred-default.xml, so that they do not override application code. This is a common confusion. Someday we should perhaps alter the configuration

Re: MapFile.get() has a bug?

2006-11-28 Thread Doug Cutting
Albert Chern wrote: Every time the size of the map file hits a multiple of the index interval, an index entry is written. Therefore, it is possible that an index entry is not added for the first occurrence of a key, but one of the later ones. The reader will then seek to one of those instead

Re: How to say Hadoop

2006-12-08 Thread Doug Cutting
Owen O'Malley wrote: I think Hadoop is pronounced as h a: - d u: p with the emphasis on the second syllable. (key: http://en.wikipedia.org/wiki/IPA_chart_for_English) I believe the first vowel there is properly æ (as in cat), but in rapid speech this unstressed vowel turns to a schwa, so

Re: Urgent: Production Issues

2006-12-21 Thread Doug Cutting
Jagadeesh wrote: Over the past day we have managed to migrate our clusters from 0.7.2 to 0.9.0. Thanks for sharing your experiences. Please note that there is now a 0.9.2 release. There should be no compatibility issues upgrading from 0.9.0 to 0.9.2, and a number of bugs are fixed, so I

Re: default JAVA_HOME in hadoop-env.sh

2007-01-03 Thread Doug Cutting
Shannon -jj Behrens wrote: The default JAVA_HOME in hadoop-env.sh is /usr/bin/java. This is confusing because /usr/bin/java is a binary, not a directory. On my system, this resulted in: $ hadoop namenode -format /usr/local/hadoop-install/hadoop/bin/hadoop: 122: /usr/bin/java/bin/java: not
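
The fix is to point JAVA_HOME at the JDK installation directory rather than the java binary; a conf/hadoop-env.sh sketch, with a placeholder path:

  # JAVA_HOME must be the install directory, not the binary:
  # $JAVA_HOME/bin/java is what bin/hadoop actually invokes.
  export JAVA_HOME=/usr/lib/jvm/java-1.5.0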

Re: Hadoop on Ubuntu 6.10

2007-01-03 Thread Doug Cutting
Can you please file a bug in Jira for this? https://issues.apache.org/jira/browse/HADOOP Select CREATE NEW ISSUE. Create yourself a Jira account if you don't already have one. Thanks, Doug Shannon -jj Behrens wrote: I'm using Hadoop on Ubuntu 6.10. I ran into: $ start-all.sh starting

Re: HadoopStreaming

2007-01-03 Thread Doug Cutting
Shannon -jj Behrens wrote: There's no link to http://wiki.apache.org/lucene-hadoop/HadoopStreaming on http://wiki.apache.org/lucene-hadoop/. It would be really nice if there were one. Please add one. Anyone can help maintain the wiki. Simply create yourself an account and edit the page.

Re: s3

2007-01-08 Thread Doug Cutting
Tom White wrote: And what do people think of the following. We already have a bunch of stuff up in S3 that we'd like to use as input to a hadoop mapreduce job only it wasn't put there by hadoop so it doesn't have the hadoop format where file-is-actually-a-list-of-blocks. [ ... ] The best

Re: s3

2007-01-08 Thread Doug Cutting
Tom White wrote: This sounds like a good plan. I wonder whether the existing block-based s3 scheme should be renamed (as s3block or similar) so s3 is the scheme that stores raw files as you describe? Perhaps s3fs would be best for the full FileSystem implementation, and simply s3 for direct

Re: Hadoop + Lucene integration: possible? how?

2007-01-15 Thread Doug Cutting
Andrzej Bialecki wrote: It's possible to use Hadoop DFS to host a read-only Lucene index and use it for searching (Nutch has an implementation of FSDirectory for this purpose), but the performance is not stellar ... Right, the best practice is to copy Lucene indexes to local drives in order

Re: Best practice for in memory data?

2007-01-25 Thread Doug Cutting
Johan Oskarsson wrote: Any advice on how to solve this problem? I think your current solutions sound reasonable. Would it be possible to somehow share a hashmap between tasks? Not without running multiple tasks in the same JVM. We could implement a mode where child tasks are run directly

Re: How to use MultithreadedMapRunner and MapRunner with the same hadoop-site.xml

2007-01-25 Thread Doug Cutting
Gu wrote: How can I use in some cases MultithreadedMapRunner, and in some cases MapRunner for different jobs? Use JobConf#setMapRunnerClass() on jobs where you want to override the default MapRunner with, e.g., MultithreadedMapRunner. Do I have to use one hadoop-site.xml for one job? But I
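
A sketch of the per-job override Doug suggests, so one hadoop-site.xml can serve both kinds of jobs; the class name is a placeholder:

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

  public class ThreadedJobExample {
    public static void main(String[] args) {
      JobConf conf = new JobConf(ThreadedJobExample.class);
      // override the default MapRunner for this job only; jobs that
      // skip this call keep the stock single-threaded runner
      conf.setMapRunnerClass(MultithreadedMapRunner.class);
    }
  }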

Re: Large data sets

2007-02-06 Thread Doug Cutting
Konstantin Shvachko wrote: 200 bytes per file is theoretically correct, but rather optimistic :-( From a real system memory utilization I can see that HDFS uses 1.5-2K per file. And since each real file is internally represented by two files (1 real + 1 crc) the real estimate per file should

Re: Using Hadoop for Record storage

2007-04-12 Thread Doug Cutting
Andy Liu wrote: I'm exploring the possibility of using the Hadoop records framework to store these document records on disk. Here are my questions: 1. Is this a good application of the Hadoop records framework, keeping in mind that my goals are speed and scalability? I'm assuming the answer

Re: Running on multiple CPU's

2007-04-16 Thread Doug Cutting
Eelco Lempsink wrote: Inspired by http://www.mail-archive.com/[EMAIL PROTECTED]/msg02394.html I'm trying to run Hadoop on multiple CPU's, but without using HDFS. To be clear: you need some sort of shared filesystem, if not HDFS, then NFS, S3, or something else. For example, the job client

bandwidth (Was: Re: Running on multiple CPU's)

2007-04-16 Thread Doug Cutting
Please use a new subject when starting a new topic. jafarim wrote: Sorry if being off topic, but we experienced a very low bandwidth with hadoop while copying files to/from the cluster (some 1/100 compared to a plain samba share). The bandwidth did not improve at all by adding nodes to the

Re: Running on multiple CPU's

2007-04-16 Thread Doug Cutting
Ken Krugler wrote: Has anybody been using Hadoop with ZFS? Would ZFS count as a readily available shared file system that scales appropriately? Sun's ZFS? I don't think that's distributed, is it? Does it provide a single namespace across an arbitrarily large cluster? From the

Re: bandwidth (Was: Re: Running on multiple CPU's)

2007-04-16 Thread Doug Cutting
jafarim wrote: On linux and jvm6 with normal IDE disks and a gigabit ethernet switch with corresponding NICs and with hadoop 0.9.11's HDFS. We wrote a C program by using the native libs provided in the package but then we tested again with distcp. The scenario was as follows: We ran the test on a

Re: Running on multiple CPU's

2007-04-17 Thread Doug Cutting
Eelco Lempsink wrote: I'm not trying to run it on a cluster though, only on one host with multiple CPU's. So I guess the local filesystem is shared and therefore it should be fine. Yes, that should be fine. However, if I try with fs.default.name set to file:///tmp/hadoop-test/ still

Re: Serializing code to nodes: no can do?

2007-04-24 Thread Doug Cutting
Pedro Guedes wrote: For this I need to be able to register new steps in my chain and pass them to hadoop to execute as a mapreduce job. I see two choices here: 1 - build a .job archive (main-class: mycrawler, submits jobs thru JobClient) with my new steps and dependencies in the 'lib/'

Re: Many Checksum Errors

2007-05-02 Thread Doug Cutting
Dennis Kubes wrote: Do we know if this is a hardware issue. If it is possibly a software issue I can dedicate some resources to tracking down bugs. I would just need a little guidance on where to start looking? We don't know. The checksum mechanism is designed to catch hardware problems.

Re: slowness in hadoop reduce phase when using distributed mode

2007-05-03 Thread Doug Cutting
What version of Hadoop are you using? On what sort of a cluster? How big is your dataset? Doug moonwatcher wrote: hey guys, i've setup hadoop in distributed mode (jobtracker, tasktracker, and hdfs daemons), and observing that the map phase executes really quickly but the reduce phase

Re: Configuration and Hadoop cluster setup

2007-05-29 Thread Doug Cutting
Phantom wrote: (1) Set my fs.default.name to hdfs://host:port and also specify it in the JobConf configuration. Copy my sample input file into HDFS using bin/hadoop dfs -put from my local file system. I then need to specify this file to my WordCount sample as input. Should I specify this file

Re: Bad concurrency bug in 0.12.3?

2007-06-01 Thread Doug Cutting
Calvin Yu wrote: The problem seems to be with the MapTask's (MapTask.java) sort progress thread (line #196) not stopping after the sort is completed, and hence the call to join() (line# 190) never returns. This is because that thread is only catching the InterruptedException, and not checking

Re: Bad concurrency bug in 0.12.3?

2007-06-01 Thread Doug Cutting
a thread dump of the hang up. Calvin On 6/1/07, Doug Cutting [EMAIL PROTECTED] wrote: Calvin Yu wrote: The problem seems to be with the MapTask's (MapTask.java) sort progress thread (line #196) not stopping after the sort is completed, and hence the call to join() (line# 190) never returns

Re: Pipe/redirection to HDFS?

2007-06-01 Thread Doug Cutting
Mark Meissonnier wrote: Sweet. It works. Thanks. Someone should put it on this wiki page http://wiki.apache.org/lucene-hadoop/hadoop-0.1-dev/bin/hadoop_dfs I don't have editing privileges. Anyone can create themselves a wiki account and edit pages. Just use the Login button at the top of

Re: Can Hadoop MapReduce be used without using HDFS

2007-06-11 Thread Doug Cutting
Neeraj Mahajan wrote: I read from Hadoop docs that the task scheduler tries to execute the task closer to the data. Can this functionality be applied without using HDFS? How? You can subclass LocalFileSystem and override getFileCacheHints() to return the host where the file is known to be
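
A rough sketch of the subclassing Doug suggests; the exact getFileCacheHints signature varied between releases, and hostForFile() is a hypothetical lookup you would supply yourself:

  import java.io.IOException;
  import org.apache.hadoop.fs.LocalFileSystem;
  import org.apache.hadoop.fs.Path;

  public class LocalityAwareFileSystem extends LocalFileSystem {
    // report the host known to hold this file, so the scheduler
    // can try to run the task there
    public String[][] getFileCacheHints(Path f, long start, long len)
        throws IOException {
      return new String[][] { { hostForFile(f) } };
    }

    private String hostForFile(Path f) {
      return "node1.example.com";  // placeholder: look up the real host here
    }
  }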

Re: hdfsOpenFile() API

2007-06-14 Thread Doug Cutting
Phantom wrote: Which would mean that if I want my logs to reside in HDFS I will have to move them using copyFromLocal or some version thereof and then run a Map/Reduce process against them? Am I right? Yes. HDFS is probably not currently suitable for directly storing log output as it

Re: MapFile inner workings

2007-06-20 Thread Doug Cutting
Every 128th key is held in memory. So if you've got 1M keys in a MapFile, then opening a MapFile.Reader would read 10k keys into memory. Binary search is used on these in-memory keys, so that a maximum of 127 entries must be scanned per random access. Doug Phantom wrote: Hi All I know
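
A small usage sketch; the path and key/value types are placeholders and must match whatever the MapFile was written with:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // opening the reader pulls every 128th key into memory (the index)
      MapFile.Reader reader = new MapFile.Reader(fs, "/data/mymap", conf);
      IntWritable value = new IntWritable();
      // binary search the in-memory index, then scan at most one
      // index interval of entries on disk
      reader.get(new Text("some-key"), value);
      reader.close();
    }
  }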

Re: map task in initializing phase for too long

2007-06-21 Thread Doug Cutting
Jun Rao wrote: I am wondering if anyone has experienced this problem. Sometimes when I ran a job, a few map tasks (often just one) hang in the initializing phase for more than 3 minutes (it normally finishes in a couple seconds). They will eventually finish, but the whole job is slowed down

Re: map task in initializing phase for too long

2007-06-21 Thread Doug Cutting
Raghu Angadi wrote: Doug Cutting wrote: Owen wrote: One side note is that all of the servers have a servlet such that if you do http://node:port/stacks you'll get a stack trace of all the threads in the server. I find that useful for remote debugging. *smile* Although if it is a task jvm

Re: Cluster efficiency

2007-06-21 Thread Doug Cutting
Mathijs Homminga wrote: Is there a way to easily determine the efficiency of my cluster? Example: - there are 5 slaves which can handle 1 task at the time each - there is one job, split into 5 sub tasks (5 maps and 5 reduces) - 4 slaves finish their tasks in 1 minute - 1 slave finishes its tasks

Re: Examples of chained MapReduce?

2007-06-22 Thread Doug Cutting
James Kennedy wrote: So far I've had trouble finding examples of MapReduce jobs that are kicked off by some one-time process and that in turn kick off other MapReduce jobs long after the initial driver process is dead. This would be more distributed and fault tolerant since it removes

Re: Multi-case dfs.name.dir

2007-06-25 Thread Doug Cutting
KrzyCube wrote: I found File[] editFiles in FSEditLog.java, then I traced the call stack and found that dfs.name.dir can be configured with multiple directories. Does this mean the NameNode data can be split into pieces, or is replication just set to the number of dirs that

Re: Setting number of Maps

2007-07-03 Thread Doug Cutting
You could define an InputFormat whose InputSplits are not files, but rather simply have a field that is a complex number. The complex field would be written and read by Writable#write() and Writable#readFields(). This InputFormat would ignore the input directory, since it is not a file-based
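
A sketch of such a split under the old mapred API; ComplexSplit and its fields are hypothetical names:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.mapred.InputSplit;

  public class ComplexSplit implements InputSplit {
    private double real;
    private double imag;

    public ComplexSplit() {}                       // needed for deserialization
    public ComplexSplit(double re, double im) { real = re; imag = im; }

    public void write(DataOutput out) throws IOException {
      out.writeDouble(real);                       // Writable#write()
      out.writeDouble(imag);
    }

    public void readFields(DataInput in) throws IOException {
      real = in.readDouble();                      // Writable#readFields()
      imag = in.readDouble();
    }

    public long getLength() { return 16; }         // two doubles, in bytes
    public String[] getLocations() { return new String[0]; }  // no locality
  }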

Re: Trying to run nutch: no address associated with name

2007-07-12 Thread Doug Cutting
In the slaves file, 'localhost' should only be used alone, not with other hosts, since 'localhost' is not a name that other hosts can use to refer to a host. It's equivalent to 127.0.0.1, the loopback address. So, if you're specifying more than one host, it's best to use real hostnames or IP
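
For example, a conf/slaves sketch with placeholder hostnames, one per line:

  node1.example.com
  node2.example.com
  node3.example.com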

Re: undelete

2007-07-17 Thread Doug Cutting
Since Hadoop 0.12, if you configure fs.trash.interval to a non-zero value then 'bin/hadoop dfs -rm' will move things to a trash directory instead of immediately removing them. The Trash is periodically emptied of older items. Perhaps we should change the default value for this to 60 (one
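
A hadoop-site.xml sketch enabling the trash with the one-hour interval Doug mentions:

  <property>
    <name>fs.trash.interval</name>
    <!-- minutes between trash checkpoints; 0, the default, disables trash -->
    <value>60</value>
  </property>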

Re: HDFS replica management

2007-07-17 Thread Doug Cutting
Phantom wrote: Here is the scenario I was concerned about. Consider three nodes in the system A, B and C which are placed say in different racks. Let us say that the disk on A fries up today. Now the blocks that were stored on A are not going to be re-replicated (this is my understanding but I

Re: HDFS replica management

2007-07-17 Thread Doug Cutting
Phantom wrote: I am sure re-replication is not done on every heartbeat miss since that would be very expensive and inefficient. At the same time you cannot really tell if a node is partitioned away, crashed or just slow. Is it threshold based, i.e. I missed N heartbeats so re-replicate? Yes,

Re: NameNode failover procedure

2007-07-20 Thread Doug Cutting
Andrzej Bialecki wrote: So far I learned that the secondary namenode keeps refreshing periodically its backup copies of fsimage and editlog files, and if the primary namenode disappears, it's the responsibility of the cluster admin to notice this, shut down the cluster, switch the configs

Re: Error reporting from map function

2007-07-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I've written a map task that will on occasion not compute the correct result. This can easily be detected, at which point I'd like the map task to report the error and terminate the entire map/reduce job. Does anyone know of a way I can do this? You can easily kill

Re: To solve the checksum errors on the non-ecc mem machines.

2007-08-14 Thread Doug Cutting
Daeseong Kim wrote: To solve the checksum errors on the non-ecc memory machines, I modified some codes in DFSClient.java and DataNode.java. The idea is very simple. The original CHUNK structure is {chunk size}{chunk data}{chunk size}{chunk data}... The modified CHUNK structure is {chunk

Re: Specifying external jars in the classpath for Hadoop

2007-08-14 Thread Doug Cutting
Eyal Oren wrote: As far as I understand (that's what we do anyway), you have to submit one jar that contains all your dependencies (except for dependencies on hadoop libs), including external jars. The easiest is probably to use maven/ant to build such a big jar externally with all its

Re: Working with the output files of a hadoop application

2007-08-15 Thread Doug Cutting
Sebastien Rainville wrote: I am new to Hadoop. Looking at the documentation, I figured out how to write map and reduce functions but now I'm stuck... How do we work with the output file produced by the reducer? For example, the word count example produces a file with words as keys and the number

Re: Is mapred-default.xml read for dfs config?

2007-08-16 Thread Doug Cutting
Yes, that sounds correct. However it will probably change in 0.15, since so many folks have found it confusing. Exactly how it will change is still a matter of open debate. https://issues.apache.org/jira/browse/HADOOP-785 Doug Michael Bieniosek wrote: The wiki page

Hadoop release 0.14.0 available

2007-08-21 Thread Doug Cutting
New features in release 0.14.0 include: - Better checksums in HDFS. Checksums are no longer stored in parallel HDFS files, but are stored directly by datanodes alongside blocks. This is more efficient for the namenode and also improves data integrity. - Pipes: A C++ API for MapReduce -

Re: Reduce Performance

2007-08-23 Thread Doug Cutting
Thorsten Schuett wrote: During the copy phase of reduce, the cpu load was very low and vmstat showed constant reads from the disk at ~15MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15MB/s. I don't see what else could limit the performance here. The disk

Re: Problem submitting a job with hadoop 0.14.0

2007-08-24 Thread Doug Cutting
Thomas Friol wrote: Other question: why is 'hadoop.tmp.dir' user.name dependent? We need a directory that a user can write to and that does not interfere with other users. If we didn't include the username, then different users would share the same tmp directory. This can cause
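
This is why the stock default interpolates the username; roughly, from hadoop-default.xml:

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- ${user.name} expands per user, so each user gets a separate tmp tree -->
    <value>/tmp/hadoop-${user.name}</value>
  </property>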

Re: Poly-reduce?

2007-08-24 Thread Doug Cutting
Ted Dunning wrote: It isn't hard to implement these programs as multiple fully fledged map-reduces, but it appears to me that many of them would be better expressed as something more like a map-reduce-reduce program. [ ... ] Expressed conventionally, this would have to write all of the user

Re: Issues with 0.14.0...

2007-08-24 Thread Doug Cutting
Michael Stack wrote: You might try backing out the HADOOP-1708 patch. It changed the test guarding the log message you report below. HADOOP-1708 isn't in 0.14.0. Doug

Re: FW: Removing files after processing

2007-08-28 Thread Doug Cutting
I think this is related to HADOOP-1558: https://issues.apache.org/jira/browse/HADOOP-1558 Per-job cleanups that are not run clientside must be run in a separate JVM, since we, as a rule, don't run user code in long-lived daemons. Doug Stu Hood wrote: Does anyone have any ideas on this

Re: FW: Removing files after processing

2007-08-28 Thread Doug Cutting
Matt Kent wrote: I would find it useful to have some sort of listener mechanism, where you could register an object to be notified of a job completion event and then respond to it accordingly. There is a job completion notification feature. <property><name>job.end.notification.url</name>

Re: Using Map/Reduce without HDFS?

2007-08-31 Thread Doug Cutting
mfc wrote: How can this get higher on the priority list? Even just a single appender. Fundamentally, priorities are set by those that do the work. As a volunteer organization, we can't assign tasks. Folks must volunteer to do the work. Y! has volunteered more than others on Hadoop, but

Re: Using Map/Reduce without HDFS?

2007-08-31 Thread Doug Cutting
Ted Dunning wrote: Presumably this won't be the kind of thing an outsider could do easily. There are no outsiders here, I hope! We try to conduct everything in the open, from design through implementation and testing. If you feel that you're missing discussions, please ask questions. Some

Re: Compression using Hadoop...

2007-08-31 Thread Doug Cutting
Arun C Murthy wrote: One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile Of course this means you will have to do a conversion from .gzip to .seq file and load it onto hdfs for your job,
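
A conversion sketch using block compression, which keeps the file splittable across map tasks; the path and key/value types are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class ToSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // BLOCK compression batches many records per compressed block
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path("/data/input.seq"),
          LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK);
      writer.append(new LongWritable(0), new Text("first record"));
      writer.close();
    }
  }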

Re: Compression using Hadoop...

2007-09-04 Thread Doug Cutting
Ted Dunning wrote: I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel. I needed to add 10 source roots in IntelliJ to get a clean compile. In this process, I noticed some circular dependencies. Would the committers be open to some small

Hadoop release 0.14.1 available

2007-09-05 Thread Doug Cutting
Release 0.14.1 fixes bugs in 0.14.0. For release details and downloads, visit: http://lucene.apache.org/hadoop/releases.html Thanks to all who contributed to this release! Doug

Re: rack-awareness for hdfs

2007-09-17 Thread Doug Cutting
Jeff Hammerbacher wrote: has anyone leveraged the ability of datanodes to specify which datacenter and rack they live in? if so, any evidence of performance improvements? it seems that rack-awareness is only leveraged in block replication, not in task execution. It often doesn't make a big

Re: Hadoop uses Client VM?

2007-09-18 Thread Doug Cutting
Toby DiPasquale wrote: Why does Hadoop use the Client JVM? I've been told that you should almost never use the Client JVM and instead use the Server JVM for anything even remotely long-running. Is the Server JVM less stable? It doesn't specify the client JVM, rather it just doesn't specify the
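
So selecting the server VM is a one-line change; a conf/hadoop-env.sh sketch, on the assumption that you want to force the choice rather than leave it to the JVM:

  # ask for the server compiler explicitly; by default Hadoop
  # leaves the client/server choice to the JVM itself
  export HADOOP_OPTS=-server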

Re: JOIN-type operations with Hadoop...

2007-09-18 Thread Doug Cutting
Ted Dunning wrote: Is there any way to add our support to your proposal? Would that even help? Yes, please. Join the incubator-general mailing list and participate in the discussion. Your opinion is welcome there. Only votes from folks on the Incubator's PMC are binding, but votes from

Re: Reduce Performance

2007-09-21 Thread Doug Cutting
Ross Boucher wrote: My cluster has 4 machines on it, so based on the recommendations on the wiki, I set my reduce count to 8. Unfortunately, the performance was less than ideal. Specifically, when the map functions had finished, I had to wait an additional 40% of the total job time just for

Re: a million log lines from one job tracker startup

2007-09-26 Thread Doug Cutting
kate rhodes wrote: It retries as fast as it can. Yes, I can see that. It seems we should either insert a call to 'sleep(1000)' at JobTracker.java line 696, or remove that while loop altogether, since JobTracker#startTracker() will already retry on a one-second interval. In the latter

Re: Hadoop Get-Together Details

2007-09-27 Thread Doug Cutting
C G wrote: Are there any other east coast developers interested in a Boston-area get together? FYI, I'll be at ApacheCon in Atlanta this November 14th and 15th, which might be a good place for a Hadoop BOF. http://www.us.apachecon.com/ Doug

Re: Multicore nodes

2007-10-01 Thread Doug Cutting
Toby DiPasquale wrote: In short, yes. Hadoop's code takes advantage of multiple native threads and you can tune the level of concurrency in the system by setting mapred.map.tasks and mapred.reduce.tasks to take advantage of multiple cores on the nodes which have them. More importantly, you
