Jeff Ritchie wrote:
Hadoop wiki could use some configuration ;)
It sure could! I requested that it be created back in January:
https://issues.apache.org/jira/browse/INFRA-701
But for some reason no new Apache wikis have been created since then...
Doug
Yonik Seeley wrote:
The JavaDoc points out one: single-writer, stream only (no record
append, no writing to specific spot in file, etc). Is that a
different design decision, or simply something that hasn't been
implemented yet?
It's a simplification. We may add appends and multiple writers
Aled Jones wrote:
Anyhoo, I'm fairly new to hadoop and was wondering about the redundancy
aspects of it. If I have a few servers running for nutch, one being a
name and data node, the others just data nodes, what happens when the
name node falls over? To get proper redundancy in a hadoop
Raghavendra Prabhu wrote:
Is pure WinXP operation (i.e., without Cygwin) supported now, since df is
supported now (thanks to the group)?
No, cygwin is still required.
Maybe this is an isolated case in WinXP operation, but can someone check and
confirm? It would be helpful.
I try to
Raghavendra Prabhu wrote:
I would like to contribute
I will try to write one in my spare time (the time spent on something other
than comprehending the architecture)
Great!
Please read the contribution instructions on the wiki:
http://wiki.apache.org/lucene-hadoop/HowToContribute
Thanks,
Dennis Kubes wrote:
I keep seeing references to job.jar files. Can someone explain what the
job.jar files are and are they only used in distributed mode?
They are only required for distributed operation. They permit a job to
provide code that is not installed on all nodes. In general, user
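As a rough illustration of how a job.jar comes into play (the driver class name here is hypothetical), pointing JobConf at a class from your jar is what tells Hadoop which jar to ship to the nodes:

import org.apache.hadoop.mapred.JobConf;

public class JobJarExample {
  public static JobConf configure() {
    // Hadoop locates the jar containing this class and ships it as the job.jar.
    JobConf conf = new JobConf(JobJarExample.class);
    // conf.setJarByClass(JobJarExample.class); // equivalent explicit call in later releases
    return conf;
  }
}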
Scott Simpson wrote:
Excuse my ignorance on this issue. Say I have 5 machines in my Hadoop
cluster and I only list two of them in the configuration file when I do a
fetch or a generate. Won't this just store the data on the two nodes
since that is all I've listed for my crawling machines? I'm
Vijay Murthi wrote:
Are you running the current trunk? My guess is that you are. If so,
then this error is normal, things should keep running.
I am using hadoop-0.2.0. I believe this is the current trunk.
No, that's a release. The trunk is what's currently in Subversion.
I used to think
The easiest way would be to not use anything but your reliable machines
as datanodes. Alternatively, for better performance, you could run two
DFS systems, one on all machines, and one on just the reliable machines,
and back one up to the other before you shut down the unreliable nodes
each
Paul Sutter wrote:
it should be possible to have lots of tasks in the shuffle phase
(mostly, sitting around waiting for mappers to run), but only have
about one actual reduce phase running per cpu (or whatever works for
each of our apps) that gets enough memory for a sorter, does
substantial
You don't want to use DFS on top of NFS. If you use DFS, keep its data
on the local drives, not in NFS. If you want to use NFS for shared
data, then simply don't use DFS: specify local as the filesystem and
don't start datanodes or a namenode.
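A minimal sketch of that last option, assuming the older-style configuration keys of this era (the value is illustrative and would normally live in hadoop-site.xml rather than code):

import org.apache.hadoop.mapred.JobConf;

public class LocalFsExample {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // "local" selects the plain local (or NFS-mounted) filesystem instead of DFS,
    // so no namenode or datanodes are needed.
    conf.set("fs.default.name", "local");
    return conf;
  }
}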
I think you'll find DFS will perform better
Konstantin Shvachko wrote:
On the logging issue. I think we should change the default logging level,
which is INFO at the moment.
I think INFO is the appropriate default logging level. If there are
things logged at the INFO level that are too verbose, then we should
change these to DEBUG
Renaud Richardet wrote:
Does Hadoop require Java 5?
Yes. We're not yet extensively using or encouraging Java 5 features,
but it is now required.
I get a compile error when building the
trunk with Java 1.4. The change below will make it build again.
I think there are more changes
Yoram Arnon wrote:
User code data gets written to the tasktracker's log at the INFO level.
We switched to WARNING level when a rogue user program produced a lot of
output to stdout, and it
filled the task trackers' logs with junk.
Another approach might be to log warning and fatal messages to a
Hadoop uses Commons Logging:
http://jakarta.apache.org/commons/logging/
One should be able to configure it to use other logging backends or a
null logger:
http://jakarta.apache.org/commons/logging/commons-logging-1.1/guide.html#Configuration
Please tell us how this works.
Doug
Dilma
Has anyone tried running Hadoop on the Amazon Elastic Compute Cloud yet?
http://www.amazon.com/gp/browse.html?node=201590011
One way to use Hadoop on this would be to:
1. Allocate a pool of machines.
2. Start Hadoop daemons.
3. Load the HDFS filesystem with input from Amazon S3.
4. Run a
To generate a single output file, specify just a single reduce task. If
your reducer isn't doing much computation, then it might be faster to do
this in the original job, otherwise use a subsequent job.
Doug
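A minimal sketch of the single-reduce setup described above (the JobConf is assumed to be the job's own configuration):

import org.apache.hadoop.mapred.JobConf;

public class SingleOutputExample {
  public static void configure(JobConf conf) {
    // A single reduce task yields a single output file (part-00000).
    conf.setNumReduceTasks(1);
  }
}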
Dennis Kubes wrote:
This is probably a simple question but when I run my MR job I am
Frédéric Bertin wrote:
This should run clientside, since it depends on the username, which is
different on the server.
then, what about passing the username as a parameter to the
JobSubmissionProtocol.submitJob(...) ? This avoids loading the whole
JobConf clientside just to set the username.
Eric Baldeschwieler wrote:
Also the thread I started last week on using URLs in general for input
arguments. Seems like we should just take a URL for the jar, which
could be file: or hdfs:
That would work. The jobclient could automatically copy file: urls to
the jobtracker's native fs.
Frédéric Bertin wrote:
Indeed, I would like to have a centralized jobs repository on the HDFS
where all jobs will be stored. Something like
/jobs
  /job1
    job1.xml
    job1.jar
  /job2
    job2.xml
    job2.jar
  ...
Then, submitting a job would be as simple as
Sylvain Wallez wrote:
I don't know Hadoop's internals well, but it seems to me that an
additional configuration could do the trick, e.g.
String itfAddr = conf.get("ipc.server.listen.address");
address = (itfAddr == null) ? new InetSocketAddress(port)
                            : new InetSocketAddress(itfAddr, port);
This sounds like:
http://issues.apache.org/jira/browse/HADOOP-534
A patch should be committed to trunk tomorrow, and a point release will
be made shortly thereafter. In the meantime, you could experiment with
the 0.5.0 release.
Doug
Aaron Wong wrote:
Hi,
I'm new to hadoop. I was going
Curt Cox wrote:
I'm curious why the new Writable interface was chosen rather than
using Serializable.
The Writable interface is subtly different from Serializable.
Serializable does not assume the class of stored values is known. So
each instance is tagged with its class.
Curt Cox wrote:
Let me restate, so you can tell me if I'm wrong. Writable is used
instead of Serializable, because it provides for more compact stream
format and allows for easier random access. They have different
semantics, but don't have a major impact on versioning.
Serialization's
Curt Cox wrote:
In my experience, using Serialization instead of DataInput/DataOutput
streams has a major impact on versioning. Serialization keeps a lot
of metadata in the stream. This makes detecting format changes very
easy, but can really complicate backward compatibility.
FYI, Owen has
Feng Jiang wrote:
As for the IPC (it used to be RPC about one year ago) implementation, I
think it has some performance problems. I don't know why the Listener has to
read the data and prepare the Call instance, then put the Call instance into
a queue. The reading process may take a long time, and
Feng Jiang wrote:
In my implementation, I still permit the out-of-order RPC call by the same
way. the only difference between my impl and your previous impl is:
1. I made use of a thread pool (JDK 1.5) to replace the Handler threads. I
believe the JDK's impl should not be worse than ours, and
Owen O'Malley wrote:
SequenceFile.Writer is
more than willing to write unsorted files.
However MapFile.Writer would complain, since it creates an index for
random-access, and requires that the data is well sorted. So it depends
on your output format: SequenceFileOutputFormat and
This refers to formatting Hadoop's DFS filesystem, not formatting a
linux volume. Hadoop's DFS filesystem is implemented on top of the local
filesystems of your cluster. Hadoop does not require reformatting of
linux filesystem volumes. Formatting a Hadoop DFS filesystem simply
creates a few
Trevor Strohman wrote:
Yes, this sounds very interesting. Does it build on the Record IO
classes or is it completely separate?
I'm afraid it's completely separate, although it's not much code. The
TypeBuilder is ~600 lines of code right now, plus maybe 500 lines of
additional support
any mapreduce tasks finish and moving chunks to another
box.
Lee
On 10/11/06, Doug Cutting [EMAIL PROTECTED] wrote:
Trevor Strohman wrote:
Grid Engine: All the machines available to me run Sun's Grid Engine for
job submission. Grid Engine is important for us, because it makes sure
that all
Andrzej Bialecki wrote:
Grant Ingersoll wrote:
2. This time, instead of tokens I have X number of whole documents
that need to be translated from source to destination and the way the
translation systems work, it is best to have the whole document
together when getting a translation. My plan
I just added a new wiki page describing how I was able to use Hadoop on
Amazon's EC2 computing infrastructure. If others test this, please help
improve it.
http://wiki.apache.org/lucene-hadoop/AmazonEC2
Thanks,
Doug
howard chen wrote:
but when I stop-all --config... it shows...
no jobtracker to stop
serverA: Login Success!
serverB: Login Success!
serverB: no tasktracker to stop
It looks like the tasktracker crashed on startup. Login to ServerB and
look in its logs to see what happened.
Doug
howard chen wrote:
2006-11-07 21:53:35,492 ERROR org.apache.hadoop.mapred.TaskTracker:
Can not start task tracker because java.lang.RuntimeException: Bad
mapred.job.tracker: local
To run distributed, you must configure mapred.job.tracker and
fs.default.name to be host:port pairs on all
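A hedged sketch of such a configuration, using placeholder hostnames and ports; these settings would normally go into hadoop-site.xml on every node rather than into code:

import org.apache.hadoop.mapred.JobConf;

public class DistributedConfExample {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Hostname and ports are placeholders.
    conf.set("fs.default.name", "master.example.com:9000");
    conf.set("mapred.job.tracker", "master.example.com:9001");
    return conf;
  }
}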
Feng Jiang wrote:
look at the code:
job.setNumReduceTasks(1); // force a single reduce task
why? Is there any difficulty there to allow multiple reduce tasks?
There is not a strong reason why a single reduce task is required. This
code attempts to implement things as simply
Brendan Melville wrote:
in hadoop-site.xml I had mapred.map.tasks and mapred.reduce.tasks set.
Right, these parameters should be specified in mapred-default.xml, so
that they do not override application code. This is a common confusion.
Someday we should perhaps alter the configuration
Albert Chern wrote:
Every time the size of the map file hits a multiple of the index
interval, an index entry is written. Therefore, it is possible that
an index entry is not added for the first occurrence of a key, but one
of the later ones. The reader will then seek to one of those instead
Owen O'Malley wrote:
I think Hadoop is pronounced as h a: - d u: p with the emphasis on the
second syllable.
(key: http://en.wikipedia.org/wiki/IPA_chart_for_English)
I believe the first vowel there is properly ae (as in cat), but in
rapid speech this unstressed vowel turns to a schwa, so
Jagadeesh wrote:
Over the past day we have managed to migrate our clusters from 0.7.2 to
0.9.0.
Thanks for sharing your experiences.
Please note that there is now a 0.9.2 release. There should be no
compatibility issues upgrading from 0.9.0 to 0.9.2, and a number of bugs
are fixed, so I
Shannon -jj Behrens wrote:
The default JAVA_HOME in hadoop-env.sh is /usr/bin/java. This is
confusing because /usr/bin/java is a binary, not a directory. On my
system, this resulted in:
$ hadoop namenode -format
/usr/local/hadoop-install/hadoop/bin/hadoop: 122:
/usr/bin/java/bin/java: not
Can you please file a bug in Jira for this?
https://issues.apache.org/jira/browse/HADOOP
Select CREATE NEW ISSUE. Create yourself a Jira account if you don't
already have one.
Thanks,
Doug
Shannon -jj Behrens wrote:
I'm using Hadoop on Ubuntu 6.10. I ran into:
$ start-all.sh
starting
Shannon -jj Behrens wrote:
There's no link to http://wiki.apache.org/lucene-hadoop/HadoopStreaming on
http://wiki.apache.org/lucene-hadoop/. It would be really nice if
there were one.
Please add one. Anyone can help maintain the wiki. Simply create
yourself an account and edit the page.
Tom White wrote:
And what do people think of the following? We already have a bunch of
stuff up in S3 that we'd like to use as input to a hadoop mapreduce job
only it wasn't put there by hadoop so it doesn't have the hadoop format
where file-is-actually-a-list-of-blocks. [ ... ]
The best
Tom White wrote:
This sounds like a good plan. I wonder whether the existing
block-based s3 scheme should be renamed (as s3block or similar) so s3
is the scheme that stores raw files as you describe?
Perhaps s3fs would be best for the full FileSystem implementation, and
simply s3 for direct
Andrzej Bialecki wrote:
It's possible to use Hadoop DFS to host a read-only Lucene index and use
it for searching (Nutch has an implementation of FSDirectory for this
purpose), but the performance is not stellar ...
Right, the best practice is to copy Lucene indexes to local drives in
order
Johan Oskarsson wrote:
Any advice on how to solve this problem?
I think your current solutions sound reasonable.
Would it be possible to somehow share a hashmap between tasks?
Not without running multiple tasks in the same JVM. We could implement
a mode where child tasks are run directly
Gu wrote:
How can I use in some case MultithreadedMapRunner, and in some case
MapRunner for different jobs?
Use JobConf#setMapRunnerClass() on jobs that you want to override the
default MapRunner, with, e.g. MultithreadedMapRunner.
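A small sketch of that call; the JobConf here is assumed to be the per-job configuration:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MapRunnerExample {
  public static void configure(JobConf conf) {
    // Only jobs that set this use the multithreaded runner;
    // other jobs keep the default MapRunner.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
  }
}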
Do I have to use one hadoop-site.xml for one job? But I
Konstantin Shvachko wrote:
200 bytes per file is theoretically correct, but rather optimistic :-(
From a real system memory utilization I can see that HDFS uses 1.5-2K
per file.
And since each real file is internally represented by two files (1 real
+ 1 crc) the real
estimate per file should
Andy Liu wrote:
I'm exploring the possibility of using the Hadoop records framework to
store
these document records on disk. Here are my questions:
1. Is this a good application of the Hadoop records framework, keeping in
mind that my goals are speed and scalability? I'm assuming the answer
Eelco Lempsink wrote:
Inspired by
http://www.mail-archive.com/[EMAIL PROTECTED]/msg02394.html
I'm trying to run Hadoop on multiple CPU's, but without using HDFS.
To be clear: you need some sort of shared filesystem, if not HDFS, then
NFS, S3, or something else. For example, the job client
Please use a new subject when starting a new topic.
jafarim wrote:
Sorry if being off topic, but we experienced a very low bandwidth with
hadoop while copying files to/from the cluster (some 1/100 compared to a
plain samba share). The bandwidth did not improve at all by adding nodes to
the
Ken Krugler wrote:
Has anybody been using Hadoop with ZFS? Would ZFS count as a readily
available shared file system that scales appropriately?
Sun's ZFS? I don't think that's distributed, is it? Does it provide a
single namespace across an arbitrarily large cluster? From the
jafarim wrote:
On linux and jvm6 with normal IDE disks and a giga ethernet switch with
corresponding NIC and with hadoop 0.9.11's HDFS. We wrote a C program by
using the native libs provided in the package but then we tested again with
distcp. The scenario was as follows:
We ran the test on a
Eelco Lempsink wrote:
I'm not trying to run it on a cluster though, only on one host with
multiple CPU's. So I guess the local filesystem is shared and therefore
it should be fine.
Yes, that should be fine.
However, if I try with fs.default.name set to file:///tmp/hadoop-test/
still
Pedro Guedes wrote:
For this I need to be able to register new steps in my chain and pass
them to hadoop to execute as a mapreduce job. I see two choices here:
1 - build a .job archive (main-class: mycrawler, submits jobs thru
JobClient) with my new steps and dependencies in the 'lib/'
Dennis Kubes wrote:
Do we know if this is a hardware issue? If it is possibly a software
issue I can dedicate some resources to tracking down bugs. I would just
need a little guidance on where to start looking.
We don't know. The checksum mechanism is designed to catch hardware
problems.
What version of Hadoop are you using? On what sort of a cluster? How
big is your dataset?
Doug
moonwatcher wrote:
hey guys,
i've set up hadoop in distributed mode (jobtracker, tasktracker, and hdfs daemons), and am observing that the map phase executes really quickly but the reduce phase
Phantom wrote:
(1) Set my fs.default.name to hdfs://host:port and also specify it
in the JobConf configuration. Copy my sample input file into HDFS using
bin/hadoop fs -put from my local file system. I then need to specify this
file to my WordCount sample as input. Should I specify this file
Calvin Yu wrote:
The problem seems to be with the MapTask's (MapTask.java) sort
progress thread (line #196) not stopping after the sort is completed,
and hence the call to join() (line# 190) never returns. This is
because that thread is only catching the InterruptedException, and not
checking
a
thread dump of the hang up.
Calvin
On 6/1/07, Doug Cutting [EMAIL PROTECTED] wrote:
Calvin Yu wrote:
The problem seems to be with the MapTask's (MapTask.java) sort
progress thread (line #196) not stopping after the sort is completed,
and hence the call to join() (line# 190) never returns
Mark Meissonnier wrote:
Sweet. It works. Thanks
Someone should put it on this wiki page
http://wiki.apache.org/lucene-hadoop/hadoop-0.1-dev/bin/hadoop_dfs
I don't have editing privileges.
Anyone can create themselves a wiki account and edit pages. Just use
the Login button at the top of
Neeraj Mahajan wrote:
I read from Hadoop docs that the task scheduler tries to execute the task
closer to the data. Can this functionality be applied without using HDFS?
How?
You can subclass LocalFileSystem and override getFileCacheHints() to
return the host where the file is known to be
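A rough sketch of such a subclass, assuming the FileSystem API of this era (the method signature may differ across releases, and the hostname is a placeholder):

import java.io.IOException;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class HintedLocalFileSystem extends LocalFileSystem {
  public String[][] getFileCacheHints(Path f, long start, long len) throws IOException {
    // Report the node that actually holds the file so the scheduler can place tasks there.
    return new String[][] { { "datahost.example.com" } };
  }
}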
Phantom wrote:
Which would mean that if I want to have my logs reside in HDFS I will
have to move them using copyFromLocal or some version thereof and then run a
Map/Reduce process against them? Am I right?
Yes. HDFS is probably not currently suitable for directly storing log
output as it
Every 128th key is held in memory. So if you've got 1M keys in a
MapFile, then opening a MapFile.Reader would read 10k keys into memory.
Binary search is used on these in-memory keys, so that a maximum of
127 entries must be scanned per random access.
Doug
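A small sketch of a random-access lookup against a MapFile; the path and key are placeholders, and the index interval of 128 is the default mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // "my.map" is a placeholder MapFile directory; key/value types must match the writer's.
    MapFile.Reader reader = new MapFile.Reader(fs, "my.map", conf);
    IntWritable value = new IntWritable();
    // Binary search over the in-memory index, then a short scan of at most
    // index-interval entries (128 by default) in the data file.
    reader.get(new Text("some-key"), value);
    reader.close();
  }
}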
Phantom wrote:
Hi All
I know
Jun Rao wrote:
I am wondering if anyone has experienced this problem. Sometimes when I
ran a job, a few map tasks (often just one) hang in the initializing phase
for more than 3 minutes (it normally finishes in a couple seconds). They
will eventually finish, but the whole job is slowed down
Raghu Angadi wrote:
Doug Cutting wrote:
Owen wrote:
One side note is that all of the servers have a servlet such that if
you do http://node:port/stacks you'll get a stack trace of all
the threads in the server. I find that useful for remote debugging.
*smile* Although if it is a task jvm
Mathijs Homminga wrote:
Is there a way to easily determine the efficiency of my cluster?
Example:
- there are 5 slaves which can handle 1 task at the time each
- there is one job, split into 5 sub tasks (5 maps and 5 reduces)
- 4 slaves finish their tasks in 1 minute
- 1 slave finishes its tasks
James Kennedy wrote:
So far I've had trouble finding examples of MapReduce jobs that are
kicked off by some one-time process that in turn kicks off other
MapReduce jobs long after the initial driver process is dead. This
would be more distributed and fault tolerant since it removes
KrzyCube wrote:
I found that File[] editFiles in FSEditLog.java, then I traced the
call stack and found that it can be configured with multiple values of
dfs.name.dir. Does this mean the NameNode data can be split into pieces, or
just that replication is set to the number of the dirs that
You could define an InputFormat whose InputSplits are not files, but
rather simply have a field that is a complex number. The complex field
would be written and read by Writable#write() and Writable#readFields.
This InputFormat would ignore the input directory, since it is not a
file-based
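A hedged sketch of such a split, using the old org.apache.hadoop.mapred.InputSplit interface; the field layout is purely illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;

public class ComplexSplit implements InputSplit {
  private double re, im;

  public ComplexSplit() {}                         // required for deserialization
  public ComplexSplit(double re, double im) { this.re = re; this.im = im; }

  public void write(DataOutput out) throws IOException {
    out.writeDouble(re);
    out.writeDouble(im);
  }

  public void readFields(DataInput in) throws IOException {
    re = in.readDouble();
    im = in.readDouble();
  }

  public long getLength() throws IOException { return 0; }                       // no backing file
  public String[] getLocations() throws IOException { return new String[0]; }    // no locality preference
}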
In the slaves file, 'localhost' should only be used alone, not with
other hosts, since 'localhost' is not a name that other hosts can use to
refer to a host. It's equivalent to 127.0.0.1, the loopback address.
So, if you're specifying more than one host, it's best to use real
hostnames or IP
Since Hadoop 0.12, if you configure fs.trash.interval to a non-zero
value then 'bin/hadoop dfs -rm' will move things to a trash directory
instead of immediately removing them. The Trash is periodically emptied
of older items. Perhaps we should change the default value for this to
60 (one
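A minimal sketch of enabling the trash programmatically; the value is in minutes, and this would more typically be set in hadoop-site.xml:

import org.apache.hadoop.conf.Configuration;

public class TrashConfExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Interval in minutes between trash checkpoints; 0 (the default) disables the trash.
    conf.setInt("fs.trash.interval", 60);
  }
}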
Phantom wrote:
Here is the scenario I was concerned about. Consider three nodes in the
system A, B and C which are placed say in different racks. Let us say that
the disk on A fries up today. Now the blocks that were stored on A are not
going to be re-replicated (this is my understanding but I
Phantom wrote:
I am sure re-replication is not done on every heartbeat miss since that
would be very expensive and inefficient. At the same time you cannot really
tell if a node is partitioned away, crashed or just slow. Is it threshold
based, i.e. I missed N heartbeats so re-replicate?
Yes,
Andrzej Bialecki wrote:
So far I learned that the secondary namenode keeps refreshing
periodically its backup copies of fsimage and editlog files, and if the
primary namenode disappears, it's the responsibility of the cluster
admin to notice this, shut down the cluster, switch the configs
[EMAIL PROTECTED] wrote:
I've written a map task that will on occasion not compute the correct
result. This can easily be detected, at which point I'd like the map
task to report the error and terminate the entire map/reduce job. Does
anyone know of a way I can do this?
You can easily kill
Daeseong Kim wrote:
To solve the checksum errors on the non-ecc memory machines, I
modified some codes in DFSClient.java and DataNode.java.
The idea is very simple.
The original CHUNK structure is
{chunk size}{chunk data}{chunk size}{chunk data}...
The modified CHUNK structure is
{chunk
Eyal Oren wrote:
As far as I understand (that's what we do anyway), you have to submit
one jar that contains all your dependencies (except for dependencies on
hadoop libs), including external jars. The easiest is probably to use
maven/ant to build such a big jar externally with all its
Sebastien Rainville wrote:
I am new to Hadoop. Looking at the documentation, I figured out how to
write map and reduce functions but now I'm stuck... How do we work with
the output file produced by the reducer? For example, the word count
example produces a file with words as keys and the number
Yes, that sounds correct. However it will probably change in 0.15,
since so many folks have found it confusing. Exactly how it will change
is still a matter of open debate.
https://issues.apache.org/jira/browse/HADOOP-785
Doug
Michael Bieniosek wrote:
The wiki page
New features in release 0.14.0 include:
- Better checksums in HDFS. Checksums are no longer stored in parallel
HDFS files, but are stored directly by datanodes alongside blocks. This
is more efficient for the namenode and also improves data integrity.
- Pipes: A C++ API for MapReduce
-
Thorsten Schuett wrote:
During the copy phase of reduce, the cpu load was very low and vmstat showed
constant reads from the disk at ~15MB/s and bursty writes. At the same time,
data was sent over the loopback device at ~15MB/s. I don't see what else
could limit the performance here. The disk
Thomas Friol wrote:
Another question: why is 'hadoop.tmp.dir' user.name dependent?
We need a directory that a user can write to and that does not interfere with
other users. If we didn't include the username, then different users
would share the same tmp directory. This can cause
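A small illustration of that expansion; the default value shown in the comment is approximate and may differ between releases:

import org.apache.hadoop.conf.Configuration;

public class TmpDirExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The stock default is along the lines of /tmp/hadoop-${user.name};
    // Configuration expands ${user.name} to the current user, so each user
    // gets a private temporary directory.
    System.out.println(conf.get("hadoop.tmp.dir"));
  }
}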
Ted Dunning wrote:
It isn't hard to implement these programs as multiple fully fledged
map-reduces, but it appears to me that many of them would be better
expressed as something more like a map-reduce-reduce program.
[ ... ]
Expressed conventionally, this would have to write all of the user
Michael Stack wrote:
You might try backing out the HADOOP-1708 patch. It changed the test
guarding the log message you report below.
HADOOP-1708 isn't in 0.14.0.
Doug
I think this is related to HADOOP-1558:
https://issues.apache.org/jira/browse/HADOOP-1558
Per-job cleanups that are not run clientside must be run in a separate
JVM, since we, as a rule, don't run user code in long-lived daemons.
Doug
Stu Hood wrote:
Does anyone have any ideas on this
Matt Kent wrote:
I would find it useful to have some sort of listener mechanism, where
you could register an object to be notified of a job completion event
and then respond to it accordingly.
There is a job completion notification feature.
<property>
  <name>job.end.notification.url</name>
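A rough programmatic sketch of configuring this notification; the URL and host are placeholders, and $jobId and $jobStatus are, as best I recall, the sentinels the framework substitutes when sending the notification:

import org.apache.hadoop.mapred.JobConf;

public class JobNotificationExample {
  public static void configure(JobConf conf) {
    // Placeholder listener URL; the framework fills in $jobId and $jobStatus.
    conf.set("job.end.notification.url",
             "http://myserver.example.com/notify?id=$jobId&status=$jobStatus");
  }
}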
mfc wrote:
How can this get higher on the priority list? Even just a single appender.
Fundamentally, priorities are set by those that do the work. As a
volunteer organization, we can't assign tasks. Folks must volunteer to
do the work. Y! has volunteered more than others on Hadoop, but
Ted Dunning wrote:
Presumably this won't be the kind of thing an outsider could do easily.
There are no outsiders here, I hope! We try to conduct everything in
the open, from design through implementation and testing. If you feel
that you're missing discussions, please ask questions. Some
Arun C Murthy wrote:
One way to reap benefits of both compression and better parallelism is to use
compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
Of course this means you will have to do a conversion from .gzip to .seq file
and load it onto hdfs for your job,
Ted Dunning wrote:
I have to say, btw, that the source tree structure of this project is pretty
ornate and not very parallel. I needed to add 10 source roots in IntelliJ to
get a clean compile. In this process, I noticed some circular dependencies.
Would the committers be open to some small
Release 0.14.1 fixes bugs in 0.14.0.
For release details and downloads, visit:
http://lucene.apache.org/hadoop/releases.html
Thanks to all who contributed to this release!
Doug
Jeff Hammerbacher wrote:
has anyone leveraged the ability of datanodes to specify which datacenter
and rack they live in? if so, any evidence of performance improvements? it
seems that rack-awareness is only leveraged in block replication, not in
task execution.
It often doesn't make a big
Toby DiPasquale wrote:
Why does Hadoop use the Client JVM? I've been told that you should
almost never use the Client JVM and instead use the Server JVM for
anything even remotely long-running. Is the Server JVM less stable?
It doesn't specify the client JVM, rather it just doesn't specify the
Ted Dunning wrote:
Is there any way to add our support to your proposal? Would that even help?
Yes, please. Join the incubator-general mailing list and participate in
the discussion. Your opinions are welcome there. Only votes from folks
on the Incubator's PMC are binding, but votes from
Ross Boucher wrote:
My cluster has 4 machines on it, so based on the recommendations on the
wiki, I set my reduce count to 8. Unfortunately, the performance was
less than ideal. Specifically, when the map functions had finished, I
had to wait an additional 40% of the total job time just for
kate rhodes wrote:
It retries as fast as it can.
Yes, I can see that. It seems we should either insert a call to
'sleep(1000)' at JobTracker.java line 696, or remove that while loop
altogether, since JobTracker#startTracker() will already retry on a
one-second interval. In the latter
C G wrote:
Are there any other east coast developers interested in a Boston-area get
together?
FYI, I'll be at ApacheCon in Atlanta this November 14th and 15th, which
might be a good place for a Hadoop BOF.
http://www.us.apachecon.com/
Doug
Toby DiPasquale wrote:
In short, yes. Hadoop's code takes advantage of multiple native
threads and you can tune the level of concurrency in the system by
setting mapred.map.tasks and mapred.reduce.tasks to take advantage of
multiple cores on the nodes which have them.
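An illustrative sketch of setting those knobs per job; the numbers are placeholders and depend entirely on the cluster:

import org.apache.hadoop.mapred.JobConf;

public class ConcurrencyExample {
  public static void configure(JobConf conf) {
    // Illustrative values for a small cluster of multi-core nodes; tune per hardware.
    conf.setNumMapTasks(20);
    conf.setNumReduceTasks(8);
  }
}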
More importantly, you