The Hadoop version is 0.19.0.
My file is larger than 64MB, and the block size is 64MB.
The output of the code below is '10'. May I read across the block
boundary? Or should I use a 'while (left..){}' style loop?
public static void main(String[] args) throws IOException
{
Configuration conf
Hello!
I am a new Hadoop user, so my question may sound naive.
However, I would like to ask if there is a way to combine the results of two
map tasks that may run simultaneously.
I use the MultipleInputs class and thus I have two different mappers.
I want the result/output of the one map
See this; hope it answers your question:
http://developer.yahoo.com/hadoop/tutorial/module4.html#tips
On Sun, Jun 28, 2009 at 5:28 PM, bonito bonito.pe...@gmail.com wrote:
This seems to be the case. I don't think there is any specific reason
not to read across the block boundary...
Even if HDFS does read across blocks, it is still not a good idea to
ignore the JavaDoc for read(). If you want all the bytes read, then you
should use a while loop or one of
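That read-until-done pattern can be sketched in plain Java, with no Hadoop dependency; the stream and buffer sizes below are illustrative, not taken from the original poster's code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // Keep calling read() until the requested number of bytes has
    // arrived or the stream ends. A single read() may legally return
    // fewer bytes than asked for, as its JavaDoc allows.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break; // end of stream reached early
            off += n;
        }
        return off; // total bytes actually read
    }

    public static void main(String[] args) throws IOException {
        // 100 bytes available, 64-byte buffer: the loop fills all 64.
        InputStream in = new ByteArrayInputStream(new byte[100]);
        byte[] buf = new byte[64];
        System.out.println(readFully(in, buf)); // prints 64
    }
}
```

The same idea is what DataInputStream.readFully() does for you, at the cost of throwing EOFException if the stream ends before the buffer is full.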
Hi.
Wonder how one should improve the startup times of a Hadoop job. Some of my
jobs, which have a lot of dependencies in terms of many jar files, take a long
time to start in Hadoop, up to 2 minutes sometimes.
The data input amounts in these cases are negligible, so it seems that Hadoop
has a really
How long does it take to start the code locally in a single thread?
Can you reuse the JVM so it only starts once per node per job?
conf.setNumTasksToExecutePerJvm(-1)
Cheers,
Tim
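For reference, the setNumTasksToExecutePerJvm(-1) call above is, as far as I know, backed by the mapred.job.reuse.jvm.num.tasks property, so the same behaviour can be sketched as a site configuration entry (the exact file name and placement depend on your Hadoop version):

```xml
<!-- Reuse the child JVM for an unlimited number of tasks of the
     same job; the default value 1 forks a fresh JVM per task. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```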
On Sun, Jun 28, 2009 at 9:43 PM, Marcus Heroumarcus.he...@tailsweep.com wrote:
Hi.
Wonder how one should
The ChainMapper class introduced in Hadoop 0.19 provides the
ability to run an arbitrary number of map tasks one after the other,
in the context of a single job.
The one issue to be aware of is that each mapper in the chain only sees the
output of the previous map in the chain.
There
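That pass-the-output-along behaviour can be illustrated without Hadoop at all; the Mapper interface and the two mapper names below are made up for this sketch, not part of the ChainMapper API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChainSketch {
    // Stand-in for a map task: takes one record, emits zero or more.
    interface Mapper { List<String> map(String record); }

    // Each mapper consumes only what the previous one emitted,
    // mirroring how a chain of mappers threads records along.
    static List<String> runChain(List<String> input, Mapper... chain) {
        List<String> current = input;
        for (Mapper m : chain) {
            List<String> next = new ArrayList<>();
            for (String rec : current) next.addAll(m.map(rec));
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        Mapper upper = r -> Arrays.asList(r.toUpperCase());
        Mapper tag = r -> Arrays.asList("tagged:" + r);
        // The second mapper never sees "a" or "b", only "A" and "B".
        System.out.println(runChain(Arrays.asList("a", "b"), upper, tag));
        // prints [tagged:A, tagged:B]
    }
}
```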
Hi.
The crawlers are _very_ threaded, but no, we use our own threading framework,
since it was not available at the time in hadoop-core.
Crawlers normally just wait a lot on clients, inducing very little CPU load but
consuming some memory due to the parallelism.
//Marcus
On Sat, Jun 27, 2009 at 6:10
Hi.
Running without a jobtracker makes the job start almost instantly.
I think it is due to something with the classloader. I use a huge number of
jar files, jobConf.set("tmpjars", "jar1.jar,jar2.jar")..., which need to be
loaded every time, I guess.
By issuing conf.setNumTasksToExecutePerJvm(-1); will
Although I've never done it, I believe you could manually copy your jar
files out to your cluster, somewhere on Hadoop's classpath, and that would
remove the need to copy them to your cluster at the start of each
job.
On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou
This is the way we deal with this problem, too. We put our jar files on NFS,
and the attached patch makes it possible to add those jar files to the
tasktracker classpath through a configuration property.
Thanks,
Mikhail
On Sun, Jun 28, 2009 at 5:21 PM, Stuart White stuart.whi...@gmail.comwrote:
Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS,
since NFS can be slow as hell sometimes. But what the heck, we already have
our maven2 repo on NFS, so why not :)
Are you saying that this patch makes the client able to configure which
extra local jar files to add as
This kind of partial read is often used by the OS to return to your
application as soon as possible if trying to read more data would block, in
case you can begin computing on the partial data. In some applications, it's
not useful, but when you can begin computing on partial data, it allows the
On Sun, Jun 28, 2009 at 3:01 PM, Matei Zaharia ma...@cloudera.com wrote:
Marcus,
We currently use 0.20.0 but this patch just inserts 8 lines of code into
TaskRunner.java, which could certainly be done with 0.18.3.
Yes, this patch just appends additional jars to the child JVM classpath.
I've never really used tmpjars myself, but if it involves uploading multiple
jar
Marcus,
The code that needs to be patched is in the tasktracker, because the
tasktracker is what starts the child JVM that runs user code.
Thanks,
Mikhail
On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou marcus.he...@tailsweep.comwrote:
Hi.
Just to be clear: is it the jobtracker that needs the patched code, or
is it the tasktrackers?
Kindly
//Marcus
On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin mbau...@gmail.com wrote:
hi all --
Just wondering how to build the eclipse plugin. ant binary does not
seem to catch it. I would like to experiment with a few changes.
thanks!
Brien