FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread forbbs forbbs
The Hadoop version is 0.19.0. My file is larger than 64MB, and the block size is 64MB. The output of the code below is '10'. Can I read across the block boundary, or should I use 'while (left..){}'-style code? public static void main(String[] args) throws IOException { Configuration conf

combine two map tasks

2009-06-28 Thread bonito
Hello! I am a new hadoop user and my question may sound naive.. However, I would like to ask if there is a way to combine the results of two map tasks that may run simultaneously. I use the MultipleInput class and thus I have two different mappers. I want the result/output of the one map

Re: combine two map tasks

2009-06-28 Thread bharath vissapragada
See this... hope this answers your question: http://developer.yahoo.com/hadoop/tutorial/module4.html#tips On Sun, Jun 28, 2009 at 5:28 PM, bonito bonito.pe...@gmail.com wrote: Hello! I am a new hadoop user and my question may sound naive.. However, I would like to ask if there is a way to

Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread Raghu Angadi
This seems to be the case. I don't think there is any specific reason not to read across the block boundary... Even if HDFS does read across the blocks, it is still not a good idea to ignore the JavaDoc for read(). If you want all the bytes read, then you should have a while loop or one of
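As the JavaDoc warns, a single read(byte[]) call may return fewer bytes than requested (here, stopping at the HDFS block boundary), so a caller who wants the whole buffer filled needs a loop; FSDataInputStream also offers readFully() for positioned reads. A minimal stdlib-only sketch of the while-loop style, with a hypothetical readFully helper:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    // Keep reading until the buffer is full or EOF, since read() may
    // return fewer bytes than requested (e.g. at a block boundary).
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) break; // EOF before the buffer filled up
            off += n;
        }
        return off; // total bytes actually read
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[150];
        int n = readFully(new ByteArrayInputStream(new byte[150]), buf);
        System.out.println(n); // prints 150
    }
}
```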

hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi. Wondering how one should improve the startup times of a Hadoop job. Some of my jobs, which have a lot of dependencies in terms of many jar files, take a long time to start in Hadoop, up to 2 minutes sometimes. The data input amounts in these cases are negligible, so it seems that Hadoop has a really

Re: hadoop jobs take long time to setup

2009-06-28 Thread tim robertson
How long does it take to start the code locally in a single thread? Can you reuse the JVM so it only starts once per node per job? conf.setNumTasksToExecutePerJvm(-1) Cheers, Tim On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com wrote: Hi. Wonder how one should
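The JVM-reuse setting Tim mentions can also be applied through configuration rather than code; the underlying property behind setNumTasksToExecutePerJvm is mapred.job.reuse.jvm.num.tasks, where -1 means a task JVM is reused without limit within a job. A sketch of the mapred-site.xml fragment, assuming that property name for this Hadoop generation:

```xml
<!-- mapred-site.xml: let each task JVM run an unlimited number of
     tasks of the same job instead of forking a fresh JVM per task -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

Note this only reuses JVMs within a single job; it does not share JVMs across jobs.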

Re: combine two map tasks

2009-06-28 Thread jason hadoop
The ChainMapper class introduced in Hadoop 19 will provide you with the ability to run an arbitrary number of map tasks one after the other, in the context of a single job. The one issue to be aware of is that each mapper in the chain only sees the output of the previous map in the chain. There
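The data flow described above (each mapper sees only its predecessor's output) can be illustrated without the Hadoop API at all; this is not ChainMapper itself, just a plain-Java sketch of the same piping idea using function composition:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChainDemo {
    // Two "mappers"; the second only ever sees the first's output,
    // mirroring how records flow through a chained map pipeline.
    static final Function<String, String> LOWERCASE = String::toLowerCase;
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> CHAIN = LOWERCASE.andThen(TRIM);

    public static void main(String[] args) {
        List<String> out = Arrays.asList("  FOO ", " Bar").stream()
                .map(CHAIN)
                .collect(Collectors.toList());
        System.out.println(out); // prints [foo, bar]
    }
}
```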

Re: Scaling out/up or a mix

2009-06-28 Thread Marcus Herou
Hi. The crawlers are _very_ threaded, but no, we use our own threading framework since it was not available at the time in hadoop-core. Crawlers normally just wait a lot on clients, using very little CPU but consuming some memory due to the parallelism. //Marcus On Sat, Jun 27, 2009 at 6:10

Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi. Running without a jobtracker makes the job start almost instantly. I think it is due to something with the classloader. I use a huge number of jar files (jobConf.set(tmpjars, jar1.jar,jar2.jar)...) which need to be loaded every time, I guess. By issuing conf.setNumTasksToExecutePerJvm(-1); will

Re: hadoop jobs take long time to setup

2009-06-28 Thread Stuart White
Although I've never done it, I believe you could manually copy your jar files out to your cluster, somewhere on Hadoop's classpath, and that would remove the need to copy them to the cluster at the start of each job. On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou
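The staging idea above can be sketched as a small script. The paths and the fake jar here are placeholders; in a real cluster you would rsync the jars to a directory already on each tasktracker's classpath (e.g. $HADOOP_HOME/lib) and restart the tasktrackers so they pick the jars up:

```shell
#!/bin/sh
# Sketch: stage job dependency jars into a shared classpath directory
# once, instead of shipping them with every job submission.
STAGING_DIR=/tmp/hadoop-lib-demo          # stand-in for $HADOOP_HOME/lib
mkdir -p build/lib "$STAGING_DIR"
touch build/lib/dep1.jar                  # stand-in for a real dependency jar
cp build/lib/*.jar "$STAGING_DIR"/        # in practice: rsync to every node
ls "$STAGING_DIR"
```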

Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
This is the way we deal with this problem, too. We put our jar files on NFS, and the attached patch makes it possible to add those jar files to the tasktracker classpath through a configuration property. Thanks, Mikhail On Sun, Jun 28, 2009 at 5:21 PM, Stuart White stuart.whi...@gmail.comwrote:

Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS since NFS can be slow as hell sometimes. But what the heck, we already have our maven2 repo on NFS, so why not :) Are you saying that this patch makes the client able to configure which extra local jar files to add as

Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread Matei Zaharia
This kind of partial read is often used by the OS to return to your application as soon as possible if trying to read more data would block, in case you can begin computing on the partial data. In some applications, it's not useful, but when you can begin computing on partial data, it allows the

Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread M. C. Srivas
On Sun, Jun 28, 2009 at 3:01 PM, Matei Zaharia ma...@cloudera.com wrote: This kind of partial read is often used by the OS to return to your application as soon as possible if trying to read more data would block, in case you can begin computing on the partial data. In some applications, it's

Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
Marcus, We currently use 0.20.0 but this patch just inserts 8 lines of code into TaskRunner.java, which could certainly be done with 0.18.3. Yes, this patch just appends additional jars to the child JVM classpath. I've never really used tmpjars myself, but if it involves uploading multiple jar

Re: hadoop jobs take long time to setup

2009-06-28 Thread Mikhail Bautin
Marcus, The code that needs to be patched is in the tasktracker, because the tasktracker is what starts the child JVM that runs user code. Thanks, Mikhail On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou marcus.he...@tailsweep.comwrote: Hi. Just to be clear. It is the jobtracker that needs the

Re: hadoop jobs take long time to setup

2009-06-28 Thread Marcus Herou
Hi. Just to be clear: is it the jobtracker that needs the patched code, or is it the tasktrackers? Kindly //Marcus On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin mbau...@gmail.com wrote: Marcus, We currently use 0.20.0 but this patch just inserts 8 lines of code into

building the eclipse plugin

2009-06-28 Thread brien colwell
hi all -- Just wondering how to build the eclipse plugin. ant binary does not seem to build it. I would like to experiment with a few changes. thanks! Brien