This is exactly the case: my threads share the same output collector. I don't create multiple instances of the output collector myself; rather, I use the one received in reduce().
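Each writer task looks roughly like this (a simplified sketch; the class name DiscWriteTask is taken from the trace below, but the body shown here is only illustrative):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;

// Illustrative sketch only: every writer task holds a reference to the single
// OutputCollector received in reduce(), so collect() ends up being called from
// several pool threads concurrently.
class DiscWriteTask implements Runnable {
    private final OutputCollector<Text, Text> output; // the one collector from reduce()
    private final Text key;
    private final Text value;

    DiscWriteTask(OutputCollector<Text, Text> output, Text key, Text value) {
        this.output = output;
        this.key = key;
        this.value = value;
    }

    public void run() {
        try {
            output.collect(key, value); // concurrent calls on the shared collector
        } catch (IOException e) {
            throw new RuntimeException("Error while collecting output to HDFS", e);
        }
    }
}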
Here is the stack trace:

Exception in thread "pool-2-thread-1" java.lang.RuntimeException: Error while collecting output to HDFS
    at com.aol.urlDB.dbfacade.DiscWriteTask.run(DiscWriteTask.java:75)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /UrlStats/_temporary/_task_200809111042_0004_r_000000_0/URL/USER_ID/USER_ID_0 for DFSClient_task_200809111042_0004_r_000000_0 on client 10.146.163.143 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:1010)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:967)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:269)
    at sun.reflect.GeneratedMethodAccessor195.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.dfs.$Proxy1.create(Unknown Source)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2189)
    at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:479)
    at org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:138)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:508)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:408)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:111)
    at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:46)
    ... 3 more
Exception in thread "pool-2-thread-2" java.lang.RuntimeException: Error while collecting output to HDFS
...
...
...

Thanks
-Ankur

Lohit wrote:
I might be wrong, but my guess is this: the exception might be coming from the underlying DFS layer. Output creates a file, and in your case there might be multiple create requests. Can your threads share the output collector?

On Sep 8, 2008, at 12:51 AM, "Goel, Ankur" <[EMAIL PROTECTED]> wrote:

Hi Folks,

            I have a setup where I am using a thread-pool
implementation (provided by the java.util.concurrent package) in the
reduce phase to do database I/O and speed up the database upload. The DB
here is MySQL. I decided to go for additional parallelism via threads because:
1. It considerably speeds up the upload while consuming fewer cluster
resources (i.e. fewer reducers are required).
2. The upload speed is not limited by the reduce task capacity of the
cluster but by the DB's ability to handle its maximum number of simultaneous
connections effectively.



Each reduce task has 2 thread pools: one that does the DB I/O, returning
its results as java.util.concurrent.FutureTasks, and another that fetches
the result from each future and does the disk I/O, i.e.
outputCollector.collect(...).
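In rough, simplified form (hypothetical class name and pool sizes, old mapred API; pool shutdown in close() and most error handling omitted), the reducer looks something like this:

import java.io.IOException;
import java.util.Iterator;
import java.util.concurrent.*;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch of the two-pool setup described above.
public class UploadReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    // Pool 1: DB I/O against MySQL; each task hands back its result via a Future.
    private final ExecutorService dbPool = Executors.newFixedThreadPool(8);
    // Pool 2: waits on the Future and writes the result out via the collector.
    private final ExecutorService writerPool = Executors.newFixedThreadPool(4);

    public void reduce(Text key, Iterator<Text> values,
                       final OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        final Text k = new Text(key); // copy, since Hadoop reuses the key object
        while (values.hasNext()) {
            final Text v = new Text(values.next());

            // Pool 1: upload the record to the DB and return a result.
            final Future<Text> dbResult = dbPool.submit(new Callable<Text>() {
                public Text call() throws Exception {
                    return uploadToDb(k, v); // hypothetical JDBC work
                }
            });

            // Pool 2: block on the DB result, then write it to HDFS through
            // the single shared OutputCollector.
            writerPool.submit(new Runnable() {
                public void run() {
                    try {
                        output.collect(k, dbResult.get());
                    } catch (Exception e) {
                        throw new RuntimeException("Error while collecting output to HDFS", e);
                    }
                }
            });
        }
    }

    private Text uploadToDb(Text key, Text value) {
        // ... JDBC work against MySQL, omitted ...
        return value;
    }
}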




When multiple threads from the second pool try to do disk I/O, I get
an AlreadyBeingCreatedException in the logs. If I set the second pool to
have only 1 thread, then things work fine!



It looks like the output collector was never meant to be used from
multiple threads.
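If that is the case, the only workaround I can see, short of going back to a single writer thread, would be to serialize the collect() calls myself, e.g. with a thin wrapper like this (untested sketch):

import java.io.IOException;

import org.apache.hadoop.mapred.OutputCollector;

// Untested sketch: serialize collect() calls so only one thread at a time
// reaches the underlying (apparently non-thread-safe) collector.
class SynchronizedCollector<K, V> implements OutputCollector<K, V> {
    private final OutputCollector<K, V> delegate;

    SynchronizedCollector(OutputCollector<K, V> delegate) {
        this.delegate = delegate;
    }

    public synchronized void collect(K key, V value) throws IOException {
        delegate.collect(key, value);
    }
}

That of course gives up the parallelism on the HDFS side, which is effectively the same as running the second pool with one thread.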



Any thoughts on this?



Thanks

-Ankur




