Hi! We're currently doing some work on Facebook's Scribe server for writing to HDFS. We would like to set up Scribe in such a way that when HDFS becomes unavailable, it logs to local disk and replays the log to HDFS when the HDFS cluster becomes available again. Currently, Scribe uses libhdfs for writing to HDFS. However, it seems that neither libhdfs nor the Java client exposes much functionality to query the state of the HDFS client.
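To make the desired failover concrete, here is a minimal C sketch of what we'd like Scribe's write path to look like, using the real libhdfs calls hdfsWrite() and hdfsFlush(). The function name, spool path, and replay step are hypothetical, and note that today this doesn't actually work, because hdfsWrite() keeps reporting success while the Java client retries internally:

    #include <stdio.h>
    #include "hdfs.h"  /* libhdfs header shipped with Hadoop */

    /* Write one log message to HDFS; on failure, spool it to local disk
     * for later replay. write_with_failover and the spool path are made
     * up here, just to illustrate the failover we're after. */
    static int write_with_failover(hdfsFS fs, hdfsFile out,
                                   const char *msg, tSize len)
    {
        /* This is the behavior we're asking for: hdfsWrite() returning -1
         * promptly when the cluster is unreachable, instead of the client
         * buffering the bytes and retrying internally for 15 minutes. */
        if (hdfsWrite(fs, out, msg, len) == len && hdfsFlush(fs, out) == 0)
            return 0;  /* the write reached HDFS */

        /* Failover: append to a local spool file so the message isn't
         * lost; a separate step would replay it once HDFS is back. */
        FILE *spool = fopen("/var/spool/scribe/hdfs_backlog.log", "a");
        if (spool == NULL)
            return -1;  /* even local disk failed */
        fwrite(msg, 1, (size_t)len, spool);
        fclose(spool);
        return 1;  /* spooled locally, pending replay */
    }

With hdfsWrite() reporting failure promptly like this, the failover logic on the Scribe side stays trivial.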
Digging into the code gives us the following. In Client.java (line 307):

    } catch (SocketTimeoutException toe) {
      /* The max number of retries is 45,
       * which amounts to 20s*45 = 15 minutes retries. */
      handleConnectionFailure(timeoutFailures++, 45, toe);
    }

The Java client just retries for 15 minutes (45 retries at 20s each) and then throws an exception. There is some code in hdfs.c that tries to catch an exception from the Java client, at line 1005:

    if (invokeMethod(env, NULL, &jExc, INSTANCE, jOutputStream,
                     HADOOP_OSTRM, "write", "([B)V", jbWarray) != 0) {
        errno = errnoFromException(jExc, env,
                                   "org.apache.hadoop.fs."
                                   "FSDataOutputStream::write");
        length = -1;
    }

However, for us 15 minutes is a long time. Normally libhdfs returns the number of bytes passed in as the length, so during those retries our log messages sit buffered somewhere in memory while Scribe assumes the write went through.

What I would like is for the Java client to set a status when it can't write to HDFS, for libhdfs to be able to read that status, and for libhdfs to return -1 when the write didn't succeed. Scribe would then be able to fail over to local disk.

Is this something we could implement, or am I missing something? I understand that a lot of other use cases might break with the behavior I'm suggesting. Can anyone enlighten me?

Regards,

Wouter