Once you close it, the HDFS daemons own the file and make sure it's copied around. Allowing reopens at that point would make that distribution control that much more complex: asynchronous processes would have to agree that the old file is now longer.

Another thing to keep in mind is that HDFS block sizes are in the megabytes -- 64 MB and 128 MB are common. An HDFS file should be designed to be maybe 90% of that size when you write it and close it.
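
For example, something along these lines (just a sketch -- the path is made up, and the data is a stand-in) writes one chunk up to roughly 90% of the default block size and then closes it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long target = (long) (fs.getDefaultBlockSize() * 0.9); // aim for ~90% of a block

        // Made-up path; in practice you'd use a timestamp or sequence number per chunk.
        Path chunk = new Path("/logs/incoming/chunk-" + System.currentTimeMillis() + ".log");
        FSDataOutputStream out = fs.create(chunk, false); // fail if it already exists
        try {
            while (out.getPos() < target) {
                out.write("one log line\n".getBytes("UTF-8")); // stand-in for real records
            }
        } finally {
            out.close(); // the file only becomes visible to readers once it is closed
        }
    }
}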

Chittaranjan Hota wrote:
Hi Steve,
Thanks for the inputs.
I had understood by now that the files are "immutable"; I just wanted to confirm. However, I am a little confused as to what role the "append" methods play.

I am now going to explore and see how it works out when I keep a stream open and write data to it and close on an interval basis.
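
Roughly what I have in mind (just a sketch -- the path and the 5 minute interval are placeholders):

import java.util.Timer;
import java.util.TimerTask;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private final FileSystem fs;
    private FSDataOutputStream current;

    public RollingHdfsWriter(FileSystem fs) throws Exception {
        this.fs = fs;
        roll();
    }

    // Close the current file (making it visible in HDFS) and open a new one.
    public synchronized void roll() throws Exception {
        if (current != null) {
            current.close();
        }
        current = fs.create(
            new Path("/logs/stream-" + System.currentTimeMillis() + ".log"), false);
    }

    public synchronized void write(String line) throws Exception {
        current.write((line + "\n").getBytes("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        final RollingHdfsWriter writer =
            new RollingHdfsWriter(FileSystem.get(new Configuration()));

        // Roll every 5 minutes; a non-daemon Timer keeps the JVM alive.
        new Timer("hdfs-roller", false).schedule(new TimerTask() {
            public void run() {
                try { writer.roll(); } catch (Exception e) { e.printStackTrace(); }
            }
        }, 5 * 60 * 1000L, 5 * 60 * 1000L);

        // ... hook the JMS listener up to writer.write(message) here ...
    }
}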

Thanks again.

Regards,
Chitta
MobileMe Reporting
Ext: 21294
Direct: 408-862-1294

On Sep 17, 2010, at 2:57 PM, Steve Hoffman wrote:

This is a "feature" of HDFS.  Files are immutable.
You have to create a new file.  The file you are writing to isn't
available in hdfs until you close it.
Usually you'll have something buffering pieces and writing to hdfs.
Then you can roll those smaller files into larger chunks using a
nightly map-reduce job or something else.
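
For the rolling step, one simple non-map-reduce option is FileUtil.copyMerge,
which concatenates everything in a source directory into a single target file.
Rough sketch with made-up paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class NightlyRoll {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Made-up layout: small chunk files land under a per-day directory,
        // and the nightly job concatenates them into one big file.
        Path smallFiles = new Path("/logs/incoming/2010-09-17");
        Path merged = new Path("/logs/merged/2010-09-17.log");

        // deleteSource=true removes the small files once the merge succeeds;
        // the last argument is an optional string written after each source file.
        FileUtil.copyMerge(fs, smallFiles, fs, merged, true, conf, null);
    }
}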

You might want to look at the Flume project from Cloudera (there are
others as well) and just use log4j to write to local disk.  Then use
Flume agents to send to a collector (or collectors) which writes to
HDFS on an interval or other criteria.  Facebook's Scribe and Apache
Chukwa are also contenders for these tasks.

Log collection seems to be a common use of hadoop these days.
If you google it, you'll find plenty of stuff.
Also (shameless plug for a presentation I just gave on this topic):
http://bit.ly/hoffmanchug20100915

Hope this helps!
Steve

On Fri, Sep 17, 2010 at 1:43 PM, Chittaranjan Hota <[email protected]> wrote:
Hello,
I am new to Hadoop and to this forum.

Existing setup:
Basically we have an existing setup where data is collected from a JMS queue
and written to hard disk without Hadoop.  Typical I/O using log4j.

Problem Statement:
Now, instead of writing it to hard disk, I would like to stream it to HDFS. I know that's possible using the "FileSystem" class and its create method; I did a
small POC on that as well.
However, I am not able to append to the created files.

It throws the exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Append to hdfs not supported. Please refer to dfs.support.append configuration parameter.
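
The call that triggers it is essentially this (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendTest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // File was created and closed earlier; path is a placeholder.
        Path existing = new Path("/logs/app.log");

        // This is the call that throws the RemoteException on my cluster;
        // the message points at the dfs.support.append setting on the cluster side.
        FSDataOutputStream out = fs.append(existing);
        out.write("another line\n".getBytes("UTF-8"));
        out.close();
    }
}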


I am looking for any pointers/suggestions to resolve this.
Please let me know if you need any further information.

Thanks in advance.

Regards,
Chitta
MobileMe Reporting
Ext: 21294
Direct: 408-862-1294


