Once you close it, the HDFS daemons own the file and make sure it's copied around. Allowing reopens at that point would make that distribution control that much more complex: asynchronous processes would have to agree that the old file is now longer.

Another thing to keep in mind is that HDFS block sizes are in the megabytes -- 64 MB and 128 MB are common. An HDFS file should be designed to be maybe 90% of that size when you write it and close it.
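
For example, something along these lines (just a sketch -- the path is made up, and the data is a stand-in) writes one chunk up to roughly 90% of the default block size and then closes it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkWriter {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long target = (long) (fs.getDefaultBlockSize() * 0.9); // aim for ~90% of a block

        // Made-up path; in practice you'd use a timestamp or sequence number per chunk.
        Path chunk = new Path("/logs/incoming/chunk-" + System.currentTimeMillis() + ".log");
        FSDataOutputStream out = fs.create(chunk, false); // fail if it already exists
        try {
            while (out.getPos() < target) {
                out.write("one log line\n".getBytes("UTF-8")); // stand-in for real records
            }
        } finally {
            out.close(); // the file only becomes visible to readers once it is closed
        }
    }
}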

Chittaranjan Hota wrote:
Hi Steve,
Thanks for the inputs.
I had understood by now that the files are "immutable"; I just wanted to confirm. However, I am a little confused as to what role the "append" methods play.

I am now going to explore and see how it works out when I keep a stream open and write data to it and close on an interval basis.
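
Roughly what I have in mind (just a sketch -- the path and the 5 minute interval are placeholders):

import java.util.Timer;
import java.util.TimerTask;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private final FileSystem fs;
    private FSDataOutputStream current;

    public RollingHdfsWriter(FileSystem fs) throws Exception {
        this.fs = fs;
        roll();
    }

    // Close the current file (making it visible in HDFS) and open a new one.
    public synchronized void roll() throws Exception {
        if (current != null) {
            current.close();
        }
        current = fs.create(
            new Path("/logs/stream-" + System.currentTimeMillis() + ".log"), false);
    }

    public synchronized void write(String line) throws Exception {
        current.write((line + "\n").getBytes("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        final RollingHdfsWriter writer =
            new RollingHdfsWriter(FileSystem.get(new Configuration()));

        // Roll every 5 minutes; a non-daemon Timer keeps the JVM alive.
        new Timer("hdfs-roller", false).schedule(new TimerTask() {
            public void run() {
                try { writer.roll(); } catch (Exception e) { e.printStackTrace(); }
            }
        }, 5 * 60 * 1000L, 5 * 60 * 1000L);

        // ... hook the JMS listener up to writer.write(message) here ...
    }
}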

Thanks again.

Regards,
Chitta
MobileMe Reporting
Ext: 21294
Direct: 408-862-1294

On Sep 17, 2010, at 2:57 PM, Steve Hoffman wrote:

This is a "feature" of HDFS.  Files are immutable.
You have to create a new file.  The file you are writing to isn't
available in hdfs until you close it.
Usually you'll have something buffering pieces and writing to hdfs.
Then you can roll those smaller files into larger chunks using a
nightly map-reduce job or something else.
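
For the rolling step, one simple non-map-reduce option is FileUtil.copyMerge,
which concatenates everything in a source directory into a single target file.
Rough sketch with made-up paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class NightlyRoll {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Made-up layout: small chunk files land under a per-day directory,
        // and the nightly job concatenates them into one big file.
        Path smallFiles = new Path("/logs/incoming/2010-09-17");
        Path merged = new Path("/logs/merged/2010-09-17.log");

        // deleteSource=true removes the small files once the merge succeeds;
        // the last argument is an optional string written after each source file.
        FileUtil.copyMerge(fs, smallFiles, fs, merged, true, conf, null);
    }
}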

You might want to look at the Flume project from Cloudera (there are
others as well) and just use log4j to write to local disk.  Then use
Flume agents to send to a collector (or collectors) which writes to
HDFS on an interval or other criteria.  Facebook's Scribe and Apache
Chukwa are also contenders for these tasks.

Log collection seems to be a common use of hadoop these days.
If you google it, you'll find plenty of stuff.
Also (shameless plug for a presentation I just gave on this topic):
http://bit.ly/hoffmanchug20100915

Hope this helps!
Steve

On Fri, Sep 17, 2010 at 1:43 PM, Chittaranjan Hota <[email protected]> wrote:
Hello,
I am new to Hadoop and to this forum.

Existing setup:
Basically we have an existing setup where data is collected from a JMS queue
and written to hard disk without Hadoop.  Typical I/O using log4j.

Problem Statement:
Now, instead of writing it to hard disk, I would like to stream it to HDFS. I know that's possible using the "FileSystem" class and its create method; I did a
small POC on that as well.
However, I am not able to append to the created files.

It throws the exception:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Append to hdfs not supported. Please refer to dfs.support.append configuration parameter.
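
The call that triggers it is essentially this (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendTest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // File was created and closed earlier; path is a placeholder.
        Path existing = new Path("/logs/app.log");

        // This is the call that throws the RemoteException on my cluster;
        // the message points at the dfs.support.append setting on the cluster side.
        FSDataOutputStream out = fs.append(existing);
        out.write("another line\n".getBytes("UTF-8"));
        out.close();
    }
}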


I am looking for any pointers/suggestions to resolve this.
Please let me know if you need any further information.

Thanks in advance.

Regards,
Chitta
MobileMe Reporting
Ext: 21294
Direct: 408-862-1294


