...system. Thus we have to rotate logfiles at a greater frequency than we'd like to checkpoint the data into HDFS. The system certainly isn't perfect, but bulk-loading the data into HDFS was proving rather slow. I'd be curious to hear actual performance numbers and methodologies for bulk loads. I'll try to dig some up myself on Monday.
This request isn't so much about loading data into HDFS, but we really need the ability to create a file that supports atomic appends for the HBase redo log. Since HDFS files currently don't exist until they are closed, the best we can do right now is close the current redo log and open a new one.
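To make that concrete, here is a minimal sketch of the close-and-roll workaround described above. The class, directory layout, and naming scheme are hypothetical; the FileSystem calls are the standard org.apache.hadoop.fs API. Because an HDFS file only becomes visible once it is closed, making records durable means closing the live log and starting a fresh one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingRedoLog {
  private final FileSystem fs;
  private final Path dir;
  private FSDataOutputStream out;
  private long seq = 0;

  public RollingRedoLog(Configuration conf, Path dir) throws Exception {
    this.fs = FileSystem.get(conf);
    this.dir = dir;
    roll();
  }

  // Closing is what makes the previous log file exist in HDFS; records
  // still buffered in an unclosed file are lost if the writer crashes.
  public synchronized void roll() throws Exception {
    if (out != null) {
      out.close();
    }
    out = fs.create(new Path(dir, "redo." + (seq++)));
  }

  public synchronized void append(byte[] record) throws Exception {
    out.write(record);
  }
}

The tradeoff is exactly the one described above: roll often and you pay the per-file overhead; roll rarely and you widen the window of updates a crash can lose.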
On 8/2/07, Dennis Kubes [EMAIL PROTECTED] wrote:
You can copy data from any node [...]
The Hadoop Aggregate package (o.a.h.mapred.lib.aggregate) is a good fit for your aggregation problem.

Runping
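For anyone who hasn't used the package: roughly, the only user code it needs is a descriptor that turns each input record into (aggregator-type, id, value) entries. A sketch follows; the class name and tab-separated log layout are made up, and it assumes the descriptor API as it stands in the 0.1x releases:

import java.util.ArrayList;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorBaseDescriptor;

// Hypothetical descriptor: counts total records and per-page hits in
// tab-separated logs. The "LongValueSum" aggregator type on each generated
// key tells the framework to sum the values in the combiner/reducer.
public class LogCountDescriptor extends ValueAggregatorBaseDescriptor {

  private static final Text ONE = new Text("1");

  public ArrayList<Entry<Text, Text>> generateKeyValPairs(Object key, Object val) {
    ArrayList<Entry<Text, Text>> pairs = new ArrayList<Entry<Text, Text>>();
    String[] fields = val.toString().split("\t");   // assumed log layout
    pairs.add(generateEntry(LONG_VALUE_SUM, "total_records", ONE));
    pairs.add(generateEntry(LONG_VALUE_SUM, "page:" + fields[0], ONE));
    return pairs;
  }
}

The descriptor class is then named in the job configuration and the job itself is assembled by ValueAggregatorJob, so no hand-written reducer is needed.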
-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 07, 2007 12:09 PM
To: hadoop-user@lucene.apache.org
Subject: Re: Loading data into HDFS
Am I missing something very fundamental? Can someone comment on these queries?
Thanks,
Venkates P B
On 8/1/07, Venkates .P.B. [EMAIL PROTECTED] wrote:
Few queries regarding the way data is loaded into HDFS.
-Is it a common practice to load the data into HDFS only through the master node [...]
thanks,
DT
www.ejinz.com
Search News
----- Original Message -----
From: Venkates .P.B. [EMAIL PROTECTED]
To: hadoop-user@lucene.apache.org
Sent: Friday, August 03, 2007 1:41 AM
Subject: Re: Loading data into HDFS
Am I missing something very fundamental? Can someone comment on these queries?
You can copy data from any node, so if you can do it from multiple nodes your performance would be better (although be sure not to overlap files). The master node is updated once a block has been copied its replication number of times. So if the default replication is 3, then the 3 replicas must [...]
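In other words, a loader like the following can run on every node at once. The class name and paths are hypothetical; FileSystem.get() finds the namenode from the cluster config on any machine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalLogLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Any node with the cluster config can write; the namenode only hands
    // out block locations, the bytes stream straight to the datanodes.
    FileSystem fs = FileSystem.get(conf);
    Path localLogs = new Path(args[0]);  // e.g. /var/log/myapp on this node
    Path hdfsDir = new Path(args[1]);    // e.g. a per-node dir, to avoid overlapping files
    fs.copyFromLocalFile(localLogs, hdfsDir);
  }
}

Running one of these per slave spreads the write load across the cluster instead of bottlenecking a single machine.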
Few queries regarding the way data is loaded into HDFS.
-Is it a common practice to load the data into HDFS only through the master node? We are able to copy only around 35 logs (64K each) per minute in a 2-slave configuration.
-We are concerned about the time it would take to update filenames and [...]