Hi,
I have a pseudo-distributed Hadoop setup, and I'm currently hoping to put
about 100 gigs of files on it to play around with. I've got a Unix box at
work that no one else is using for this, and running df -h on it, I get:
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.4G  5.2G  31% /
none                  3.8G     0  3.8G   0% /dev/shm
/dev/sdb              414G  210M  393G   1% /mnt
Alright, so /mnt looks quite big and seems like a good place to store my
HDFS files. I go ahead and create a folder named hadoop-data there and
set the following in hdfs-site.xml:
<property>
  <!-- where hadoop stores its files (datanodes only) -->
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop-data</value>
</property>
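For what it's worth, here's roughly how I sanity-checked the new directory
before restarting (the exact commands are from memory, but nothing looked off):

# create the directory and check ownership/space before pointing HDFS at it
mkdir /mnt/hadoop-data
ls -ld /mnt/hadoop-data    # owned by the user that runs the Hadoop daemons
df -h /mnt                 # plenty of free space, as shown above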
After a bit of troubleshooting, I restart the cluster and try to put a
couple of test files onto HDFS. Doing an ls of hadoop-data, I see:
$ ls
current image in_use.lock previous.checkpoint
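(For reference, the test put itself was just something along these lines;
test.txt is a stand-in name for one of my small files:)

# quick smoke test: make a target dir, push one small file, list it
hadoop fs -mkdir /data
hadoop fs -put ~/hadoop_playground/test.txt /data/
hadoop fs -ls /data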
OK, things look good. Time to try uploading some real data. Now, here's
where the problem arises. If I add a 10 MB dummy file to hadoop-data
directly through the regular Unix filesystem and run df -h, the used space
on /mnt goes up by exactly 10 MB. But when I start pushing a big dump of
data through with:
hadoop fs -put ~/hadoop_playground/data2/data2/ /data/
I notice from df -h that the data seems to be landing in completely the
wrong place! Note that below, only the usage of /dev/sda1 has
increased; /mnt hasn't budged at all.
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  3.4G  4.2G  45% /
none                  3.8G     0  3.8G   0% /dev/shm
/dev/sdb              414G  210M  393G   1% /mnt
So, what gives? Does anyone have a clue how my files can seemingly end up
in the hadoop-data folder yet take up space somewhere else? I could see
this being a plain Unix issue rather than a Hadoop one, but I figured I'd
ask here just in case, since I'm pretty stumped.
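In case it helps narrow things down, the next thing I was going to try is
hunting for the actual block files on disk, roughly like this (the /tmp
path is only my guess at where the default hadoop.tmp.dir points, not
something I've verified):

# look for HDFS block files anywhere on the box (run as root)
sudo find / -name 'blk_*' 2>/dev/null | head
du -sh /tmp/hadoop-* 2>/dev/null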
Cheers,
Eli