Hi

Thanks for your reply.

Well, I am not sure about the speed of the connection to HDFS. The job that unzips from a "normal" file to HDFS will run on one of the machines participating in the HDFS cluster, so I guess at least access to the local part of HDFS will be fast. But of course this will not help much, because the data needs to be replicated to (2) other nodes. I expect the connection among the HDFS nodes to be faster than 1 Gbps (10 or 100 Gbps). The zip file will actually live on a machine in the DMZ, remote to all the HDFS nodes. The HDFS nodes will have a mount to that machine (probably an sshfs mount) and will access the zip file over that mount. I also expect the connection between the HDFS nodes and the machine in the DMZ to be faster than 1 Gbps (10 or 100 Gbps). But basically I really don't know yet about the speed of the different connections mentioned.

I am looking for the fastest way to do it. Of course I can use ZipFile etc. from the JDK to unzip and write the unzipped data to HDFS files, but if there is a more "direct" I/O way, I would prefer that. So basically this is a question of whether a "smarter method" exists. Whether such a method would actually make the unzip process faster will of course depend on whether the non-"direct" Java ZipFile approach is a bigger bottleneck than the network bandwidth (among the HDFS nodes, or between the HDFS nodes and the DMZ).
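
For reference, the plain JDK approach mentioned above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: it streams the archive with java.util.zip.ZipInputStream instead of ZipFile (so the zip is read once, sequentially, which suits a network mount), and the names StreamingUnzip and EntrySink are hypothetical, made up for illustration. The destination is kept behind a small interface so the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamingUnzip {

    /** Hypothetical hook: opens the destination stream for one zip entry. */
    public interface EntrySink {
        OutputStream open(String entryName) throws IOException;
    }

    /**
     * Reads the zip sequentially and copies every file entry to the sink.
     * Returns the number of files written. Directory entries are skipped.
     * Note: entry names come from the archive and are not validated here;
     * a real sink should reject names containing ".." before creating files.
     */
    public static int unzip(InputStream zipSource, EntrySink sink) throws IOException {
        byte[] buffer = new byte[64 * 1024]; // buffer size is an arbitrary choice
        int files = 0;
        try (ZipInputStream zin = new ZipInputStream(zipSource)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                try (OutputStream out = sink.open(entry.getName())) {
                    int n;
                    // read() returns -1 at the end of the current entry
                    while ((n = zin.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
                files++;
            }
        }
        return files;
    }
}
```

On a cluster, the sink would presumably be something like `name -> fs.create(new Path(targetDir, name))`, where fs is an org.apache.hadoop.fs.FileSystem and targetDir is a placeholder for the destination folder. That wiring is an assumption and is not exercised by the sketch above.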

Any additional comments are very welcome.

Stephan Gammeter wrote:
Hey Per,

Your performance will most likely be limited by your connection to HDFS and by replication. If you are connected via a 1 Gbps LAN and have 3-fold replication, then you can write at most 1/3 Gbps to HDFS. (Note: if you write very many small HDFS files, then of course everything will be horribly slow anyway.) I had to do something like this once (writing the files in a tar archive to a sequence file), and Java was never the bottleneck. Or do you have a massively faster connection to HDFS?
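
The rule of thumb above can be turned into a quick back-of-the-envelope estimate. The sketch below takes it at face value, i.e. it assumes that every replica crosses the same link, which is a simplification (HDFS forwards replicas from datanode to datanode, so real numbers may differ). The class and method names are hypothetical.

```java
public class ReplicationEstimate {

    /**
     * Effective write rate in MB/s under the simplified assumption that
     * all replicas share one link: link speed divided by replication factor.
     */
    public static double effectiveMegabytesPerSecond(double linkGbps, int replication) {
        double linkMegabitsPerSecond = linkGbps * 1000.0; // 1 Gbps = 1000 Mbit/s
        return linkMegabitsPerSecond / replication / 8.0;  // 8 bits per byte
    }

    /** Estimated seconds to write the given amount of unzipped data. */
    public static double secondsToWrite(double dataMegabytes, double linkGbps, int replication) {
        return dataMegabytes / effectiveMegabytesPerSecond(linkGbps, replication);
    }

    public static void main(String[] args) {
        System.out.printf("1 Gbps, 3 replicas: %.1f MB/s%n",
                effectiveMegabytesPerSecond(1.0, 3));
    }
}
```

For comparison, a single JVM thread inflating a zip stream would need to be slower than this rate before it became the limiting factor.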

best,
Stephan

On 30.08.2011 10:19, Per Steffensen wrote:
Hi

I want to unzip a file that lives on an external filesystem (external to HDFS) into HDFS, so that the unzipped files end up in some folder on HDFS. This needs to be as efficient as possible, so if it is done in Java code, for example, it probably needs to involve java.nio.channels or something else that works directly with I/O resources. Can anyone point me to the best/easiest/most efficient way to do this? I would like to at least be able to invoke/initiate the unzip process from Java code, but I guess I can invoke anything from Java, so that is not much of a requirement.

Regards, Per Steffensen



