Hey Dilli,
Yes, we struggle with this question all the time as well. Here is
our current thinking.
1. We aren't positioning the gateway as the import/export mechanism for
   Hadoop. The target consumer is really the data scientist who will
   be submitting jobs and retrieving results. That being said, we
   haven't thought nearly enough about the relationship between the
   gateway and things like Sqoop.
2. This probably goes without saying, but the gateway is a clustered
   service and can be scaled (within reason) to accommodate the load.
   That being said, it doesn't make any sense if you end up needing as
   many gateway instances as data nodes.
3. There is a common, unexpected (at least for me) usage pattern in
   the field that results from a strong desire to have the Hadoop
   cluster firewalled away from the enterprise. This results in what
   is typically called a "gateway machine" in the DMZ. Data workers
   will typically scp their files to this machine, ssh to it, and
   then run Hadoop CLI commands from there. Note that this "gateway
   machine" has Hadoop installed and configured, including any
   Kerberos config if required. In some sense the gateway is intended
   to allow the data worker to stay on their local desktop and
   perform the same operations. So from my perspective the bottleneck
   issue isn't really different, and if we do our jobs well it will
   be easier to solve with the gateway, because it should be easier
   to deploy a gateway instance than to set up one of these "gateway
   machines".
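For intuition on point 2, here is a rough capacity sketch. Every number
in it is a hypothetical assumption for illustration (upload window, NIC
speed), not a measurement of the Knox gateway:

```python
import math

# Back-of-envelope sizing of gateway instances for Dilli's scenario.
# All figures are assumed for illustration, not measured.
clients = 1000
file_gb = 1.0            # each client uploads one 1 GB file
window_s = 600           # assume all uploads land within a 10-minute window
per_instance_gbps = 1.25 # assume ~10 GbE line rate per instance, in GB/s

# Offered load spread over the window, in GB/s.
aggregate_gbps = clients * file_gb / window_s
instances = math.ceil(aggregate_gbps / per_instance_gbps)
print(instances)  # 2 instances under these assumptions
```

Under these (made-up) numbers the answer is a couple of instances, not
one per datanode, which is the "within reason" scaling point 2 is making.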
Note: Please subscribe to the Apache Knox mailing lists if you haven't
already and we should have these discussions there.
Kevin.
On 4/12/13 8:59 PM, Dilli Arumugam wrote:
First my understanding of the current behavior:
Gateway behavior:
When a client wants to upload a file using the gateway, all the file
content has to go through the gateway.
If 1000 clients each upload a 1 GB file, 1 TB has to go through the
gateway.
WebHDFS Behavior:
When a client wants to upload a file using WebHDFS, it gets
redirected to a datanode.
The file content goes to the DataNode directly without passing through
the NameNode.
If 1000 clients each upload a 1 GB file, that 1 TB could be spread
across up to 1000 different datanodes, with each datanode receiving 1 GB.
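The two-step redirect described above can be sketched as follows. The
host names, port, and file path are made up for illustration (50070 is
the classic NameNode HTTP port; actual requests would of course need a
live cluster, so only the URL construction is shown here):

```python
from urllib.parse import urlencode

def webhdfs_create_url(namenode, path, **params):
    """Build the step-1 CREATE URL that the client PUTs to the NameNode."""
    query = urlencode({"op": "CREATE", **params})
    return f"http://{namenode}/webhdfs/v1{path}?{query}"

# Step 1: PUT this URL with no body and redirects disabled. The NameNode
# answers 307 Temporary Redirect, and the Location header names a
# DataNode, e.g. http://datanode07:1022/webhdfs/v1/...?op=CREATE&...
# Step 2: PUT the actual file bytes to that Location URL. The content
# therefore flows client -> DataNode and never through the NameNode.
url = webhdfs_create_url("namenode.example.com:50070",
                         "/user/alice/data.csv", overwrite="true")
print(url)
```

A gateway that proxies both steps sits in the data path for the file
bytes, which is exactly the bottleneck concern below.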
If my understanding is right, the gateway could become a bottleneck
during file uploads and reads.
Am I misunderstanding things?
Thanks
Dilli