Hey Dilli,
Yes, we struggle with this question all the time as well. Here is
our current thinking.
1. We aren't positioning the gateway as the import/export mechanism for
   Hadoop. The target consumer is really the data scientist who will
   be submitting jobs and retrieving results. That being said, we
   haven't thought nearly enough about the relationship between the
   gateway and things like Sqoop.
2. This probably goes without saying, but the gateway is a clustered
   service and can be scaled (within reason) to accommodate the load.
   That being said, it doesn't make any sense if you end up needing as
   many gateway instances as data nodes.
3. There is a common, unexpected (at least for me) usage pattern in
   the field that results from a strong desire to have the Hadoop
   cluster firewalled away from the enterprise. This results in what
   is typically called a "gateway machine" in the DMZ. Data workers
   will typically scp their files to this machine, ssh to it, and
   then run Hadoop CLI commands from there. Note that this "gateway
   machine" has Hadoop installed and configured, including any
   Kerberos config if required. In some sense the gateway is intended
   to allow the data worker to stay on their local desktop and
   perform the same operations. So from my perspective the bottleneck
   issue isn't really different, and if we do our jobs well it will
   be easier to solve with the gateway, because it should be easier
   to deploy a gateway instance than to set up one of these "gateway
   machines".
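For intuition on point 2, here is a rough capacity sketch. Every number
in it is a hypothetical assumption for illustration (upload window, NIC
speed), not a measurement of the Knox gateway:

```python
import math

# Back-of-envelope sizing of gateway instances for Dilli's scenario.
# All figures are assumed for illustration, not measured.
clients = 1000
file_gb = 1.0            # each client uploads one 1 GB file
window_s = 600           # assume all uploads land within a 10-minute window
per_instance_gbps = 1.25 # assume ~10 GbE line rate per instance, in GB/s

# Offered load spread over the window, in GB/s.
aggregate_gbps = clients * file_gb / window_s
instances = math.ceil(aggregate_gbps / per_instance_gbps)
print(instances)  # 2 instances under these assumptions
```

Under these (made-up) numbers the answer is a couple of instances, not
one per datanode, which is the "within reason" scaling point 2 is making.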
Note: Please subscribe to the Apache Knox mailing lists if you haven't
already and we should have these discussions there.
Kevin.
On 4/12/13 8:59 PM, Dilli Arumugam wrote:
First my understanding of the current behavior:
Gateway behavior:
When a client wants to upload a file using the gateway, all the file
content has to go through the gateway.
If 1000 clients each upload a 1 GB file, 1 TB has to go through the
gateway.
WebHDFS Behavior:
When a client wants to upload a file using WebHDFS, it gets
redirected to a datanode.
The file content goes to the DataNode directly without passing through
the NameNode.
If 1000 clients each upload a 1 GB file, that 1 TB could be spread
across up to 1000 different datanodes, with each datanode receiving 1 GB.
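The two-step redirect described above can be sketched as follows. The
host names, port, and file path are made up for illustration (50070 is
the classic NameNode HTTP port; actual requests would of course need a
live cluster, so only the URL construction is shown here):

```python
from urllib.parse import urlencode

def webhdfs_create_url(namenode, path, **params):
    """Build the step-1 CREATE URL that the client PUTs to the NameNode."""
    query = urlencode({"op": "CREATE", **params})
    return f"http://{namenode}/webhdfs/v1{path}?{query}"

# Step 1: PUT this URL with no body and redirects disabled. The NameNode
# answers 307 Temporary Redirect, and the Location header names a
# DataNode, e.g. http://datanode07:1022/webhdfs/v1/...?op=CREATE&...
# Step 2: PUT the actual file bytes to that Location URL. The content
# therefore flows client -> DataNode and never through the NameNode.
url = webhdfs_create_url("namenode.example.com:50070",
                         "/user/alice/data.csv", overwrite="true")
print(url)
```

A gateway that proxies both steps sits in the data path for the file
bytes, which is exactly the bottleneck concern below.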
If my understanding is right, the gateway could become a bottleneck
during file uploads and reads.
Am I misunderstanding things?
Thanks
Dilli