Hi Kevin,

I subscribed to [email protected] a few days back. I was not sure whether my question should be raised on the Hortonworks list or in public, hence I posted it on the internal list. If posting on the public list adds value to Hortonworks, I could repost it on [email protected]. Shall I repost it there?

Thanks
Dilli
On Sat, Apr 13, 2013 at 8:13 AM, Kevin Minder <[email protected]> wrote:

> Hey Dilli,
> Yes, we struggle with thinking about this all the time as well. Here is our current thinking.
>
> 1. We aren't positioning the gateway as the import/export mechanism for Hadoop. The target consumer is really the data scientist who will be submitting jobs and retrieving results. That being said, we haven't thought nearly enough about the relationship between the gateway and things like Sqoop.
> 2. This probably goes without saying, but the gateway is a clustered service and can be scaled (within reason) to accommodate the load. That being said, it doesn't make any sense if you end up needing as many gateway instances as data nodes.
> 3. There is a common, unexpected (at least for me) usage pattern in the field that results from a strong desire to have the Hadoop cluster fire-walled away from the enterprise. This results in what is typically called a "gateway machine" in the DMZ. Data workers will typically scp their files to this machine and then ssh to it, and then run Hadoop CLI commands from that machine. Note that this "gateway machine" has Hadoop installed and configured, including any Kerberos config if required. In some sense the gateway is intended to allow the data worker to stay on their local desktop and perform the same operations. So from my perspective the bottleneck issue isn't really different, and if we do our jobs well it will be easier to solve with the gateway, because it should be easier to deploy a gateway instance than to set up one of the "gateway machines".
>
> Note: Please subscribe to the Apache Knox mailing lists if you haven't already, and we should have these discussions there.
> Kevin.
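As a concrete sketch of the "gateway machine" pattern Kevin describes (the host, user, and paths below are hypothetical, and the commands are built as dry-run strings rather than executed), a data worker's session usually amounts to:

```shell
# Hypothetical DMZ host and paths; a dry-run sketch of the firewalled
# "gateway machine" workflow described above.
GATEWAY=worker@gateway.example.com   # DMZ box with Hadoop and any Kerberos config
FILE=results.csv

# 1. Copy the data file from the local desktop into the DMZ.
STEP1="scp $FILE $GATEWAY:/staging/$FILE"

# 2. Log in to the gateway machine and load the file into HDFS via the Hadoop CLI.
STEP2="ssh $GATEWAY hadoop fs -put /staging/$FILE /user/worker/"

echo "$STEP1"
echo "$STEP2"
```

The Knox gateway aims to replace both steps with a single remote call from the desktop, which is why the comparison to this pattern matters.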
> On 4/12/13 8:59 PM, Dilli Arumugam wrote:
>
>> First, my understanding of the current behavior:
>>
>> Gateway behavior:
>>
>> When a client wants to upload a file using the gateway, all the file content has to go through the gateway. If 1000 clients each upload a 1 GB file, 1 TB has to go through the gateway.
>>
>> WebHDFS behavior:
>>
>> When a client wants to upload a file using WebHDFS, it gets redirected to a datanode. The file content goes to the DataNode directly without passing through the NameNode. If 1000 clients each upload a 1 GB file, the 1 TB could be sent to potentially 1000 different datanodes, with each datanode getting 1 GB.
>>
>> If my understanding is right, the gateway could become a bottleneck during file uploads/reads. Am I misunderstanding things?
>>
>> Thanks
>> Dilli
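Dilli's description matches the documented WebHDFS two-step upload: the client first sends `op=CREATE` to the NameNode, which returns a `307 Temporary Redirect` whose `Location` header names a DataNode, and only then streams the file body to that DataNode. A dry-run sketch (the NameNode address below is hypothetical, and the commands are built as strings rather than run against a cluster):

```shell
# Hypothetical NameNode address; sketch of the WebHDFS two-step PUT.
NN=namenode.example.com:50070
HDFS_PATH=/user/worker/data.bin

# Step 1: ask the NameNode where to write. No file body is sent here; the
# response is "307 Temporary Redirect" with a DataNode URL in Location.
STEP1="curl -i -X PUT http://$NN/webhdfs/v1$HDFS_PATH?op=CREATE"

# Step 2: stream the actual bytes to the DataNode URL taken from that
# Location header, so the file content never passes through the NameNode.
STEP2="curl -i -X PUT -T data.bin <Location-from-step-1>"

echo "$STEP1"
echo "$STEP2"
```

This is exactly why WebHDFS spreads upload traffic across datanodes, while a gateway that proxies the file body sees all of it.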
