ASF GitHub Bot commented on BAHIR-75:
Github user sourav-mazumder commented on the issue:
@ckadner Here are my responses to your comments.
> Can you elaborate on differences/limitations/advantages over Hadoop
default "webhdfs" scheme? i.e.
> the main problem you are working around is that the Hadoop
WebHdfsFileSystem discards the Knox gateway path when creating the HTTP URL
(principal motivation for this connector), which makes it impossible to use it
with Knox
> the Hadoop WebHdfsFileSystem implements additional interfaces like:
This is automatically taken care of by Apache Knox, in my understanding.
One of the key goals of Apache Knox is to relieve Hadoop clients of the
intricacies of a Hadoop cluster's internal security implementation. So we don't
need to handle this in client-level code if the webhdfs request is passing
through Apache Knox.
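To illustrate the gateway path point raised above: as far as I understand the Knox URL mapping, a WebHDFS request through Knox carries a gateway path and a cluster (topology) name ahead of /webhdfs/v1, which a direct NameNode URL does not have. Below is a minimal Python sketch of the two URL shapes. The function names, host names, ports, and the "gateway"/"default" values are illustrative assumptions, not code from this PR.

```python
from urllib.parse import quote, urlencode


def direct_webhdfs_url(host, port, hdfs_path, **params):
    """URL shape for talking to a NameNode directly (no gateway path)."""
    query = urlencode({"op": "OPEN", **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def knox_webhdfs_url(host, port, gateway_path, cluster, hdfs_path, **params):
    """URL shape through Knox: gateway path and cluster name come first.

    gateway_path and cluster are deployment-specific; the values used in
    the example below are assumptions for illustration only.
    """
    query = urlencode({"op": "OPEN", **params})
    return (
        f"https://{host}:{port}/{gateway_path}/{cluster}"
        f"/webhdfs/v1{quote(hdfs_path)}?{query}"
    )


if __name__ == "__main__":
    print(direct_webhdfs_url("namenode.example.com", 50070, "/data/f.csv"))
    print(knox_webhdfs_url("knox.example.com", 8443, "gateway", "default",
                           "/data/f.csv"))
```

A client that rebuilds the URL from host and port alone would drop the "gateway/default" segment, which is the problem described in the quoted comment.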
> performance differences between your approach vs Hadoop's RemoteFS and
Say a remote Spark cluster needs to read a 2 GB file and spawns 16 parallel
connections to do so. In turn, 16 separate webhdfs calls are made to the remote
HDFS. However, although each call starts reading from a different offset, the
end byte for every one of them is the end of the file. So the first connection
creates an input stream from byte 0 to the end of the file, the second from
128 MB to the end of the file, the third from 256 MB to the end of the file,
and so on. As a result, the amount of data prepared on the server side for the
response, transferred over the wire, and read on the client side can
potentially be much more than the original file size (in this example, for a
2 GB file it can potentially be close to 17 GB). This number grows further with
more connections, and the increase is larger still for larger files.
In the approach used in this PR, for the above example, the total volume of
data read and transferred over the wire is always limited to 2 GB plus some
extra KBs (for record boundary resolution). This overhead grows only slightly
(still in the KB range) with more connections, and it does not depend on the
file size. So if a large file (in GBs) has to be read with a high number of
parallel connections, the amount of data processed on the server side,
transferred over the wire, and read on the client side is always limited to
the original file size plus some extra KBs (for record boundary resolution).
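The arithmetic behind the two figures above can be checked with a small Python sketch. It compares open-ended reads (each connection reads from its offset to the end of the file) with bounded reads (each connection reads only its own chunk plus a small overlap). The 64 KB overlap is an assumed stand-in for the "extra KBs" needed for record boundary resolution, and the function names are hypothetical, not taken from this PR.

```python
MB = 1024 * 1024


def split_offsets(file_size, connections):
    """Start offset of each connection's chunk, with equal-sized chunks."""
    chunk = -(-file_size // connections)  # ceiling division
    return chunk, [i * chunk for i in range(connections) if i * chunk < file_size]


def bytes_open_ended(file_size, connections):
    """Each connection reads from its offset to the end of the file."""
    _, offsets = split_offsets(file_size, connections)
    return sum(file_size - off for off in offsets)


def bytes_bounded(file_size, connections, overlap=64 * 1024):
    """Each connection reads its chunk plus an assumed overlap (64 KB here)."""
    chunk, offsets = split_offsets(file_size, connections)
    return sum(min(chunk + overlap, file_size - off) for off in offsets)


if __name__ == "__main__":
    size = 2048 * MB
    print(bytes_open_ended(size, 16) / (1024 * MB), "GB with open-ended reads")
    print(bytes_bounded(size, 16) / (1024 * MB), "GB with bounded reads")
```

For 2 GB and 16 connections the open-ended total is 16 x 2048 MB minus 128 MB x (0 + 1 + ... + 15), that is 17408 MB or 17 GB, while the bounded total stays at 2 GB plus 15 overlaps.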
> Some configuration parameters are specific to remote servers that should be
specified by server not on connector level (some at server level may override
connector level), i.e.
> gateway path (assuming one Knox gateway per server)
> user name and password
> authentication method (think Kerberos etc)
> certificate validation options (maybe overridden by server level props)
> webhdfs protocol version (maybe overridden by server level props)
> buffer sizes and file chunk sizes retry intervals etc
You are right. However, I would name the two levels Server Level and File
Level. Some parameters won't change from file to file: they are specific to a
remote HDFS server and are therefore Server Level parameters. The values of
other parameters can differ from file to file; these are File Level
parameters. The Server Level parameters are Gateway Path, User Name/Password,
Webhdfs protocol version, and Certificate Validation option (and other
parameters associated with that). The File Level parameters are buffer sizes,
file chunk sizes etc., which can differ from file to file.
I don't see a need for any property at the connector level (that is,
parameters which would be the same across all remote HDFS servers accessed by
the connector). All properties here relate either to how the remote HDFS
server is set up or to the type of file being accessed.
Let me know if I'm missing any aspect here.
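The Server Level / File Level split described above could be modelled roughly as follows. This is only a sketch in Python to make the grouping concrete: the class names, field names, and default values are assumptions for illustration and do not reflect the option names used in this PR.

```python
from dataclasses import asdict, dataclass
from typing import Optional

MB = 1024 * 1024


@dataclass(frozen=True)
class ServerLevelConfig:
    """Settings tied to one remote HDFS server; same for every file on it."""
    gateway_path: str
    user_name: str
    password: str
    webhdfs_version: str = "v1"            # assumed default
    validate_certificate: bool = True      # assumed default
    truststore_path: Optional[str] = None  # only relevant when validating


@dataclass(frozen=True)
class FileLevelConfig:
    """Settings that may differ from file to file (assumed defaults)."""
    buffer_size: int = 64 * 1024
    chunk_size: int = 128 * MB


def read_options(server, file_cfg):
    """Flatten both levels into the options for a single read."""
    return {**asdict(server), **asdict(file_cfg)}


if __name__ == "__main__":
    server = ServerLevelConfig("gateway/default", "guest", "guest-password")
    print(read_options(server, FileLevelConfig(chunk_size=64 * MB)))
```

Note that there is deliberately no third, connector-wide class here, in line with the point that every property belongs either to a server or to a file.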
> Given that users need to know about the remote Hadoop server configuration
(security, gateway path, etc) for WebHDFS access would it be nicer if ...
> users could separately configure server specific properties in a config
file or registry object
> and then in Spark jobs only use :/ without having to provide additional
That's a good idea. We can have a set of default values for these
parameters based on typical practice/convention. However, those default values
can be overridden if the user specifies their own.
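One way the "defaults plus overrides" idea above could work is a layered lookup: values given in the job win over values registered for the server, which in turn win over conventional defaults. The Python sketch below uses collections.ChainMap for that. The registry contents, the keys, the default values, and the resolve function are all hypothetical, chosen only to show the precedence order.

```python
from collections import ChainMap
from urllib.parse import urlparse

# Conventional defaults (assumed values, for illustration only).
DEFAULTS = {
    "gateway_path": "gateway/default",
    "webhdfs_version": "v1",
    "validate_certificate": True,
}

# Hypothetical per-server registry, keyed by host:port of the remote server.
SERVER_REGISTRY = {
    "knox.example.com:8443": {"gateway_path": "gateway/prod"},
}


def resolve(uri, overrides=None):
    """Resolve settings for a URI: job overrides > server registry > defaults."""
    server = urlparse(uri).netloc
    layers = ChainMap(overrides or {}, SERVER_REGISTRY.get(server, {}), DEFAULTS)
    return dict(layers)


if __name__ == "__main__":
    print(resolve("webhdfs://knox.example.com:8443/data/f.csv"))
    print(resolve("webhdfs://other.example.com:8443/data/f.csv",
                  {"validate_certificate": False}))
```

With this layout a Spark job only needs to pass the URI unless it wants to override something for that particular read.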
> what authentication methods are supported besides basic auth (i.e. OAuth,
Right now this PR supports basic auth at the Knox gateway level. Other
authentication mechanisms supported by Apache Knox (SAML, OAuth, CAS, OpenID)
are not supported yet.
Apache Knox complements the support for a Kerberized Hadoop cluster (where
nodes communicate and authenticate among themselves). Apache Knox takes care of
that transparently through the relevant configuration.
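For reference, the basic auth case mentioned above comes down to sending a standard HTTP Authorization header with each request to the gateway. A minimal Python sketch follows; the function name and the credentials are placeholders, and this is not the code path used in this PR.

```python
import base64


def basic_auth_header(user_name, password):
    """Build the HTTP Basic Authorization header for the given credentials."""
    raw = f"{user_name}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(basic_auth_header("guest", "guest-password"))
```

Since the credentials are only base64-encoded, not encrypted, this relies on the HTTPS connection to the gateway for confidentiality.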
> should the connector manage auth tokens, token renewal, etc
No. It is internally handled by Apache Knox.
> I don't think the connector should create a truststore, either skip
certificate validation or take a user provided truststore path (btw, the
current code fails to create a truststore on Mac OS X)
On second thought, I'm with you.
> the code should have logging at INFO, DEBUG, ERROR levels using the Spark
logging mechanisms (targeting the Spark log files)
> The outstanding unit tests should verify that the connector works with a ...
> standard Hadoop cluster (unsecured)
This PR focuses only on secured Hadoop clusters. An unsecured Hadoop cluster
can be accessed using the existing webhdfs client library available from
Hadoop, so we don't need this.
> Hadoop clusters secured by Apache Knox
> Hadoop clusters secured by other mechanisms like Kerberos
We need not, as that is more a feature of Apache Knox.
> WebHDFS: Initial Code Delivery
> Key: BAHIR-75
> URL: https://issues.apache.org/jira/browse/BAHIR-75
> Project: Bahir
> Issue Type: Sub-task
> Components: Spark SQL Data Sources
> Reporter: Sourav Mazumder
> Original Estimate: 504h
> Remaining Estimate: 504h
This message was sent by Atlassian JIRA