ASF GitHub Bot commented on BAHIR-75:

Github user sourav-mazumder commented on the issue:

    @ckadner Here goes my response to your comments
    > Can you elaborate on differences/limitations/advantages over Hadoop 
default "webhdfs" scheme? i.e.
    the main problem you are working around it that the Hadoop 
WebHdfsFileSystem discards Knox gateway path when creating Http URL (principal 
motivation for this connector) which makes it impossible to use it with Knox
    > the Hadoop WebHdfsFileSystem implements additional interfaces like: 
    This is automatically taken care of by Apache Knox, in my understanding. 
That is one of the key goals of Apache Knox to relieve hadoop clients from 
nitigrity of internal security implementation of a hadoop Cluster. So we don't 
need to handle this at the code in client level if the webhdfs request is 
passing through Apache Knox.
    > performance differences between your approach vs Hadoop's RemoteFS and 
    Say a remote Spark cluster needs to read a file of size 2 GB and the Spark 
Cluster spawns 16 connections in parallel to do the same. So in turn 16 
separate webhdfs calls are made to remote hdfs. However, though each call tries 
to read the data from different starting point, for each of them the end byte 
is the end of file. So first connection creates input stream corresponding to 
0th byte till end of file, second from 128MB till end of file, the 3rd from 256 
MB till and of file and so on. As a result of that the amount of data prepared 
in the server side for sending as response, the data transferred over the wire, 
and the data being read by the client side can potentially be much more than 
the original file size (in this example of 2 GB worth of original file it can 
potentially be close to 17 GB). This number would increase further more with 
more number of connections. For larger file size the extent of increase would 
be further higher too.
    In the approach used in this PR, for the above example, the total volume of 
data read and transferred over the wire will be always limited to 2 GB and some 
extra KBs (for record boundary resolution). This number will increase to a very 
less extent (still in KBs range) for more number of connections. And this 
increment will not depend on file size. So if a big volume of file (in GBs) has 
to be read with high number of connections in parallel the amount of data being 
processed at server side, transferred over the wire, and read at client side 
would be always limited to original file size and some extra KBs (for record 
boundary resolution).
    > Configuration
    Some configuration parameters are specific to remote servers that should be 
specified by server not on connector level (some at server level may override 
connector level), i.e.
    Server level: 
    gateway path (assuming one Knox gateway per server)
    user name and password
    authentication method (think Kerberos etc)
    Connector level: 
    certificate validation options (maybe overridden by server level props)
    trustStore path
    webhdfs protocol version (maybe overridden by server level props)
    buffer sizes and file chunk sizes retry intervals etc
    You are right. However, I would put the 2 levels as Server Level and File 
Level. Some parameters won't change from file to file - they are specific to a 
remote hdfs server and therefore Server level parameters. Where as value of 
some parameters can be different from file to file. These are File level 
parameters. The Server Level parameters are - Gateway Path, User Name/Pasword, 
Webhdfs protocol version, Certificate Validation option (and other parameters 
associated with that). Where as File Level parameters are buffer sizes, file 
chunks sizes etc which can be different from File to File. 
    I don't see need for any property at connector level (the parameters which 
which would be same across different remote hdfs servers accessed by the 
connector). All properties here are related to either the nature of 
implementation of the remote HDFS server or the type of file being accessed. 
Let me know if I'm missing out any aspect here.
    > Usability
    Given that users need to know about the remote Hadoop server configuration 
(security, gateway path, etc) for WebHDFS access would it be nicer if ...
    users could separately configure server specific properties in a config 
file or registry object
    and then in Spark jobs only use :/ without having to provide additional 
    That's a good idea. We can have a set of default values for these 
parameters based on typical practice/convention. However, those default values 
can be overwritten if specified by user.
    > Security
    what authentication methods are supported besides basic auth (i.e. OAuth, 
Kerberos, ...)
    Right now this PR supports basic Auth at the Knox gateway level. Other 
authentication mechanisms supported by Apache Knox (SAML, OAuth, CAS, OpenId) 
are not supported yet.
    Apache Knox complements support for Kerberized hadoop cluster (for nodes to 
communicate and authenticate among themselves). That would be taken care of by 
Apache Knox transparently through relevant configuration.
    > should the connector manage auth tokens, token renewal, etc
    No. It is internally handled by Apache Knox. 
    > I don't think the connector should create a truststore, either skip 
certificate validation or take a user provided truststore path (btw, the 
current code fails to create a truststore on Mac OS X)
    On a second thought I'm with you
    > Debugging
    the code should have logging at INFO, DEGUG, ERROR levels using the Spark 
logging mechanisms (targeting the Spark log files)
    > Testing
    The outstanding unit tests should verify that the connector works with a ...
    standard Hadoop cluster (unsecured)
    This PR focuses only on secured Hadoop cluster. Unsecured hadoop cluster 
can be accessed using existing webhdfs client library available from hadoop. So 
we don't need this.
    > Hadoop clusters secured by Apache Knox
    > Hadoop clusters secured by other mechanisms like Kerberos
    We need not as it is more a feature of Apache Knox. 

> WebHDFS: Initial Code Delivery
> ------------------------------
>                 Key: BAHIR-75
>                 URL: https://issues.apache.org/jira/browse/BAHIR-75
>             Project: Bahir
>          Issue Type: Sub-task
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 504h
>  Remaining Estimate: 504h

This message was sent by Atlassian JIRA

Reply via email to