[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-30 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710156#comment-15710156
 ] 

Christian Kadner commented on BAHIR-67:
---

It looks like the only method we would need to override is 
{{WebHdfsFileSystem#toUrl}}:


{code:title=org.apache.hadoop.hdfs.web.WebHdfsFileSystem|borderStyle=solid}
  URL toUrl(final HttpOpParam.Op op, final Path fspath,
      final Param... parameters) throws IOException {
    // initialize URI path and query
    final String path = PATH_PREFIX // PATH_PREFIX = "/webhdfs/v1"
        + (fspath == null ? "/" : makeQualified(fspath).toUri().getRawPath());
    final String query = op.toQueryString()
        + Param.toSortedString("&", getAuthParameters(op))
        + Param.toSortedString("&", parameters);
    final URL url = getNamenodeURL(path, query);
    LOG.trace("url={}", url);
    return url;
  }
{code}
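To illustrate where an overriding implementation could hook in, the URL assembly above can be sketched as follows (the Knox gateway prefix {{/gateway/default}} and the host names below are hypothetical example values, not part of the Hadoop code):

```python
# Sketch of the WebHdfsFileSystem#toUrl URL assembly, showing where a custom
# subclass could inject a Knox gateway path segment before PATH_PREFIX.
# Hosts and the "/gateway/default" prefix are hypothetical example values.

PATH_PREFIX = "/webhdfs/v1"

def to_url(namenode, fs_path, op, gateway_prefix="", params=None):
    """Build a WebHDFS REST URL; gateway_prefix is prepended to the path."""
    path = gateway_prefix + PATH_PREFIX + (fs_path or "/")
    query = "&".join(["op=" + op] +
                     sorted(f"{k}={v}" for k, v in (params or {}).items()))
    return f"http://{namenode}{path}?{query}"

# Plain WebHDFS, as toUrl produces today:
print(to_url("nn-host:50070", "/tmp/data.csv", "OPEN"))
# -> http://nn-host:50070/webhdfs/v1/tmp/data.csv?op=OPEN

# With a gateway prefix injected by a custom file system implementation:
print(to_url("knox-host:8443", "/tmp/data.csv", "OPEN",
             gateway_prefix="/gateway/default"))
# -> http://knox-host:8443/gateway/default/webhdfs/v1/tmp/data.csv?op=OPEN
```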

> WebHDFS Data Source for Spark SQL
> -
>
> Key: BAHIR-67
> URL: https://issues.apache.org/jira/browse/BAHIR-67
> Project: Bahir
>  Issue Type: New Feature
>  Components: Spark SQL Data Sources
>Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Ability to read/write data in Spark from/to the HDFS of a remote Hadoop cluster.
> In today's world of analytics, many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark integrates well 
> with a local Hadoop cluster, it largely lacks the capability to connect to a 
> remote Hadoop cluster. In reality, not all enterprise data resides in Hadoop, 
> and running a Spark cluster co-located with the Hadoop cluster is not always 
> a solution.
> In this improvement we propose to create a connector for accessing data (read 
> and write) from/to the HDFS of a remote Hadoop cluster from Spark, using the 
> WebHDFS API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-30 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709401#comment-15709401
 ] 

Christian Kadner commented on BAHIR-67:
---

Hi Sourav,

As I understand your code, the problems in the Hadoop client code that you are 
trying to work around are the user authentication (properties) and making sure 
the Knox gateway path segment is included in the HTTP(S) URL(s) produced by 
the WebHDFS file system client code.

Most of the remaining code in your connector is a duplication or close 
adaptation of the Spark CSV code (parser, reader, writer, ...).

Would it make sense to instead only override the class 
org.apache.hadoop.hdfs.web.WebHdfsFileSystem and provide our own implementation 
of it via the property fs.webhdfs.impl? This custom "BahirWebHdfsFileSystem" 
implementation could take care of the authentication (properties) and of 
injecting the Knox gateway path segment into the HTTP(S) URL(s) sent to the 
remote Hadoop cluster, ideally in a configurable way that could be applied to 
other types of secured Hadoop systems besides Apache Knox.
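Registering such a custom file system could look like the following Hadoop configuration fragment (the fully qualified class name below is hypothetical, matching the "BahirWebHdfsFileSystem" suggestion above):

```xml
<!-- core-site.xml: hypothetical registration of a custom WebHDFS file system -->
<property>
  <name>fs.webhdfs.impl</name>
  <value>org.apache.bahir.datasource.webhdfs.BahirWebHdfsFileSystem</value>
</property>
```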



[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-18 Thread Sourav Mazumder (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676999#comment-15676999
 ] 

Sourav Mazumder commented on BAHIR-67:
--

Hi Steve,

A few follow-up questions to get more clarity on your comment:

1. Are you suggesting use of the hadoop-hdfs/hadoop-hdfs-client JAR so that
we can use APIs as "webhdfs://:/" instead of "http://
:/webhdfs/v1/?op=..."? (I'm referring to the section
"FileSystem URIs vs HTTP URLs" in
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#FileSystem_URIs_vs_HTTP_URLs)

2. Are you suggesting to use this in the main code or in the integration test
code?

Regards,
Sourav
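
For reference, the mapping between the two addressing styles in that documentation section can be sketched like this (the host, port, and path are made-up example values):

```python
# Sketch: translate a webhdfs:// FileSystem URI into the equivalent HTTP REST
# URL, per the "FileSystem URIs vs HTTP URLs" section of the WebHDFS docs.
# The host, port, and path are made-up example values.
from urllib.parse import urlparse

def fs_uri_to_http_url(fs_uri, op):
    parsed = urlparse(fs_uri)
    assert parsed.scheme == "webhdfs", "expected a webhdfs:// URI"
    return f"http://{parsed.netloc}/webhdfs/v1{parsed.path}?op={op}"

print(fs_uri_to_http_url("webhdfs://namenode:50070/user/demo/data.csv", "OPEN"))
# -> http://namenode:50070/webhdfs/v1/user/demo/data.csv?op=OPEN
```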







[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-18 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15676859#comment-15676859
 ] 

Steve Loughran commented on BAHIR-67:
-

This is what confuses me: all you should need to do for WebHDFS access is use 
the right path, one that begins with {{webhdfs://}}; the implementation is in 
the hadoop-hdfs/hadoop-hdfs-client JAR and it should just work. What would be 
useful is integration tests; a MiniDFSCluster can be brought up with WebHDFS 
enabled for that testing.
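
In sketch form, such an integration test could bring up an in-JVM cluster roughly as follows (Java-like pseudocode; the exact builder options and the way to obtain the NameNode HTTP address differ across Hadoop versions, and {{namenodeHttpAddress}} is a placeholder helper):

```
// Java-like pseudocode sketch: in-JVM HDFS cluster with WebHDFS enabled
Configuration conf = new Configuration();
conf.setBoolean("dfs.webhdfs.enabled", true);
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
// connect through the webhdfs:// scheme against the NameNode HTTP address
FileSystem webhdfs = FileSystem.get(
    URI.create("webhdfs://" + namenodeHttpAddress(cluster)), conf);
// ... exercise the data source's read/write paths through webhdfs ...
cluster.shutdown();
```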



[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-11-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15631140#comment-15631140
 ] 

ASF GitHub Bot commented on BAHIR-67:
-

Github user ckadner commented on the issue:

https://github.com/apache/bahir/pull/25
  
@sourav-mazumder please add a description to this PR (which could include 
outstanding issues), tag the PR title with the JIRA key, and add the tag 
`[WIP]` while work on this PR is ongoing, i.e. the title could look like 
`"[BAHIR-67][WIP] Create WebHDFS data source for Spark"` -- Thanks




[jira] [Commented] (BAHIR-67) WebHDFS Data Source for Spark SQL

2016-10-17 Thread Sourav Mazumder (JIRA)

[ 
https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582597#comment-15582597
 ] 

Sourav Mazumder commented on BAHIR-67:
--

[~ste...@apache.org] Thanks Steve for your comments.

This is much more than merely calling the WebHDFS API for basic I/O.

The specific features that should go here are:

1. Addressing the issues of getting data from (and writing data back to) a 
remote HDFS from a remote Spark cluster in a performance-efficient way, so 
that large volumes of data can be pulled in over multiple connections in 
parallel, with optimal data transfer across the clusters while preserving 
record boundaries.

2. Handling SSL and authentication related issues transparently, so that the 
user does not need to write code for those things.

3. Enabling this as a custom data source in Spark so that users can use it 
just like any other data source (e.g. JDBC).
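
Point 3 above would allow usage along these lines (a hypothetical sketch in Java-like pseudocode; the format name "webhdfs" and the option names are illustrative, not a finalized API):

```
// hypothetical usage sketch -- format and option names are illustrative
Dataset<Row> df = spark.read()
    .format("webhdfs")
    .option("userCred", "user:password")   // illustrative credential option
    .load("webhdfs://remote-host:50070/user/demo/data.csv");
df.show();
```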


