Sourav Mazumder commented on BAHIR-67:

[~ste...@apache.org] Thanks Steve for your comments.

This is much more than merely calling the webhdfs API for basic I/O.

The specific features that should go here are:

1. Getting data from (and writing data back to) a remote HDFS from a remote 
Spark cluster in a performance-efficient way, so that large volumes of data 
can be pulled in over multiple parallel connections with optimal data 
transfer across the clusters while preserving record boundaries.

2. Handling SSL and authentication issues transparently, so that the user 
does not need to write code for those things.

3. Enabling this as a custom data source in Spark, so that users can use it 
just like any other data source (e.g. JDBC).
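As an illustrative sketch of point 1: `op=OPEN` with `offset` and `length` are real WebHDFS REST parameters, but the class, host, and path below are hypothetical, and a real connector would additionally extend each range to the next record delimiter to preserve record boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: divide a remote file into byte ranges that Spark
// tasks could fetch in parallel through WebHDFS OPEN calls.
public class WebHdfsSplits {
    public record Split(long offset, long length) {}

    // Divide fileSize bytes into at most numSplits contiguous ranges.
    public static List<Split> splits(long fileSize, int numSplits) {
        long chunk = Math.max(1L, (fileSize + numSplits - 1) / numSplits);
        List<Split> out = new ArrayList<>();
        for (long off = 0; off < fileSize; off += chunk) {
            out.add(new Split(off, Math.min(chunk, fileSize - off)));
        }
        return out;
    }

    // Build the WebHDFS OPEN URL for one byte range (host/path are examples).
    public static String openUrl(String host, String path, Split s) {
        return "http://" + host + "/webhdfs/v1" + path
                + "?op=OPEN&offset=" + s.offset() + "&length=" + s.length();
    }
}
```

Each range could then be fetched by a separate Spark task over its own HTTP connection, which is where the parallelism in point 1 would come from.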

> WebHDFS Data Source for Spark SQL
> ---------------------------------
>                 Key: BAHIR-67
>                 URL: https://issues.apache.org/jira/browse/BAHIR-67
>             Project: Bahir
>          Issue Type: Improvement
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster
> In today's world of Analytics many use cases need the capability to access 
> data from multiple remote data sources in Spark. Though Spark has great 
> integration with a local Hadoop cluster, it largely lacks the capability to 
> connect to a remote Hadoop cluster. In reality, however, not all enterprise 
> data resides in Hadoop, and running the Spark cluster co-located with the 
> Hadoop cluster is not always a solution.
> In this improvement we propose to create a connector for accessing data 
> (read and write) from/to HDFS of a remote Hadoop cluster from Spark, using 
> the WebHDFS API.

This message was sent by Atlassian JIRA
