[ https://issues.apache.org/jira/browse/BAHIR-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15676530#comment-15676530 ]

Luciano Resende commented on BAHIR-67:
--------------------------------------

Having to duplicate support for every file format in the WebHDFS data source was 
looking to me like a lot of work, and a signal that there might be something wrong 
with some part of the Spark design. After some investigation, it seems that Spark 
treats these file-based data sources as a special case: they all extend 
TextBasedFileFormat, which handles the task of accessing and reading the files, 
regardless of whether they actually live locally or in HDFS. 
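
As a quick illustration (hypothetical host, port and path, and assuming the remote 
NameNode has WebHDFS enabled), the built-in readers may already be able to resolve a 
webhdfs:// URI through the Hadoop FileSystem API, since that scheme maps to Hadoop's 
WebHdfsFileSystem:

{code:scala}
// Rough sketch, not a tested configuration: the built-in CSV reader goes
// through TextBasedFileFormat, which resolves the path via the Hadoop
// FileSystem API, so a webhdfs:// URI may work without a new connector.
import org.apache.spark.sql.SparkSession

object WebHdfsReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("webhdfs-read-sketch")
      .master("local[*]")
      .getOrCreate()

    // webhdfs:// maps to Hadoop's WebHdfsFileSystem, which talks to the
    // remote NameNode over the WebHDFS REST API (host/port/path are made up).
    val df = spark.read
      .option("header", "true")
      .csv("webhdfs://remote-namenode.example.com:50070/data/sample.csv")

    df.show()
    spark.stop()
  }
}
{code}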

What we are currently implementing here probably works and should give us a bit 
more flexibility, but in the end it might leave us with a lot of duplicated code. 

I am just adding this comment to make sure we are all aware of this special 
case in Spark, and that we keep an open mind about the best approach for 
bringing WebHDFS support to the file-based data sources.
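
To make the trade-off concrete, below is a rough sketch (all class names, option 
names and REST details are hypothetical, not the actual patch) of the kind of 
dedicated data source we would otherwise maintain: a RelationProvider that pulls 
bytes over the WebHDFS REST API and then has to carry its own parsing logic for 
each file format, which is exactly the duplication mentioned above.

{code:scala}
// Hypothetical sketch of a dedicated WebHDFS data source (not the real patch).
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new WebHdfsRelation(parameters("path"))(sqlContext)
}

class WebHdfsRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // A real connector would infer a per-format schema; here every line is one string.
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))

  override def buildScan(): RDD[Row] = {
    // Placeholder: fetch `path` via the WebHDFS REST endpoint and split into lines.
    // Parsing CSV, JSON, Parquet, etc. would each need its own code path here,
    // unlike Spark's built-in FileFormat readers.
    val lines: Seq[String] = Seq.empty
    sqlContext.sparkContext.parallelize(lines).map(Row(_))
  }
}
{code}

Compare that with the built-in readers in the snippet above, where the format 
parsing stays inside Spark and only the FileSystem scheme changes.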

> WebHDFS Data Source for Spark SQL
> ---------------------------------
>
>                 Key: BAHIR-67
>                 URL: https://issues.apache.org/jira/browse/BAHIR-67
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark SQL Data Sources
>            Reporter: Sourav Mazumder
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Ability to read/write data in Spark from/to HDFS of a remote Hadoop Cluster
> In today's world of analytics, many use cases need the capability to access data 
> from multiple remote data sources in Spark. Though Spark has great 
> integration with a local Hadoop cluster, it is heavily lacking in its capability 
> to connect to a remote Hadoop cluster. However, in reality not all enterprise 
> data is in Hadoop, and running the Spark cluster locally with the Hadoop 
> cluster is not always a solution.
> In this improvement we propose to create a connector for accessing data (read 
> and write) from/to the HDFS of a remote Hadoop cluster from Spark using the 
> WebHDFS API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
