[
https://issues.apache.org/jira/browse/PHOENIX-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388868#comment-14388868
]
ASF GitHub Bot commented on PHOENIX-1071:
-----------------------------------------
Github user jmahonin commented on the pull request:
https://github.com/apache/phoenix/pull/59#issuecomment-88176830
Thanks for the feedback @mravi; point-by-point responses below:
1: Right, I'll try to get that sorted out. The original phoenix-spark library
would not work with 1.7 for some reason, but that may no longer be the case.
2: Good catch. I think IntelliJ did something a little funny on me here;
that file was supposed to be in the main source hierarchy.
3 / 4: It's my first kick at extending Spark (and Phoenix, for that matter),
but the naming scheme and file separation were modelled on DataStax's
Spark-Cassandra connector, which I figured is as good a model as any:
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector
In theory, doing it that way means a user can have just one import to get
all the nice implicit definitions:
`import org.apache.spark.phoenix._`
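For illustration, the pattern that single import enables looks roughly like
this (all names here are hypothetical, sketched after the
spark-cassandra-connector layout, not the actual code in this PR):
```scala
package org.apache.spark

import org.apache.spark.rdd.RDD

package object phoenix {

  // After `import org.apache.spark.phoenix._`, every SparkContext
  // gains a phoenixTable(...) method through this implicit class.
  implicit class PhoenixSparkContext(val sc: SparkContext) extends AnyVal {
    def phoenixTable(table: String): RDD[Seq[Any]] = ???  // stub for the sketch
  }

  // Likewise, RDDs gain a way to write back to Phoenix.
  implicit class PhoenixRDDFunctions[A](val rdd: RDD[A]) extends AnyVal {
    def saveToPhoenix(table: String, columns: Seq[String]): Unit = ???  // stub
  }
}
```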
5: I've never had much luck getting the Scala integration working well
in any IDE; I just run 'mvn test' from the CLI.
Re: Good to haves
1. I totally agree, but I don't think I can afford the cycles at the
moment. My hope was that by modelling it after the spark-cassandra-connector,
it would be relatively painless to add, either for a third party or for myself
in the hopefully not-too-distant future.
2. Great idea; I hadn't actually seen that usage with Spark SQL yet, as
we're still using the RDD API internally. At a quick glance it looks fairly
straightforward to implement.
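For reference, the kind of Spark SQL usage I'd expect this to enable, using the
Spark 1.3 DataFrame API (the `Coffee` row type and the source RDD are stand-ins
for whatever the connector produces):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Stand-in row type; the real connector would map Phoenix columns itself
case class Coffee(name: String, origin: String)

object SparkSqlSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("phoenix-sql-sketch").setMaster("local[2]"))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Pretend this RDD came out of Phoenix via the connector
  val coffees = sc.parallelize(Seq(Coffee("antigua", "GT"), Coffee("java", "ID")))

  // Register it as a temp table and query it with SQL
  coffees.toDF().registerTempTable("coffees")
  sqlContext.sql("SELECT origin, COUNT(*) FROM coffees GROUP BY origin").show()

  sc.stop()
}
```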
> Provide integration for exposing Phoenix tables as Spark RDDs
> -------------------------------------------------------------
>
> Key: PHOENIX-1071
> URL: https://issues.apache.org/jira/browse/PHOENIX-1071
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Andrew Purtell
>
> A core concept of Apache Spark is the resilient distributed dataset (RDD), a
> "fault-tolerant collection of elements that can be operated on in parallel".
> One can create RDDs referencing a dataset in any external storage system
> offering a Hadoop InputFormat, like PhoenixInputFormat (with
> PhoenixOutputFormat covering the write path). There could be opportunities
> for additional interesting and deep integration.
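> For illustration, a rough sketch of obtaining such an RDD through the generic
> Hadoop API ({{CoffeeWritable}} is a hypothetical {{DBWritable}}; the
> {{PhoenixMapReduceUtil}} call is approximate):
> {code}
> import org.apache.hadoop.io.NullWritable
> import org.apache.hadoop.mapreduce.Job
> import org.apache.phoenix.mapreduce.PhoenixInputFormat
> import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sc = new SparkContext(new SparkConf().setAppName("phoenix-input"))
>
> // Point the input format at a Phoenix table and an optional predicate
> val job = Job.getInstance(sc.hadoopConfiguration)
> PhoenixMapReduceUtil.setInput(job, classOf[CoffeeWritable], "COFFEES", "ORIGIN = 'GT'")
>
> // Records arrive as (NullWritable, CoffeeWritable) pairs
> val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
>   classOf[PhoenixInputFormat[CoffeeWritable]],
>   classOf[NullWritable],
>   classOf[CoffeeWritable])
> {code}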
> Add the ability to save RDDs back to Phoenix with a {{saveAsPhoenixTable}}
> action, implicitly creating necessary schema on demand.
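> A hypothetical implementation could wrap the existing output format along
> these lines (names illustrative; on-demand schema creation omitted):
> {code}
> import org.apache.hadoop.io.NullWritable
> import org.apache.hadoop.mapreduce.Job
> import org.apache.phoenix.mapreduce.PhoenixOutputFormat
> import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil
> import org.apache.spark.rdd.RDD
>
> def saveAsPhoenixTable(rdd: RDD[(NullWritable, CoffeeWritable)],
>                        table: String, columns: String): Unit = {
>   val job = Job.getInstance()
>   PhoenixMapReduceUtil.setOutput(job, table, columns)
>   job.setOutputFormatClass(classOf[PhoenixOutputFormat[CoffeeWritable]])
>   rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
> }
> {code}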
> Add support for {{filter}} transformations that push predicates to the server.
> Add a new {{select}} transformation supporting a LINQ-like DSL, for example:
> {code}
> // Count the number of different coffee varieties offered by each
> // supplier from Guatemala
> phoenixTable("coffees")
>   .select(c => where(c.origin == "GT"))
>   .countByKey()
>   .foreach(r => println(r._1 + "=" + r._2))
> {code}
> Support conversions between Scala and Java types and Phoenix table data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)