[ https://issues.apache.org/jira/browse/PHOENIX-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389417#comment-14389417 ]
ASF GitHub Bot commented on PHOENIX-1071:
-----------------------------------------

Github user apurtell commented on the pull request:

https://github.com/apache/phoenix/pull/59#issuecomment-88248990

I'm not a Spark expert, @JamesRTaylor. I skimmed the latest. Allowing builds with JDK 1.7 would have been the big change I'd have recommended, and it's already been done.

I checked out this PR and ran a build, which completed. I was able to run the unit tests of the new module from the Maven command line on Linux, FWIW:

{code}
$ mvn -DskipTests clean install
$ mvn test -rf :phoenix-spark
[...]
- Can create valid SQL
- Can convert Phoenix schema
- Can create schema RDD and execute query
- Can create schema RDD and execute query on case sensitive table (no config)
- Can create schema RDD and execute constrained query
- Using a predicate referring to a non-existent column should fail
- Can create schema RDD with predicate that will never match
- Can create schema RDD with complex predicate
- Can query an array table
- Can read a table as an RDD
- Can save to phoenix table
- Can save Java and Joda dates to Phoenix (no config)
- Not specifying a zkUrl or a config quorum URL should fail
Run completed in 1 minute, 12 seconds.
Total number of tests run: 13
Suites: completed 2, aborted 0
Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
{code}

(A sketch of what the read/write paths exercised by these tests look like from user code is at the end of this comment.)

With 7u75 I ran out of PermGen running PhoenixRDDTest, but fixed that:

{code}
diff --git a/phoenix-spark/pom.xml b/phoenix-spark/pom.xml
index 5c0c754..21baa16 100644
--- a/phoenix-spark/pom.xml
+++ b/phoenix-spark/pom.xml
@@ -503,6 +503,7 @@
         <configuration>
           <parallel>true</parallel>
           <tagsToExclude>Integration-Test</tagsToExclude>
+          <argLine>-Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m</argLine>
         </configuration>
       </execution>
       <execution>
{code}

The unit tests are not robust against parallel execution with other HBase or Phoenix test suite invocations on the same host, but this can be fixed in a follow-up issue with random ports and rebinding (a sketch of that approach is also below).

LGTM for a commit to trunk with some minor follow-ups.

> Extend the org.apache.spark.sql.sources.RelationProvider and have PhoenixDatasource.

Maybe we should split this work up. The integration as-is is directly useful on its own. The SparkSQL integration nice-to-have can be additional work on a new JIRA / PR? A rough sketch of that extension point follows below.
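For reference, here is roughly what the read/write paths exercised by those tests look like from user code. This is a sketch only: the method names (phoenixTableAsRDD, saveToPhoenix) and the zkUrl option are my reading of the test names above, not verified against this PR's sources.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.phoenix.spark._

object PhoenixSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-spark-sketch"))

    // Read a Phoenix table as an RDD of column-name -> value maps.
    val rdd = sc.phoenixTableAsRDD(
      "TABLE1", Seq("ID", "COL1"), zkUrl = Some("localhost:2181"))
    println(s"read ${rdd.count()} rows")

    // Save a collection of tuples back to a Phoenix table.
    sc.parallelize(Seq((1L, "foo"), (2L, "bar")))
      .saveToPhoenix("OUTPUT_TABLE", Seq("ID", "COL1"), zkUrl = Some("localhost:2181"))

    sc.stop()
  }
}
{code}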
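On the port collisions, a minimal sketch of the random-port idea, assuming the mini-cluster setup lets each test suite choose its own ports. Binding to port 0 makes the OS pick a currently-free ephemeral port; on the rare collision between close and reuse, the suite would simply rebind and retry.

{code}
import java.net.ServerSocket

object PortUtil {
  // Ask the OS for a free ephemeral port instead of hardcoding one, so
  // concurrent HBase/Phoenix test suites on the same host don't collide.
  def randomFreePort(): Int = {
    val socket = new ServerSocket(0)
    try socket.getLocalPort
    finally socket.close()
  }
}
{code}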
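And to help frame that follow-up, a rough sketch of the extension point against the Spark 1.3 data sources API. Everything here is hypothetical: the option keys ("table", "zkUrl") and the placeholder schema and scan are stand-ins, not this PR's code.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class PhoenixDatasource extends RelationProvider {
  // Spark calls this when a query names this data source; parameters carry
  // the user-supplied options.
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new PhoenixRelation(parameters("table"), parameters("zkUrl"))(sqlContext)
}

class PhoenixRelation(table: String, zkUrl: String)
                     (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // A real implementation would derive this from Phoenix table metadata;
  // a single string column stands in here.
  override def schema: StructType =
    StructType(Seq(StructField("COL1", StringType)))

  // A real implementation would scan the Phoenix table and convert results
  // to Rows; an empty RDD stands in here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}
{code}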
> Provide integration for exposing Phoenix tables as Spark RDDs
> -------------------------------------------------------------
>
>                 Key: PHOENIX-1071
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1071
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Andrew Purtell
>
> A core concept of Apache Spark is the resilient distributed dataset (RDD), a "fault-tolerant collection of elements that can be operated on in parallel". One can create RDDs referencing a dataset in any external storage system offering a Hadoop InputFormat, like PhoenixInputFormat and PhoenixOutputFormat. There could be opportunities for additional interesting and deep integration.
> Add the ability to save RDDs back to Phoenix with a {{saveAsPhoenixTable}} action, implicitly creating necessary schema on demand.
> Add support for {{filter}} transformations that push predicates to the server.
> Add a new {{select}} transformation supporting a LINQ-like DSL, for example:
> {code}
> // Count the number of different coffee varieties offered by each
> // supplier from Guatemala
> phoenixTable("coffees")
>   .select(c =>
>     where(c.origin == "GT"))
>   .countByKey()
>   .foreach(r => println(r._1 + "=" + r._2))
> {code}
> Support conversions between Scala and Java types and Phoenix table data.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)