Github user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/363#discussion_r11424723
--- Diff: python/pyspark/context.py ---
@@ -460,6 +463,189 @@ def sparkUser(self):
"""
return self._jsc.sc().sparkUser()
+class SQLContext:
+ """
+ Main entry point for SparkSQL functionality. A SQLContext can be used to create L{SchemaRDD}s,
+ register L{SchemaRDD}s as tables, execute SQL over tables, cache tables, and read parquet files.
+ """
+
+ def __init__(self, sparkContext):
+ """
+ Create a new SQLContext.
+
+ @param sparkContext: The SparkContext to wrap.
+
+ >>> from pyspark.context import SQLContext
+ >>> sqlCtx = SQLContext(sc)
+
+ >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
+ ... {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
+
+ >>> srdd = sqlCtx.applySchema(rdd)
+ >>> sqlCtx.applySchema(srdd) # doctest: +IGNORE_EXCEPTION_DETAIL
+ Traceback (most recent call last):
+ ...
+ ValueError:...
+
+ >>> bad_rdd = sc.parallelize([1,2,3])
+ >>> sqlCtx.applySchema(bad_rdd) # doctest: +IGNORE_EXCEPTION_DETAIL
+ Traceback (most recent call last):
+ ...
+ ValueError:...
+
+ >>> allTypes = sc.parallelize([{"int" : 1, "string" : "string", "double" : 1.0, "long": 1L,
+ ... "boolean" : True}])
+ >>> srdd = sqlCtx.applySchema(allTypes).map(lambda x: (x.int, x.string, x.double, x.long,
+ ... x.boolean))
+ >>> srdd.collect()[0]
+ (1, u'string', 1.0, 1, True)
+ """
+ self._sc = sparkContext
+ self._jsc = self._sc._jsc
+ self._jvm = self._sc._jvm
+
+ @property
+ def _ssql_ctx(self):
+ """
+ Accessor for the JVM SparkSQL context. Subclasses can override this property to provide
+ their own JVM contexts.
+ """
+ if not hasattr(self, '_scala_SQLContext'):
+ self._scala_SQLContext = self._jvm.SQLContext(self._jsc.sc())
+ return self._scala_SQLContext
+
+ def applySchema(self, rdd):
--- End diff ---
How about `inferSchema`?
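
For illustration, a minimal sketch of how the doctest usage would read under that name. This assumes only the method name changes from `applySchema` to `inferSchema` (the suggested name, not a confirmed API), and that `sc` is an existing SparkContext, as in the doctests above:

    >>> from pyspark.context import SQLContext
    >>> sqlCtx = SQLContext(sc)
    >>> rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"},
    ...                       {"field1" : 2, "field2" : "row2"}])
    >>> srdd = sqlCtx.inferSchema(rdd)  # hypothetical rename of applySchema in the diff above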