GitHub user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/999#discussion_r13827652
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+ print teenName
{% endhighlight %}
</div>
</div>
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
--- End diff --
How about:
Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. This conversion can be done using one of two methods:
- `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
- `jsonRDD` - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
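A rough sketch of how the two methods might be demonstrated in the guide (the file path, sample record, and table name are made up for illustration; `sc` is an existing SparkContext and `sqlContext` an existing SQLContext):

```scala
// jsonFile: load a directory of files where each line is one JSON object.
// (hypothetical example path)
val people = sqlContext.jsonFile("examples/src/main/resources/people.json")

// jsonRDD: infer the schema from an existing RDD[String] of JSON objects.
val anotherRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherRDD)

// Either result is a SchemaRDD and can be registered as a table.
people.registerAsTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
```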
We should probably also have a brief section on the rules of schema inference, as they are non-obvious.
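Something like the following could motivate that section (a hypothetical sketch; the exact widening and fallback rules should be stated from the implementation rather than assumed here):

```scala
// Two records whose fields disagree in type.
val conflicting = sc.parallelize(
  """{"a": 1, "b": "str"}""" ::
  """{"a": 2.5, "b": 10}""" :: Nil)
val inferred = sqlContext.jsonRDD(conflicting)
inferred.printSchema()
// Plausible behavior worth documenting: compatible numeric types widen
// (a becomes a double-typed field), while irreconcilable types fall back
// to string (b becomes a string-typed field).
```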
A future TODO: there is no reason we need the JSON objects to be one per line, right? The parser can tell when an object has not yet ended, so it could continue reading past a line break.