GitHub user marmbrus commented on a diff in the pull request:
https://github.com/apache/spark/pull/999#discussion_r13827652
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
# Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerAsTable("parquetFile")
teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+ print teenName
{% endhighlight %}
</div>
</div>
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
--- End diff --
How about:
Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. This conversion can be done using one of two methods:
- `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
- `jsonRDD` - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
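A rough sketch of how the two methods might be demonstrated in the guide (the file path, sample record, and table name are made up for illustration; `sc` is an existing SparkContext and `sqlContext` an existing SQLContext):

```scala
// jsonFile: load a directory of files where each line is one JSON object.
// (hypothetical example path)
val people = sqlContext.jsonFile("examples/src/main/resources/people.json")

// jsonRDD: infer the schema from an existing RDD[String] of JSON objects.
val anotherRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherRDD)

// Either result is a SchemaRDD and can be registered as a table.
people.registerAsTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
```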
We should probably also have a brief section on the rules of schema inference, as they are non-obvious.
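Something like the following could motivate that section (a hypothetical sketch; the exact widening and fallback rules should be stated from the implementation rather than assumed here):

```scala
// Two records whose fields disagree in type.
val conflicting = sc.parallelize(
  """{"a": 1, "b": "str"}""" ::
  """{"a": 2.5, "b": 10}""" :: Nil)
val inferred = sqlContext.jsonRDD(conflicting)
inferred.printSchema()
// Plausible behavior worth documenting: compatible numeric types widen
// (a becomes a double-typed field), while irreconcilable types fall back
// to string (b becomes a string-typed field).
```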
A future TODO: there is no reason we need the JSON objects to be one per line, right? The parser can tell when an object has not yet ended, so it could continue reading past a line break.