allisonwang-db commented on code in PR #43897:
URL: https://github.com/apache/spark/pull/43897#discussion_r1409849131
##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements. See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership. The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied. See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+DataFrame Creation

Review Comment:
```suggestion
Creating DataFrames in PySpark
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+==================
+DataFrame Creation
+==================
+
+.. currentmodule:: pyspark.sql
+
+Creating through `createDataFrame`
+----------------------------------
+
+A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing
+a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`,
+a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`.
+:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`.
+When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.
+
+Creating a PySpark :class:`DataFrame` from a list of lists
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]])
+    >>> df.show()
+    +-----+---+
+    |   _1| _2|
+    +-----+---+
+    |Alice|  1|
+    |  Bob|  5|
+    +-----+---+
+
+
+Creating a PySpark :class:`DataFrame` from a list of tuples

Review Comment:
Do we need to show this example? Which one is better? From lists or tuples? We should provide opinionated ways to create DataFrames.

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating through `createDataFrame`
+----------------------------------
+
+A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing
+a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`,
+a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`.
+:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`.
+When it is omitted, PySpark infers the corresponding schema by taking a sample from the data.

Review Comment:
Personally, I think we don't need this section. We can directly dive into different ways to create data frames and add some explanations there.

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+==================
+DataFrame Creation
+==================
+

Review Comment:
```suggestion
PySpark allows you to create DataFrames in several ways. Let's explore these methods with simple examples.
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` from a list of lists

Review Comment:
```suggestion
Creating a :class:`DataFrame` from Lists
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` from a list of tuples
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)])
+    >>> df.show()
+    +-----+---+
+    |   _1| _2|
+    +-----+---+
+    |Alice|  1|
+    |  Bob|  5|
+    +-----+---+
+
+
+Creating a PySpark :class:`DataFrame` with the explicit schema specified
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> from pyspark.sql.types import *
+    >>> schema = StructType([StructField("name", StringType(), True),
+    ...                      StructField("age", IntegerType(), True)])

Review Comment:
```suggestion
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+    >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema)
+    >>> df.show()
+    +-----+---+
+    | name|age|
+    +-----+---+
+    |Alice|  1|
+    |  Bob|  5|
+    +-----+---+
+
+
+Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified

Review Comment:
Let's combine this with the previous section

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` with the explicit schema specified
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+

Review Comment:
```suggestion
Define a schema and use it to create a DataFrame. A schema describes the column names and types.
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` with the explicit schema specified

Review Comment:
```suggestion
Creating a :class:`DataFrame` with a Specified Schema
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` from a list of lists
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]])
+    >>> df.show()
+    +-----+---+
+    |   _1| _2|

Review Comment:
Maybe we should highlight that when the schema is not provided, the resulting data frame has `_1` and `_2` as the schema (this differs from pandas for example)

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int")
+    >>> df.show()
+    +-----+---+
+    | name|age|
+    +-----+---+
+    |Alice|  1|
+    |  Bob|  5|
+    +-----+---+
+
+
+Creating a PySpark :class:`DataFrame` from a list of dictionaries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> df = spark.createDataFrame([{'name': 'Alice', 'age': 1}])
+    >>> df.show()
+    +---+-----+
+    |age| name|
+    +---+-----+
+    |  1|Alice|
+    +---+-----+
+
+
+Creating a PySpark :class:`DataFrame` from a list of :class:`Row`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+

Review Comment:
```suggestion
Use the Row type to define rows of a DataFrame.
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` from a list of :class:`Row`

Review Comment:
```suggestion
Creating a :class:`DataFrame` from :class:`Row`s
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+.. code-block:: python
+
+    >>> from pyspark.sql import Row
+    >>> Person = Row('name', 'age')
+    >>> df = spark.createDataFrame([Person("Alice", 1), Person("Bob", 5)])
+    >>> df.show()
+    +-----+---+
+    | name|age|
+    +-----+---+
+    |Alice|  1|
+    |  Bob|  5|
+    +-----+---+
+
+
+Creating a PySpark :class:`DataFrame` from a :class:`pandas.DataFrame`

Review Comment:
```suggestion
Creating a :class:`DataFrame` from a :class:`pandas.DataFrame` or a :class:`numpy.ndarray`
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+Creating a PySpark :class:`DataFrame` from a :class:`pandas.DataFrame`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> import pandas as pd
+    >>> df = spark.createDataFrame(pd.DataFrame([[1, 2]]))
+    >>> df.show()
+    +---+---+
+    |  0|  1|
+    +---+---+
+    |  1|  2|
+    +---+---+
+
+
+Creating a PySpark :class:`DataFrame` from a :class:`numpy.ndarray`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: python
+
+    >>> import numpy as np
+    >>> import pandas as pd
+    >>> df = spark.createDataFrame(pd.DataFrame(data=np.array([[1, 2], [3, 4]]),
+    ...                                         columns=['a', 'b']))
+    >>> df.show()
+    +---+---+
+    |  a|  b|
+    +---+---+
+    |  1|  2|
+    |  3|  4|
+    +---+---+
+
+
+Creating through `read.format(...).load(...)`
+---------------------------------------------
+
+Creating a PySpark :class:`DataFrame` by reading existing **json** format file data

Review Comment:
Here we can combine all sections to show examples:
```
- Example with JSON
<code block>
- Example with CSV
<code block>
- Example with Parquet
<code block>
- Example with JDBC
<code block>
```

##########
python/docs/source/user_guide/sql/dataframe_creation.rst:
##########
@@ -0,0 +1,239 @@
[...]
+==================
+DataFrame Creation
+==================
+
+.. currentmodule:: pyspark.sql
+
+Creating through `createDataFrame`
+----------------------------------
+
+A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing
+a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`,
+a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`.
+:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql.types import * + >>> schema = StructType([StructField("name", StringType(), True), + ... StructField("age", IntegerType(), True)]) + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int") + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of dictionaries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + >>> df = spark.createDataFrame([{'name': 'Alice', 'age': 1}]) + >>> df.show() + +---+-----+ + |age| name| + +---+-----+ + | 1|Alice| + +---+-----+ + + +Creating a PySpark :class:`DataFrame` from a list of :class:`Row` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql import Row + >>> Person = Row('name', 'age') + >>> df = spark.createDataFrame([Person("Alice", 1), Person("Bob", 5)]) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a :class:`pandas.DataFrame` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> import pandas as pd + >>> df = spark.createDataFrame(pd.DataFrame([[1, 2]])) + >>> df.show() + +---+---+ + | 0| 1| + +---+---+ + | 1| 2| + +---+---+ + + +Creating a PySpark :class:`DataFrame` from a :class:`numpy.ndarray` Review Comment: We can combine this with the previous section. ########## python/docs/source/user_guide/sql/dataframe_creation.rst: ########## @@ -0,0 +1,239 @@ +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +================== +DataFrame Creation +================== + +.. 
currentmodule:: pyspark.sql + +Creating through `createDataFrame` +---------------------------------- + +A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing +a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`, +a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`. +:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql.types import * + >>> schema = StructType([StructField("name", StringType(), True), + ... StructField("age", IntegerType(), True)]) + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int") + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of dictionaries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([{'name': 'Alice', 'age': 1}]) + >>> df.show() + +---+-----+ + |age| name| + +---+-----+ + | 1|Alice| + +---+-----+ + + +Creating a PySpark :class:`DataFrame` from a list of :class:`Row` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql import Row + >>> Person = Row('name', 'age') + >>> df = spark.createDataFrame([Person("Alice", 1), Person("Bob", 5)]) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a :class:`pandas.DataFrame` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> import pandas as pd + >>> df = spark.createDataFrame(pd.DataFrame([[1, 2]])) + >>> df.show() + +---+---+ + | 0| 1| + +---+---+ + | 1| 2| + +---+---+ + + +Creating a PySpark :class:`DataFrame` from a :class:`numpy.ndarray` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> import numpy as np + >>> import pandas as pd + >>> df = spark.createDataFrame(pd.DataFrame(data=np.array([[1, 2], [3, 4]]), + ... columns=['a', 'b'])) + >>> df.show() + +---+---+ + | a| b| + +---+---+ + | 1| 2| + | 3| 4| + +---+---+ + + +Creating through `read.format(...).load(...)` Review Comment: Reading Data from Files ########## python/docs/source/user_guide/sql/dataframe_creation.rst: ########## @@ -0,0 +1,239 @@ +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. 
See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +================== +DataFrame Creation +================== + +.. currentmodule:: pyspark.sql + +Creating through `createDataFrame` +---------------------------------- + +A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing +a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`, +a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`. +:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql.types import * + >>> schema = StructType([StructField("name", StringType(), True), + ... StructField("age", IntegerType(), True)]) + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int") + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of dictionaries +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Review Comment: ```suggestion Dictionaries with keys as column names can also be used. ``` ########## python/docs/source/user_guide/sql/dataframe_creation.rst: ########## @@ -0,0 +1,239 @@ +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. 
Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +================== +DataFrame Creation +================== + +.. currentmodule:: pyspark.sql + +Creating through `createDataFrame` +---------------------------------- + +A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing +a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`, +a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`. +:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + >>> from pyspark.sql.types import * Review Comment: Let's not use `import *` ``` from pyspark.sql.types import StructType, StructField, StringType, IntegerType ``` ########## python/docs/source/user_guide/sql/dataframe_creation.rst: ########## @@ -0,0 +1,239 @@ +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +================== +DataFrame Creation +================== + +.. currentmodule:: pyspark.sql + +Creating through `createDataFrame` +---------------------------------- + +A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing +a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`, +a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`. +:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql.types import * + >>> schema = StructType([StructField("name", StringType(), True), + ... StructField("age", IntegerType(), True)]) + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int") Review Comment: "name string, age int" Just curious, do we have any documentation on this DDL string format? How to translate a pyspark type into this DDL string format? ########## python/docs/source/user_guide/sql/dataframe_creation.rst: ########## @@ -0,0 +1,239 @@ +.. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + +.. 
http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +================== +DataFrame Creation +================== + +.. currentmodule:: pyspark.sql + +Creating through `createDataFrame` +---------------------------------- + +A PySpark :class:`DataFrame` can be created via :meth:`SparkSession.createDataFrame` typically by passing +a list of lists, tuples, dictionaries and :class:`Row`, a pandas :class:`pandas.DataFrame`, +a NumPy :class:`numpy.ndarray` and an :class:`pyspark.RDD`. +:meth:`SparkSession.createDataFrame` takes the `schema` argument to specify the schema of the :class:`DataFrame`. +When it is omitted, PySpark infers the corresponding schema by taking a sample from the data. + +Creating a PySpark :class:`DataFrame` from a list of lists +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([['Alice', 1], ['Bob', 5]]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of tuples +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)]) + >>> df.show() + +-----+---+ + | _1| _2| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> from pyspark.sql.types import * + >>> schema = StructType([StructField("name", StringType(), True), + ... 
StructField("age", IntegerType(), True)]) + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema) + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` with the explicit DDL-formatted string schema specified +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + >>> df = spark.createDataFrame([('Alice', 1), ('Bob', 5)], schema = "name string, age int") + >>> df.show() + +-----+---+ + | name|age| + +-----+---+ + |Alice| 1| + | Bob| 5| + +-----+---+ + + +Creating a PySpark :class:`DataFrame` from a list of dictionaries Review Comment: ```suggestion Creating a :class:`DataFrame` from Dictionaries ```
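On the DDL schema string question raised above (`"name string, age int"`): as a rough illustration, each field is written as `<column name> <type name>` and the fields are joined with commas. The sketch below is plain Python and does not call PySpark. The helper `to_ddl` and the `DDL_NAMES` table are hypothetical names made up for this note, and the table covers only a handful of types. The short type names are the ones PySpark's `DataType.simpleString()` reports for these classes (for example, `IntegerType().simpleString()` gives `int`).

```python
# Hypothetical lookup table (not a PySpark API):
# PySpark type class name -> type name used in a DDL-formatted schema string.
DDL_NAMES = {
    "StringType": "string",
    "IntegerType": "int",
    "LongType": "bigint",
    "DoubleType": "double",
    "BooleanType": "boolean",
}


def to_ddl(fields):
    """Build a DDL-formatted schema string from (column name, type class name) pairs."""
    return ", ".join(f"{name} {DDL_NAMES[type_name]}" for name, type_name in fields)


ddl = to_ddl([("name", "StringType"), ("age", "IntegerType")])
print(ddl)  # name string, age int
```

The resulting string has the same shape as the value passed to the `schema` argument in the `createDataFrame` example quoted in the diff.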
