[
https://issues.apache.org/jira/browse/SPARK-13570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shubhanshu Mishra updated SPARK-13570:
--------------------------------------
Description:
Running the following code to store the data for each year and pos in a separate
folder takes a huge amount of time for a very large dataframe (>36 hours for 60%
of the work).
{code}
## IPYTHON was started using the following command:
# IPYTHON=1 "$SPARK_HOME/bin/pyspark" --driver-memory 50g
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
conf = SparkConf()
conf.setMaster("local[30]")
conf.setAppName("analysis")
conf.set("spark.local.dir", "./tmp")
conf.set("spark.executor.memory", "50g")
conf.set("spark.driver.maxResultSize", "5g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("csv").options(header=False, inferschema=True,
delimiter="\t").load("out/new_features")
df = df.selectExpr(*("%s as %s" % (df.columns[i], k) for i,k in
enumerate(columns)))
# year can take values from [1902,2015]
# pos takes integer values from [-1,0,1,2]
# df is a dataframe with 20 columns and 1 billion rows
# Running on Machine with 32 cores and 500 GB RAM
df.write.save("out/model_input_partitioned", format="csv", partitionBy=["year",
"pos"], delimiter="\t")
{code}
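For scale, here is a rough sketch of how many output folders the write above implies, using only the value ranges stated in the code comments. The Hive-style `year=.../pos=...` folder naming is how Spark lays out `partitionBy` output. The helper `partition_dir` and the worst-case estimate are illustrative assumptions, not something measured in this report, and the report does not establish that this is what makes the job slow.

```python
from itertools import product

# Value ranges taken from the comments in the report.
years = range(1902, 2016)        # 1902..2015 inclusive
pos_values = [-1, 0, 1, 2]

def partition_dir(base, year, pos):
    """Hypothetical helper: folder expected for one (year, pos) combination."""
    return "%s/year=%d/pos=%d" % (base, year, pos)

dirs = [partition_dir("out/model_input_partitioned", y, p)
        for y, p in product(years, pos_values)]

n_dirs = len(dirs)               # 114 years * 4 pos values
local_tasks = 30                 # from conf.setMaster("local[30]")

# Assumed worst case: every concurrently running task holds rows for
# every (year, pos) combination and so writes into every folder.
max_open_dirs = n_dirs * local_tasks

print(n_dirs, max_open_dirs)
```

If the input rows are not grouped by year and pos, each write task may touch many of these folders at once. That is the assumption behind the worst-case figure.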
Curr
> pyspark save with partitionBy is very slow
> ------------------------------------------
>
> Key: SPARK-13570
> URL: https://issues.apache.org/jira/browse/SPARK-13570
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Reporter: Shubhanshu Mishra
> Labels: dataframe, partitioning, pyspark, save