Thanks for your time & advice. We will experiment & see which works best
for us: EMR or ECS.
On Tue, May 25, 2021 at 2:39 PM Sean Owen wrote:
> No, the work is happening on the cluster; you just have (say) 100 parallel
> jobs running at the same time. You apply spark.read.parquet to each dir
No, the work is happening on the cluster; you just have (say) 100 parallel
jobs running at the same time. You apply spark.read.parquet to each dir --
from the driver, yes, but spark.read is distributed. At extremes, yes,
managing 1000s of concurrent jobs would challenge the driver. You may
a
Right... but the problem is still the same, no? Those N jobs (aka Futures
or threads) will all be running on the driver, each with its own
SparkSession. Isn't that going to put a lot of burden on one machine? Is
that really distributing the load across the cluster? Am I missing
something?
Would it
What you could do is launch N Spark jobs in parallel from the driver. Each
one would process a directory you supply with spark.read.parquet, for
example. You would just have 10s or 100s of those jobs running at the same
time. You have to write a bit of async code to do it, but it's pretty easy
wit
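A minimal sketch of that async approach, assuming a shared driver-side
SparkSession; the pool size, the paths, and the count() action are
illustrative assumptions, not from the thread:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input list; in the thread this would be the 1000+ dirs.
val dirs = Seq("s3://bucket/cust-001/", "s3://bucket/cust-002/")

// Bounded thread pool: at most 16 jobs are in flight at once, so the
// driver never has to manage thousands of concurrent jobs.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(16))

// Each Future only *submits* a Spark job from the driver; the read
// itself still runs distributed across the executors.
val jobs = dirs.map { dir =>
  Future { spark.read.parquet(dir).count() }
}
Await.result(Future.sequence(jobs), Duration.Inf)

The driver threads stay cheap here: each one just blocks on a job whose
actual work is done by the cluster.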
Here's the use case:
We've a bunch of directories (over 1000), each containing tons of small
files. Each directory is for a different customer, so they are independent
in that respect. We need to merge all the small files in each directory
into one (or a few) compacted file(s) by using a 'coal
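The per-directory compaction being described might look like this sketch;
the bucket paths and the single-output-file choice are illustrative
assumptions:

// Read one customer's directory of small files and rewrite it as a
// single compacted Parquet file (coalesce(1) => one output file).
val in  = "s3://bucket/customers/cust-0042/"   // hypothetical input dir
val out = "s3://bucket/compacted/cust-0042/"   // hypothetical output dir
spark.read.parquet(in)
  .coalesce(1)
  .write.mode("overwrite").parquet(out)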
Why not just read from Spark as normal? Do these files have different or
incompatible schemas?
val df = spark.read.option("mergeSchema", "true").load(listOfPaths: _*)
From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
That's also parallel, of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
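For the "Parquet APIs directly" route, a hedged sketch using parquet-avro
(assuming that dependency is on the classpath; the function name is made
up for illustration):

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

// Reads one Parquet file without any SparkSession, so it is safe to
// call from inside an executor-side function.
def countRowsRaw(file: String): Long = {
  val reader = AvroParquetReader.builder[GenericRecord](new Path(file)).build()
  try {
    var n = 0L
    while (reader.read() != null) n += 1
    n
  } finally reader.close()
}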
On Tue, May 25, 2021 at 12:24 PM Eric Beabes wrote:
> I've a use ca
I've a use case in which I need to read Parquet files in parallel from
1000+ directories. I am doing something like this:
val df = list.toList.toDF()
df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
In the doSomething method, when I try to
I keep getting the following exception when I am trying to read a Parquet
file from a Path on S3 in Spark/Scala. Note: I am running this on EMR.
java.lang.NullPointerException
  at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
  at org.apache.spark