for and launch a job per directory.
What am I missing?
Boris
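For what it's worth, a minimal sketch of the "one job per directory" idea Boris describes, using plain Scala Futures on the driver. The helper name `runJobPerDirectory` is my own invention, not from this thread; with Spark, `job` would be something like `dir => spark.read.parquet(dir).count()`, and the SparkSession is only ever touched on the driver (unlike inside `df.foreach`, which runs on executors):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper: launch one "job" per directory concurrently
// from the driver and wait for all of them to finish.
def runJobPerDirectory[T](dirs: Seq[String])(job: String => T): Seq[T] = {
  val futures = dirs.map(dir => Future(job(dir)))
  Await.result(Future.sequence(futures), Duration.Inf)
}

// Stand-in job (no cluster needed) just to show the shape:
val lengths = runJobPerDirectory(Seq("/data/a", "/data/b"))(_.length)
```

Jobs submitted this way run concurrently within one application; whether they actually overlap on the cluster depends on the scheduler configuration.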
From: Eric Beabes
Sent: Wednesday, 26 May 2021 0:34
To: Sean Owen
Cc: Silvio Fiorito ; spark-user
Subject: Re: Reading parquet files in parallel on the cluster
Right... but the problem is still the same, no? Those N Jobs (aka
> was, this will distribute the load amongst Spark Executors &
> will scale better. But this throws the NullPointerException shown in the
> original email.
>
> Is there a better way to do this?
>
> On Tue, May 25, 2021 at 1:10 PM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
Why not just read from Spark as normal? Do these files have different or
incompatible schemas?
val df = spark.read.option("mergeSchema", "true").load(listOfPaths)
From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
That's also parallel, of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
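A minimal sketch of the driver-side alternatives Sean is pointing at (the names `spark` and `listOfDirs` are assumptions, not from the thread). The key point: the SparkSession must only be used on the driver, never inside a function shipped to executors such as `df.foreach`:

```scala
// Sketch only; assumes a running SparkSession `spark` and a
// Seq[String] of directory paths `listOfDirs`.

// Option 1: a single read over all directories; Spark itself
// parallelizes the file scan across executors.
val all = spark.read.parquet(listOfDirs: _*)

// Option 2: loop on the driver, submitting one job per directory.
// `spark` is only used on the driver here, so no NullPointerException.
listOfDirs.foreach { dir =>
  val df = spark.read.parquet(dir)
  // ... process df ...
}
```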
On Tue, May 25, 2021 at 12:24 PM Eric Beabes
wrote:
I've a use case in which I need to read Parquet files in parallel from
1000+ directories. I am doing something like this:
val df = list.toList.toDF()
df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
In the doSomething method, when I try to