for and launch a job per directory.
What am I missing?
Boris
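For what it's worth, a minimal sketch of the "one job per directory" idea Boris describes, using plain Scala Futures on the driver. The helper name `runJobPerDirectory` is my own invention, not from this thread; with Spark, `job` would be something like `dir => spark.read.parquet(dir).count()`, and the SparkSession is only ever touched on the driver (unlike inside `df.foreach`, which runs on executors):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper: launch one "job" per directory concurrently
// from the driver and wait for all of them to finish.
def runJobPerDirectory[T](dirs: Seq[String])(job: String => T): Seq[T] = {
  val futures = dirs.map(dir => Future(job(dir)))
  Await.result(Future.sequence(futures), Duration.Inf)
}

// Stand-in job (no cluster needed) just to show the shape:
val lengths = runJobPerDirectory(Seq("/data/a", "/data/b"))(_.length)
```

Jobs submitted this way run concurrently within one application; whether they actually overlap on the cluster depends on the scheduler configuration.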
From: Eric Beabes
Sent: Wednesday, 26 May 2021 0:34
To: Sean Owen
Cc: Silvio Fiorito ; spark-user
Subject: Re: Reading parquet files in parallel on the cluster
Right... but the problem is still the same, no? Those N Jobs (aka
> was, this will distribute the load amongst Spark Executors &
> will scale better. But this throws the NullPointerException shown in the
> original email.
>
> Is there a better way to do this?
>
> On Tue, May 25, 2021 at 1:10 PM Silvio Fiorito <
> silvio.fior...@granturing.com> wrote:
Why not just read from Spark as normal? Do these files have different or
incompatible schemas?
val df = spark.read.option("mergeSchema", "true").load(listOfPaths)
From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
That's also parallel, of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
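A minimal sketch of the driver-side alternatives Sean is pointing at (the names `spark` and `listOfDirs` are assumptions, not from the thread). The key point: the SparkSession must only be used on the driver, never inside a function shipped to executors such as `df.foreach`:

```scala
// Sketch only; assumes a running SparkSession `spark` and a
// Seq[String] of directory paths `listOfDirs`.

// Option 1: a single read over all directories; Spark itself
// parallelizes the file scan across executors.
val all = spark.read.parquet(listOfDirs: _*)

// Option 2: loop on the driver, submitting one job per directory.
// `spark` is only used on the driver here, so no NullPointerException.
listOfDirs.foreach { dir =>
  val df = spark.read.parquet(dir)
  // ... process df ...
}
```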
On Tue, May 25, 2021 at 12:24 PM Eric Beabes
wrote:
I've a use case in which I need to read Parquet files in parallel from
1000+ directories. I am doing something like this:
val df = list.toList.toDF()
df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
In the doSomething method, when I try to