Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Riccardo Ferrari
You need to provide your Python dependencies as well. See http://spark.apache.org/docs/latest/submitting-applications.html and look for --py-files. HTH. On Fri, Jan 8, 2021 at 3:13 PM Mich Talebzadeh wrote: > Hi, > > I have a module in Pycharm which reads data stored in a Bigquery table and > does p
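A minimal sketch of the --py-files idea done from inside the driver rather than on the spark-submit command line; the archive name sparkstuff.zip is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyfiles-demo").getOrCreate()

    # Ship the zipped package to every executor; this is the in-code
    # counterpart of `spark-submit --py-files sparkstuff.zip`.
    spark.sparkContext.addPyFile("sparkstuff.zip")

    import sparkstuff  # now importable on the driver and inside closures/UDFs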

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Riccardo Ferrari
I think Spark checks the PYTHONPATH env variable; you need to provide that. Of course that works in local mode only. On Fri, Jan 8, 2021, 5:28 PM Sean Owen wrote: > I don't see anywhere that you provide 'sparkstuff'? how would the Spark > app have this code otherwise? > > On Fri, Jan 8, 2021 at 10:2

Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
Hi list, I am looking for an efficient solution to apply a training pipeline to each group of a DataFrame.groupBy. This is very easy if you're using a pandas udf (i.e. groupBy().apply()), but I am not able to find the equivalent for a Spark pipeline. The ultimate goal is to fit multiple models, one
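A sketch of the grouped-map pandas_udf route mentioned above (Spark 2.3+); the DataFrame df with columns group, x, y and the scikit-learn model are assumptions standing in for the real per-group training step:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Schema of the per-group output rows.
    schema = StructType([
        StructField("group", StringType()),
        StructField("coef", DoubleType()),
    ])

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def fit_per_group(pdf):
        # pdf is a pandas DataFrame holding every row of a single group.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "coef": [float(model.coef_[0])]})

    results = df.groupBy("group").apply(fit_per_group)

Each group must fit in the memory of a single executor, which is the usual constraint of this approach.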

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Riccardo Ferrari
have to have each read the subset of data > that it's meant to model separately. > > A pandas UDF is a fine solution here, because I assume that implies your > groups aren't that big, so, maybe no need for a Spark pipeline. > > > On Thu, Jan 21, 2021 at 9:20 A

Re: Maximum executors in EC2 Machine

2023-10-24 Thread Riccardo Ferrari
Hi, I would refer to their documentation to better understand the concepts behind cluster overview and submitting applications: - https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types - https://spark.apache.org/docs/latest/submitting-applications.html When us

Re: Read Data From NFS

2017-06-13 Thread Riccardo Ferrari
Hi Ayan, You might be interested in the official Spark docs: https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism and its spark.default.parallelism setting Best, On Mon, Jun 12, 2017 at 6:18 AM, ayan guha wrote: > I understand how it works with hdfs. My question is when hdfs is
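A small sketch of where that setting lives; the values are hypothetical and, per the tuning guide, should aim at roughly 2-3 tasks per CPU core in the cluster:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.default.parallelism", "32")     # RDD operations
             .config("spark.sql.shuffle.partitions", "32")  # DataFrame shuffles
             .getOrCreate())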

Re: UDF percentile_approx

2017-06-14 Thread Riccardo Ferrari
Hi Andres, I can't find the reference; last time I searched for that I found that 'percentile_approx' is only available via the Hive context. You should register a temp table and use it from there. Best, On Tue, Jun 13, 2017 at 8:52 PM, Andrés Ivaldi wrote: > Hello, I'm trying to use percentile_ap
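A minimal sketch of the temp-table route, assuming a DataFrame df with a numeric column named value (on Spark 1.x the call is registerTempTable and the SQL runs on a HiveContext):

    df.createOrReplaceTempView("t")
    spark.sql("SELECT percentile_approx(value, 0.5) AS approx_median FROM t").show()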

Re: Create dataset from dataframe with missing columns

2017-06-15 Thread Riccardo Ferrari
Hi Jason, Is there a reason why you are not adding the desired column before mapping it to a Dataset[CC]? You could just do something like: df = df.withColumn('f2', ) and then do: df.as(CC). Of course your default value can be null: lit(None).cast(to-some-type). Best, On Thu, Jun 15, 2017 at 1:26
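The thread is Scala; the same idea in PySpark with a typed null default (the field name f2 and the string type are assumptions):

    from pyspark.sql.functions import lit

    # Add the missing field with a typed null default before the conversion.
    df = df.withColumn("f2", lit(None).cast("string"))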

Re: Cassandra querying time stamps

2017-06-20 Thread Riccardo Ferrari
Hi, Personally I would inspect how dates are managed. What does your Spark code look like? What does the explain plan say? Does the TimeStamp get parsed the same way? Best, On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog wrote: > Hello, > > I have a table as below > > CREATE TABLE analytics_db.ml_forecas
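A quick way to check, assuming a hypothetical timestamp column ts; the extended plan shows whether the predicate is pushed down to Cassandra or evaluated in Spark after a full scan:

    df.filter(df.ts >= "2017-06-20 00:00:00").explain(True)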

Re: Problem in avg function Spark 1.6.3 using spark-shell

2017-06-25 Thread Riccardo Ferrari
Hi, Looks like you performed an aggregation on the ImageWidth column already. The error itself is quite self-explanatory: Cannot resolve column name "ImageWidth" among (MainDomainCode, *avg(length(ImageWidth))*). The columns available in that DF are MainDomainCode and avg(length(ImageWidth)), so yo
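A sketch of the usual fix (shown in PySpark; the thread used spark-shell): alias the aggregate so downstream code has a stable column name to resolve:

    from pyspark.sql.functions import avg, length, col

    agg = (df.groupBy("MainDomainCode")
             .agg(avg(length(col("ImageWidth"))).alias("avg_image_width")))
    agg.select("MainDomainCode", "avg_image_width").show()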

PySpark saving custom pipelines

2017-07-09 Thread Riccardo Ferrari
Hi list, I have developed some custom Transformers/Estimators in Python around some libraries (scipy); now I would love to persist them for reuse in a streaming app. I am currently aware of this: https://issues.apache.org/jira/browse/SPARK-17025 However I would like to hear from experienced users
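This thread predates it, but Spark 2.3+ added DefaultParamsReadable/DefaultParamsWritable for persisting pure-Python pipeline stages; a minimal sketch with a hypothetical no-op transformer:

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    class IdentityTransformer(Transformer, DefaultParamsReadable,
                              DefaultParamsWritable):
        # No-op stage; save/load come entirely from the two mixins.
        def _transform(self, dataset):
            return dataset

    IdentityTransformer().save("/tmp/identity_stage")
    restored = IdentityTransformer.load("/tmp/identity_stage")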

Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Hi, Are you sure you're feeding the correct data format? I found this conversation that might be useful: http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html Best, On Tue, Jul 11, 2017 at 1:42 PM, mckunkel wrote: > Greetings, > > Foll
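For reference, loading the sample file the linked thread discusses (path hypothetical); the libsvm reader produces exactly two columns, label (double) and features (vector):

    data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
    data.printSchema()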

Re: Testing another Dataset after ML training

2017-07-11 Thread Riccardo Ferrari
Mh, to me it feels like there's some data mismatch. Are you sure you're using the expected Vector (ml vs mllib)? I am not sure you attached the whole Exception, but you might find some more useful details there. Best, On Tue, Jul 11, 2017 at 3:07 PM, mckunkel wrote: > Im not sure why I cannot subscri
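If the mismatch is indeed old mllib vectors flowing into a spark.ml stage, there is a built-in converter (df is assumed to hold a legacy vector column):

    from pyspark.mllib.util import MLUtils

    converted = MLUtils.convertVectorColumnsToML(df)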

Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
; Michael C. Kunkel, USMC, PhD > Forschungszentrum Jülich > Nuclear Physics Institute and Juelich Center for Hadron Physics > Experimental Hadron Structure (IKP-1) www.fz-juelich.de/ikp > > On 11/07/2017 17:21, Riccardo Ferrari wrote: > > Mh, to me feels like there some data mismat

Re: Testing another Dataset after ML training

2017-07-12 Thread Riccardo Ferrari
> Experimental Hadron Structure (IKP-1) > www.fz-juelich.de/ikp > > On Jul 12, 2017, at 09:56, Riccardo Ferrari wrote: > > Hi Michael, > > I don't see any attachment, not sure you can attach files though > > On Tue, Jul 11, 2017 at 10:44 PM, Michael C. Kunkel &l

Re: Spark UI crashes on Large Workloads

2017-07-17 Thread Riccardo Ferrari
Hi, can you share more details? Do you have any exceptions from the driver or the executors? Best, On Jul 18, 2017 02:49, "saatvikshah1994" wrote: > Hi, > > I have a pyspark App which when provided a huge amount of data as input > throws the error explained here sometimes: > https://stackoverflow

Re: Spark UI crashes on Large Workloads

2017-07-18 Thread Riccardo Ferrari
ervice(HttpServlet.java:820) > > > > I noticed this issue talks about something similar and I guess is related: > https://issues.apache.org/jira/browse/SPARK-18838. > > On Tue, Jul 18, 2017 at 2:49 AM, Riccardo Ferrari > wrote: > >> Hi, >> can you share more

Re: Flatten JSON to multiple columns in Spark

2017-07-18 Thread Riccardo Ferrari
What's against df.rdd.map(...) or dataset.foreach()? https://spark.apache.org/docs/2.0.1/api/scala/index.html#org.apache.spark.sql.Dataset@foreach(f:T=>Unit):Unit Best, On Tue, Jul 18, 2017 at 6:46 PM, lucas.g...@gmail.com wrote: > I've been wondering about this for awhile. > > We wanted t
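A sketch of the df.rdd.map route, assuming a hypothetical string column payload holding one JSON object per row:

    import json

    flat_df = (df.rdd
                 .map(lambda row: json.loads(row.payload))
                 .map(lambda d: (d.get("id"), d.get("name")))
                 .toDF(["id", "name"]))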

Re: Logging in RDD mapToPair of Java Spark application

2017-07-29 Thread Riccardo Ferrari
Hi John, The reason you don't see the second sysout line is because it is executed on a different JVM (i.e. driver vs executor). The second sysout line should be available through the executor logs. Check the Executors tab. There are alternative approaches to manage log centralization, however it real
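A tiny illustration of the driver/executor split, assuming a cluster deployment (in local mode both prints land in the same console):

    def tag(x):
        # Runs inside an executor process: this print goes to the executor's
        # stdout log (visible under the Executors tab), not the driver console.
        print("executor side:", x)
        return x

    rdd = spark.sparkContext.parallelize(range(4))
    rdd.map(tag).count()
    print("driver side")  # this one shows up in the driver console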

Re: SPARK Issue in Standalone cluster

2017-07-31 Thread Riccardo Ferrari
Hi Gourav, The issue here is the location where you're trying to write/read from : /Users/gouravsengupta/Development/spark/sparkdata/test1/p... When dealing with clusters all the paths and resources should be available to all executors (and driver), and that is the reason why you generally use HDFS, S

PySpark Streaming S3 checkpointing

2017-08-02 Thread Riccardo Ferrari
Hi list! I am working on a pyspark streaming job (ver 2.2.0) and I need to enable checkpointing. At a high level my Python script goes like this: class StreamingJob(): def __init__(..): ... sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key',) sparkContext._jsc.hadoopConfigur
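A minimal sketch of checkpoint-aware startup for a DStream job; the bucket/path is hypothetical and the S3A credentials are assumed to be set as in the snippet above:

    from pyspark.streaming import StreamingContext

    CHECKPOINT = "s3a://my-bucket/checkpoints/my-job"

    def create_context():
        ssc = StreamingContext(spark.sparkContext, 10)  # 10s batch interval
        ssc.checkpoint(CHECKPOINT)
        # ... build the DStream graph here ...
        return ssc

    # Recovers from the checkpoint if one exists, else calls create_context().
    ssc = StreamingContext.getOrCreate(CHECKPOINT, create_context)
    ssc.start()
    ssc.awaitTermination()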

Re: PySpark Streaming S3 checkpointing

2017-08-03 Thread Riccardo Ferrari
context is ignored (or at least this happens in python). What do you (or any in the list) think? Thanks, On Wed, Aug 2, 2017 at 6:04 PM, Steve Loughran wrote: > > On 2 Aug 2017, at 10:34, Riccardo Ferrari wrote: > > Hi list! > > I am working on a pyspark streaming

PySpark Streaming keeps dying

2017-08-05 Thread Riccardo Ferrari
Hi list, I have Spark 2.2.0 in standalone mode and Python 3.6. It is a very small testing cluster with two nodes. I am running (trying to run) a streaming job that simply reads from Kafka, applies an ML model and stores the result back into Kafka. The job is run with the following parameters: "--conf spark.cores.max=2 --

Re: Spark Streaming job statistics

2017-08-08 Thread Riccardo Ferrari
Hi, Have you tried to check the "Streaming" tab menu? Best, On Tue, Aug 8, 2017 at 4:15 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am running spark streaming job which receives data from azure iot hub. I > am not sure if the connection was successful and receving any

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Riccardo Ferrari
That depends on your Spark version; have you considered the Dataset API? You can do something like: val df1 = rdd1.toDF("userid") val listRDD = sc.parallelize(listForRule77) val listDF = listRDD.toDF("data") df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false) +--+-

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Riccardo Ferrari
Hi Brian, Very nice work you have done! WRT your issue: Can you clarify how you are adding the kafka dependency when using Jupyter? The ClassNotFoundException really tells you about the missing dependency. A bit different is the IllegalArgumentException error; that is simply because you are not t
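One common way to pull the Kafka package into a notebook-started session; the coordinates are hypothetical and must match the Spark/Scala build, and the variable must be set before the first SparkContext is created:

    import os

    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 "
        "pyspark-shell"
    )

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()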

Re: Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Riccardo Ferrari
Hi Aakash, What I see in the picture seems correct. Spark (pyspark) is reading your F2 cell as multi-line text. Where are the nulls you're referring to? You might find the pyspark.sql.functions.regexp_replace
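A sketch of the regexp_replace suggestion, assuming the multi-line text sits in a column named F2:

    from pyspark.sql.functions import regexp_replace

    # Collapse embedded line breaks into single spaces.
    df = df.withColumn("F2", regexp_replace("F2", "[\\r\\n]+", " "))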

Re: How to convert Row to JSON in Java?

2017-09-10 Thread Riccardo Ferrari
Hi Kant, You can check getValuesMap. I found this post useful; it is in Scala but should be a good starting point. An alternative appro
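The thread is Java/Scala; for reference, two rough PySpark analogues of the same idea:

    import json

    # Whole-DataFrame route: one JSON string per row.
    json_rows = df.toJSON().take(5)

    # Per-Row route, roughly what getValuesMap + toString gives in Scala.
    one_row_json = json.dumps(df.first().asDict())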

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
1.take(10) > ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}', '{"id": 5}'] > > > > > On Mon, Sep 11, 2017 at 4:22 AM, Riccardo Ferrari > wrote: > >&g

Re: How to convert Row to JSON in Java?

2017-09-11 Thread Riccardo Ferrari
> JavaConverters.asScalaIteratorConverter(Arrays.asList(fieldNames).iterator()).asScala().toSeq(); > String json = row.getValuesMap(fieldNamesSeq).toString(); > > > On Mon, Sep 11, 2017 at 12:39 AM, Riccardo Ferrari > wrote: > >> Hi Ayan, yup that works very well however I

Re: [Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-18 Thread Riccardo Ferrari
Hi Jean, What does the master UI say? http://10.0.100.81:8080 Do you have enough resources available, or is there any running context that is depleting all your resources? Are your workers registered and alive? How much memory each? How many cores each? Best On Mon, Sep 18, 2017 at 11:24 PM,

Re: How to pass sparkSession from driver to executor

2017-09-21 Thread Riccardo Ferrari
It depends on your use-case; however, broadcasting could be a better option. On Thu, Sep 21, 2017 at 2:03 PM, Chackravarthy Esakkimuthu < chaku.mi...@gmail.com> wrote: > Hi, > > I want to know how to pass sparkSession
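A minimal broadcast sketch; the lookup dict and the column name key are hypothetical:

    lookup = {"a": 1, "b": 2}
    b_lookup = spark.sparkContext.broadcast(lookup)

    def enrich(row):
        # b_lookup.value is a read-only copy cached on each executor.
        return (row.key, b_lookup.value.get(row.key))

    df.rdd.map(enrich).take(10)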

Re: jar file problem

2017-10-19 Thread Riccardo Ferrari
This is a good place to start from: https://spark.apache.org/docs/latest/submitting-applications.html Best, On Thu, Oct 19, 2017 at 5:24 PM, Uğur Sopaoğlu wrote: > Hello, > > I have a very easy problem. When I run a spark job, I must copy the jar file to > all worker nodes. Is there any way to do s
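One way to avoid the manual copying, assuming the jar is published at a location every node can reach (the HDFS path is hypothetical); spark-submit --jars achieves the same from the command line:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars", "hdfs:///libs/my-etl.jar")
             .getOrCreate())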

Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris, I believe the methods partitionBy and mapPartitions are specific to RDDs while you're talking about DataFrames
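For reference, the two APIs side by side (the column name user_id and the partition count are hypothetical):

    # RDD API: partitionBy requires a key-value RDD.
    pairs = df.rdd.map(lambda r: (r.user_id, r)).partitionBy(8)

    # DataFrame API: the rough equivalent is repartition on a column.
    df8 = df.repartition(8, "user_id")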

Re: Connection issue with AWS S3 from PySpark 2.3.1

2018-12-21 Thread Riccardo Ferrari
Hi Aakash, Can you share how you are adding those jars? Are you using the package method? I assume you're running in a cluster, and those dependencies might not have been properly distributed. How are you submitting your app? What kind of resource manager are you using: standalone, YARN, ...? Best, O

I have trained a ML model, now what?

2019-01-22 Thread Riccardo Ferrari
Hi list! I am writing here to hear about your experiences putting Spark ML models into production at scale. I know it is a very broad topic with many different faces depending on the use-case, requirements, user base and whatever is involved in the task. Still I'd like to open a thread about th

Re: How to query on Cassandra and load results in Spark dataframe

2019-01-23 Thread Riccardo Ferrari
Hi Soheil, You should be able to apply some filter transformation. Spark is lazily evaluated and the actual loading from Cassandra happens only when an action triggers it. Find more here: https://spark.apache.org/docs/2.3.2/rdd-programming-guide.html#rdd-operations The Spark Cassandra connector supports filters
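A sketch of the lazy read-plus-filter, with hypothetical keyspace/table/column names; explain() shows pushed-down predicates in the physical plan:

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(table="events", keyspace="analytics")
          .load()
          .filter("event_date = '2019-01-01'"))

    df.explain()  # nothing is read from Cassandra until an action runs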

Re: I have trained a ML model, now what?

2019-01-23 Thread Riccardo Ferrari
ayers like Uber trying to make Open Source better, thanks! On Tue, Jan 22, 2019 at 5:24 PM Felix Cheung wrote: > About deployment/serving > > SPIP > https://issues.apache.org/jira/browse/SPARK-26247 > > > ------ > *From:* Riccardo Ferrari > *Se

PySpark OOM when running PCA

2019-02-07 Thread Riccardo Ferrari
Hi list, I am having trouble running a PCA with pyspark. I am trying to reduce the matrix size since my features after OHE get 40k wide. Spark 2.2.0 stand-alone (Oracle JVM), pyspark 2.2.0 from a docker (OpenJDK). I'm starting the Spark session from the notebook, however I make sure to: - PYSPA
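For driver-heavy jobs like PCA started from a notebook, the driver heap has to be set before the JVM starts; a sketch with a hypothetical value:

    import os

    # Must run before the first SparkContext is created in this process.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"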

Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Hi list, I am trying to understand if it makes sense to leverage Spark as an enabling platform for Deep Learning. My open questions to you are: - Do you use Apache Spark in your DL pipelines? - How do you use Spark for DL? Is it just a stand-alone stage in the workflow (ie data preparation

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Sengupta > > Date: May 4, 2019 at 10:24:29 AM > To: Riccardo Ferrari > Cc: User > Subject: Re: Deep Learning with Spark, what is your experience? > > Try using MxNet and Horovod directly as well (I think that MXNet is worth > a try as well): > 1. > https://medi

Re: Deep Learning with Spark, what is your experience?

2019-05-05 Thread Riccardo Ferrari
g part of the pipeline (afaik) so it is >>>> limited to data ingestion and transforms (ETL). It therefore is optional >>>> and other ETL options might be better for you. >>>> >>>> Most of the technologies @Gourav mentions have their own scaling based >

Spark standalone and Pandas UDF from custom archive

2019-05-24 Thread Riccardo Ferrari
Hi list, I am having an issue distributing a pandas_udf to my workers. I'm using Spark 2.4.1 in standalone mode. *Submit*: - via SparkLauncher as a separate process. I do add the py-files with the self-executable zip (with .py extension) before launching the application. - The whole applica

Re: Spark standalone and Pandas UDF from custom archive

2019-05-25 Thread Riccardo Ferrari
thout requiring any external tool (jobserver/livy/...) Best, On Fri, May 24, 2019 at 10:24 PM Riccardo Ferrari wrote: > Hi list, > > I am having an issue ditributing a pandas_udf to my workers. > I'm using Spark 2.4.1 in standalone mode. > *Submit*: > >- via SparkLa

Re: Spark standalone and Pandas UDF from custom archive

2019-05-25 Thread Riccardo Ferrari
our job modules and associated library code into a > single zip file (jobs.py). You can pass this zip file to spark submit via > pyfiles and use the generic launcher as the driver: > > spark-submit --py-files jobs.py driver.py job-module > > > > > > On 25 May 2019, at 10:

Re: best docker image to use

2019-06-11 Thread Riccardo Ferrari
Hi Marcelo, I'm used to working with https://github.com/jupyter/docker-stacks. There's a Scala+Jupyter option too, though there might be better options with Zeppelin. HTH On Tue, 11 Jun 2019, 11:52 Marcelo Valle, wrote: > Hi, > > I would like to run spark shell + scala on a docker environmen

Re: spark standalone mode problem about executor add and removed again and again!

2019-07-18 Thread Riccardo Ferrari
I would also check firewall rules. Is communication allowed on all the required port ranges and hosts? On Thu, Jul 18, 2019 at 3:56 AM Amit Sharma wrote: > Do you have dynamic resource allocation enabled? > > > On Wednesday, July 17, 2019, zenglong chen > wrote: > >> Hi,all, >> My stan

Cluster sizing

2019-09-13 Thread Riccardo Ferrari
Hi list, Is there any documentation about how to approach cluster sizing? How do you approach a new deployment? Thanks,

Re: Compute the Hash of each row in new column

2020-02-28 Thread Riccardo Ferrari
Hi Chetan, Would the SQL function `hash` do the trick for your use-case? Best, On Fri, Feb 28, 2020 at 1:56 PM Chetan Khatri wrote: > Hi Spark Users, > How can I compute the hash of each row and store it in a new column of a Dataframe? > Could someone help me. > > Thanks >
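A sketch of both options: hash is the built-in 32-bit Murmur3, and the sha2 variant (with an arbitrary "||" separator) is a stronger alternative:

    from pyspark.sql.functions import hash, sha2, concat_ws

    df = df.withColumn("row_hash", hash(*df.columns))

    # Stronger digest: stringify the row first, then SHA-256 it.
    df = df.withColumn("row_sha256",
                       sha2(concat_ws("||", *df.columns), 256))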