Hello!
I would like to use the logging mechanism provided by log4j, but I'm getting the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.log4j.Logger
The code (and the problem) that I'm using resembles a
factory object that you can create your log object from.
> On Mon, May 25, 2015 at 11:05 PM, Spico Florin wrote:
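The usual workaround for the NotSerializableException above is to keep the logger out of the serialized closure and recreate it lazily on each executor (in Scala, typically an object or a `@transient lazy val`; the factory object mentioned in the thread is the same idea). A minimal sketch of the pattern in Python, with `pickle` standing in for Spark's closure serialization; the class and names here are illustrative, not from the thread:

```python
import logging
import pickle

class LazyLogger:
    """Holds a logger without serializing it: the logger is excluded
    from the pickled state and recreated lazily after deserialization,
    so each worker builds its own local logger."""

    def __init__(self, name):
        self.name = name
        self._logger = None  # created on first use, never pickled

    @property
    def logger(self):
        if self._logger is None:
            self._logger = logging.getLogger(self.name)
        return self._logger

    def __getstate__(self):
        # Drop the non-serializable logger before pickling.
        state = self.__dict__.copy()
        state["_logger"] = None
        return state

wrapper = LazyLogger("worker")
wrapper.logger.info("before shipping")
clone = pickle.loads(pickle.dumps(wrapper))  # survives serialization
clone.logger.info("after shipping")          # logger recreated lazily
```

The same trick is what `@transient lazy val log = Logger.getLogger(...)` achieves on the JVM: the field is skipped during serialization and re-initialized on first access on the executor.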
Hi!
I had a similar issue when the user that submitted the job to the Spark
cluster didn't have permission to write to HDFS. If you have the HDFS
GUI you can check which users exist and what permissions they have. You can
also check in the HDFS browser:
http://stackoverflow.com/questions/27996034/opening-a-hdfs-fi
Hello!
I'm also looking for unit testing of Spark Java applications. I've seen the
great work done in spark-testing-base, but it seemed to me that I could not
use it for Spark Java applications.
Are only Spark Scala applications supported?
Thanks.
Regards,
Florin
On Wed, May 23, 2018 at 8:07 AM, umarg
> ...with Java applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin wrote:
Hello!
I have a parquet file of 60 MB representing 10 million records.
When I read this file using Spark 2.3.0 with the
configuration spark.sql.files.maxPartitionBytes=1024*1024*2 (=2MB), I got 29
partitions as expected.
Code:
sqlContext.setConf("spark.sql.files.maxPartitionBytes",
Long
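For a back-of-the-envelope check of the partition count, dividing the file size by the configured split size gives the rough number of input splits. This sketch ignores `spark.sql.files.openCostInBytes` and the default parallelism, which Spark also folds into the actual split size, so it is an estimate only:

```python
import math

def estimated_partitions(file_bytes, max_partition_bytes):
    """Rough estimate of input partitions when a file is split by
    spark.sql.files.maxPartitionBytes (ignoring openCostInBytes and
    defaultParallelism, which Spark also considers)."""
    return math.ceil(file_bytes / max_partition_bytes)

# 60 MB file, 2 MB target split size -> about 30 splits; Parquet
# row-group boundaries can shift the real count slightly (the thread
# above observed 29).
print(estimated_partitions(60 * 1024 * 1024, 2 * 1024 * 1024))
```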
Hello!
I would like to know if there is any feature in Spark DataFrame that, when
writing data to an Impala table, also creates that table when it
was not previously created in Impala.
For example, the code:
myDataframe.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, "books",
connectio
Hi!
I would like to use tensorframes in my pyspark notebook.
I have performed the following:
1. In the Spark interpreter, added a new repository
http://dl.bintray.com/spark-packages/maven
2. In the Spark interpreter, added the
dependency databricks:tensorframes:0.2.9-s_2.11
3. pip install tensorfram
If yes, how?
I look forward to your answers.
Regards,
Florin
On Thu, Aug 9, 2018 at 3:52 AM, Jeff Zhang wrote:
>
> Make sure you use the correct python which has tensorframe installed. Use
> PYSPARK_PYTHON
> to configure the python
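Following the advice above, PYSPARK_PYTHON must point at the interpreter where tensorframes was pip-installed. A sketch of what that configuration could look like (the paths below are placeholders, not from the thread):

```shell
# Hypothetical paths: point Spark at the Python environment where
# tensorframes was pip-installed, for both executors and the driver.
export PYSPARK_PYTHON=/path/to/env/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/env/bin/python
```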
Hello!
Can you please share some code/thoughts on how to publish data from a
dataframe to RabbitMQ?
Thanks.
Regards,
Florin
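One common pattern for this is to serialize each row to JSON and publish from `foreachPartition`, opening one channel per partition. A sketch of the per-partition logic, with the channel injected so it can be exercised without a broker; the exchange name and the `FakeChannel` stand-in are assumptions, and in a real job the channel would come from a library such as pika:

```python
import json

def publish_partition(rows, channel, exchange="df_events"):
    """Publish each row of a DataFrame partition as a JSON message.
    `channel` is anything with basic_publish(exchange, routing_key, body),
    e.g. a pika BlockingChannel; it is injected here so the logic can be
    tested without a running RabbitMQ broker."""
    count = 0
    for row in rows:
        channel.basic_publish(exchange=exchange,
                              routing_key="",
                              body=json.dumps(row))
        count += 1
    return count

# In Spark this would be wired up roughly as:
#   df.toJSON().foreachPartition(lambda it: publish_partition(it, open_channel()))

class FakeChannel:  # illustrative stand-in for a pika channel
    def __init__(self):
        self.messages = []
    def basic_publish(self, exchange, routing_key, body):
        self.messages.append(body)

ch = FakeChannel()
publish_partition([{"id": 1}, {"id": 2}], ch)
```

Opening the connection inside `foreachPartition` (rather than on the driver) matters because connection objects are not serializable.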
Hello!
I would like to use Spark structured streaming integrated with Kafka
in the way described here:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
but I got the following issue:
Caused by: org.apache.spark.sql.AnalysisException: Failed to find data
sourc
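A "Failed to find data source" error for Kafka usually means the Kafka connector is not on the classpath, since it ships separately from Spark core. A sketch of the fix, assuming a Spark 2.4 / Scala 2.11 cluster (adjust the versions to match your installation):

```shell
# Add the structured-streaming Kafka connector at submit time;
# the artifact version must match your Spark and Scala versions.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  my_app.py
```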
On Tue, Jul 30, 2019 at 4:38 PM Spico Florin wrote:
Hello!
I have a use case where I have to apply multiple already-trained models
(e.g. M1, M2, ..., Mn) on the same Spark stream (fetched from Kafka).
The models were trained using the isolation forest algorithm from here:
https://github.com/titicaca/spark-iforest
I have found something similar
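A straightforward way to structure this is to hold the trained models in a (broadcast) collection and apply each one to every record of the stream. A minimal sketch of the fan-out logic; the `ThresholdModel` stand-in is invented for illustration and only mimics the `predict` interface of a real model:

```python
def score_all(models, record):
    """Apply several already-trained models to one record and collect
    their scores keyed by model name. Each model only needs a
    predict(record) -> score method, so the same loop works whether the
    models came from spark-iforest or elsewhere."""
    return {name: m.predict(record) for name, m in models.items()}

# Illustrative stand-ins for trained anomaly models:
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, record):
        return 1 if record["value"] > self.threshold else 0

models = {"M1": ThresholdModel(10), "M2": ThresholdModel(100)}
print(score_all(models, {"value": 42}))  # {'M1': 1, 'M2': 0}
```

In a streaming job this would typically run inside a `map` over the micro-batch, with the models broadcast once rather than shipped per task.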
Hi!
What file system are you using: EMRFS or HDFS?
Also, what memory are you using for the reducer?
On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote:
> I ran into the same issue processing 20TB of data, with 200k tasks on both
> the map and reduce sides. Reducing to 100k tasks each resolved the
Hello!
Maybe you can find more information on the same issue reported here:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSourceProvider.html
validateGeneralOptions makes sure that group.id has not been specified and
reports an IllegalArgumentException ot
Hello!
I received the following errors in the workerLog.log files:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
-> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with
[akka.tcp://sparkExecutor@stream4:47929]] [
akka.remote.EndpointAssociationE
Hello!
I'm using spark-csv 2.10 with Java from the Maven repository:
  <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>0.1.1</version>
  </dependency>
I would like to use Spark-SQL to filter out my data. I'm using the
following code:
JavaSchemaRDD cars = new JavaCsvParser().withUseHeader(true).csvFile(
sqlContext, logFile);
cars.registerAsTabl
Hi!
I'm new to Spark. I have a case study where the data is stored in CSV
files. These files have headers with more than 1000 columns. I would like
to know the best practices for parsing them, and in particular the
following points:
1. Getting and parsing all the files from a folder
2. Wh
Hello!
I'm a newbie to Spark and I have the following case study:
1. A client sends the following data every 100 ms:
{uniqueId, timestamp, measure1, measure2}
2. Every 30 seconds I would like to correlate the data collected in the
window with some predefined double vector pattern for each given key.
Hello!
I've read the documentation about the Spark architecture and I have the
following questions:
1. How many executors can be on a single worker process (JVM)?
2. Should I think of an executor like a Java thread executor where the pool
size is equal to the number of the given cores (set up by the
SPARK
on your machine:
> ps aux | grep spark.deploy => will show master, worker (coordinator) and
> driver JVMs
> ps aux | grep spark.executor => will show the actual worker JVMs
>
> 2015-02-25 14:23 GMT+01:00 Spico Florin :
Hello!
I would like to know what is the optimal solution for getting the header
from a CSV file with Spark. My approach was:
def getHeader(data: RDD[String]): String =
  data.zipWithIndex().filter(_._2 == 0).map(_._1).take(1).mkString("")
Thanks.
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin wrote:
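A simpler alternative to the `zipWithIndex` approach above is to take the first element directly: on an RDD this is `data.first()`, which only reads the first partition instead of indexing the whole dataset. A tiny Python sketch of the same idea on any line-oriented source:

```python
def get_header(lines):
    """Return the first line of a line-oriented dataset. On a Spark RDD
    the equivalent is simply data.first(), which avoids the full
    zipWithIndex pass over the data."""
    return next(iter(lines))

print(get_header(["C1;C2;C3", "11;22;33"]))  # C1;C2;C3
```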
Hello!
I have a CSV file that has the following content:
C1;C2;C3
11;22;33
12;23;34
13;24;35
What is the best approach to use Spark (API, MLLib) for achieving the
transpose of it?
C1 11 12 13
C2 22 23 24
C3 33 34 35
I look forward to your solutions and suggestions (some Scala code will be
rea
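Since a table of this size fits comfortably on the driver, one option is to `collect()` the rows and transpose locally; the distributed alternative (flatMapping each cell to `(columnIndex, (rowIndex, value))` and grouping by column) is only needed for data that does not fit in memory. A sketch of the local version on the sample above:

```python
def transpose(rows):
    """Transpose a small row-oriented table; with data this size the
    whole file fits on the driver, so collect() plus a local transpose
    is often simpler than a distributed one."""
    return [list(col) for col in zip(*rows)]

lines = ["C1;C2;C3", "11;22;33", "12;23;34", "13;24;35"]
rows = [line.split(";") for line in lines]
for col in transpose(rows):
    print(" ".join(col))
```

This prints exactly the layout asked for in the message: `C1 11 12 13`, `C2 22 23 24`, `C3 33 34 35`.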
Hello!
The result of correlation in Spark MLlib is of type
org.apache.spark.mllib.linalg.Matrix. (see
http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations)
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the
> You can turn the Matrix into an Array with .toArray and then:
> 1- Write it using Scala/Java IO to the local disk of the driver
> 2- parallelize it and use .saveAsTextFile
>
> 2015-04-15 14:16 GMT+02:00 Spico Florin :
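One detail worth noting when following the `.toArray` advice: MLlib's dense matrices store their values in column-major order, so the flat array has to be regrouped before writing one matrix row per line. A sketch of that regrouping in plain Python (the 2x2 sample values are invented for illustration):

```python
def column_major_to_rows(values, num_rows, num_cols):
    """Rebuild rows from a flat column-major array (the layout MLlib's
    DenseMatrix uses), so each line written to disk is one matrix row."""
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# 2x2 correlation matrix stored column-major: [m00, m10, m01, m11]
flat = [1.0, 0.5, 0.5, 1.0]
for row in column_major_to_rows(flat, 2, 2):
    print(",".join(str(v) for v in row))
```

The resulting rows can then be written with local Scala/Java IO on the driver, or parallelized and saved with `saveAsTextFile` as the reply suggests.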
Hello!
I'm a newbie in Spark and I would like to understand some basic mechanisms
of how it works behind the scenes.
I have attached the lineage of my RDD and I have the following questions:
1. Why do I have 8 stages instead of 5? From the book Learning from Spark
(Chapter 8 -http://bit.ly/1E0Hah7), I cou
Hello!
I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering if these partitions preserve the initial order of
the data in the file.
As an example, if I have an HDFS (myTextFile) file that has these splits:
split 0-> line 1, ..., line k
split 1->line k+1,..., li
I have tested the sortByKey method with the following code and I have
observed that it triggers a new job when called. I could find this
neither in the API nor in the code. Is this an intended behavior? For
example, the RDD zipWithIndex method API specifies that it will trigger a
new job. But what