Hello!
I would like to use the logging mechanism provided by log4j, but I'm getting the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.log4j.Logger
The code (and the problem) that I'm using resembles a
factory object that you can create your log object from.
> On Mon, May 25, 2015 at 11:05 PM, Spico Florin wrote:
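The usual workaround for the NotSerializableException above is to keep the logger out of the serialized closure and recreate it lazily on each executor (in Scala, typically an object or a `@transient lazy val`; the factory object mentioned in the thread is the same idea). A minimal sketch of the pattern in Python, with `pickle` standing in for Spark's closure serialization; the class and names here are illustrative, not from the thread:

```python
import logging
import pickle

class LazyLogger:
    """Holds a logger without serializing it: the logger is excluded
    from the pickled state and recreated lazily after deserialization,
    so each worker builds its own local logger."""

    def __init__(self, name):
        self.name = name
        self._logger = None  # created on first use, never pickled

    @property
    def logger(self):
        if self._logger is None:
            self._logger = logging.getLogger(self.name)
        return self._logger

    def __getstate__(self):
        # Drop the non-serializable logger before pickling.
        state = self.__dict__.copy()
        state["_logger"] = None
        return state

wrapper = LazyLogger("worker")
wrapper.logger.info("before shipping")
clone = pickle.loads(pickle.dumps(wrapper))  # survives serialization
clone.logger.info("after shipping")          # logger recreated lazily
```

The same trick is what `@transient lazy val log = Logger.getLogger(...)` achieves on the JVM: the field is skipped during serialization and re-initialized on first access on the executor.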
Hi!
I had a similar issue when the user that submitted the job to the Spark
cluster didn't have permission to write to HDFS. If you have the HDFS
GUI you can check which users exist and what permissions they have. You can
also check in the HDFS browser:
http://stackoverflow.com/questions/27996034/opening-a-hdfs-fi
Hello!
I'm also looking for unit testing of Spark Java applications. I've seen the
great work done in spark-testing-base, but it seemed to me that I could not
use it for Spark Java applications.
Are only Spark Scala applications supported?
Thanks.
Regards,
Florin
On Wed, May 23, 2018 at 8:07 AM, umarg
> ...with Java applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin wrote:
Hello!
I have a parquet file of 60 MB representing 10 million records.
When I read this file using Spark 2.3.0 with the
configuration spark.sql.files.maxPartitionBytes=1024*1024*2 (=2MB), I got 29
partitions as expected.
Code:
sqlContext.setConf("spark.sql.files.maxPartitionBytes",
Long
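For a back-of-the-envelope check of the partition count, dividing the file size by the configured split size gives the rough number of input splits. This sketch ignores `spark.sql.files.openCostInBytes` and the default parallelism, which Spark also folds into the actual split size, so it is an estimate only:

```python
import math

def estimated_partitions(file_bytes, max_partition_bytes):
    """Rough estimate of input partitions when a file is split by
    spark.sql.files.maxPartitionBytes (ignoring openCostInBytes and
    defaultParallelism, which Spark also considers)."""
    return math.ceil(file_bytes / max_partition_bytes)

# 60 MB file, 2 MB target split size -> about 30 splits; Parquet
# row-group boundaries can shift the real count slightly (the thread
# above observed 29).
print(estimated_partitions(60 * 1024 * 1024, 2 * 1024 * 1024))
```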
Hello!
I would like to know if there is any feature in Spark DataFrame that, when
writing data to an Impala table, also creates that table when it
was not previously created in Impala.
For example, the code:
myDataframe.write.mode(SaveMode.Overwrite).jdbc(jdbcURL, "books",
connectio
Hi!
I would like to use tensorframes in my pyspark notebook.
I have performed the following:
1. In the Spark interpreter, added a new repository
http://dl.bintray.com/spark-packages/maven
2. In the Spark interpreter, added the
dependency databricks:tensorframes:0.2.9-s_2.11
3. pip install tensorfram
If yes, how?
I look forward to your answers.
Regards,
Florin
On Thu, Aug 9, 2018 at 3:52 AM, Jeff Zhang wrote:
>
> Make sure you use the correct python which has tensorframe installed. Use
> PYSPARK_PYTHON
> to configure the python
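Following the advice above, PYSPARK_PYTHON must point at the interpreter where tensorframes was pip-installed. A sketch of what that configuration could look like (the paths below are placeholders, not from the thread):

```shell
# Hypothetical paths: point Spark at the Python environment where
# tensorframes was pip-installed, for both executors and the driver.
export PYSPARK_PYTHON=/path/to/env/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/env/bin/python
```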
Hello!
Can you please share some code/thoughts on how to publish data from a
dataframe to RabbitMQ?
Thanks.
Regards,
Florin
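One common pattern for this is to serialize each row to JSON and publish from `foreachPartition`, opening one channel per partition. A sketch of the per-partition logic, with the channel injected so it can be exercised without a broker; the exchange name and the `FakeChannel` stand-in are assumptions, and in a real job the channel would come from a library such as pika:

```python
import json

def publish_partition(rows, channel, exchange="df_events"):
    """Publish each row of a DataFrame partition as a JSON message.
    `channel` is anything with basic_publish(exchange, routing_key, body),
    e.g. a pika BlockingChannel; it is injected here so the logic can be
    tested without a running RabbitMQ broker."""
    count = 0
    for row in rows:
        channel.basic_publish(exchange=exchange,
                              routing_key="",
                              body=json.dumps(row))
        count += 1
    return count

# In Spark this would be wired up roughly as:
#   df.toJSON().foreachPartition(lambda it: publish_partition(it, open_channel()))

class FakeChannel:  # illustrative stand-in for a pika channel
    def __init__(self):
        self.messages = []
    def basic_publish(self, exchange, routing_key, body):
        self.messages.append(body)

ch = FakeChannel()
publish_partition([{"id": 1}, {"id": 2}], ch)
```

Opening the connection inside `foreachPartition` (rather than on the driver) matters because connection objects are not serializable.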
Hello!
I would like to use Spark structured streaming integrated with Kafka
in the way described here:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
but I got the following issue:
Caused by: org.apache.spark.sql.AnalysisException: Failed to find data
sourc
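A "Failed to find data source" error for Kafka usually means the Kafka connector is not on the classpath, since it ships separately from Spark core. A sketch of the fix, assuming a Spark 2.4 / Scala 2.11 cluster (adjust the versions to match your installation):

```shell
# Add the structured-streaming Kafka connector at submit time;
# the artifact version must match your Spark and Scala versions.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  my_app.py
```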
On Tue, Jul 30, 2019 at 4:38 PM Spico Florin wrote:
Hello!
I have a use case where I have to apply multiple already-trained models
(e.g. M1, M2, ..., Mn) on the same Spark stream (fetched from Kafka).
The models were trained using the isolation forest algorithm from here:
https://github.com/titicaca/spark-iforest
I have found something similar
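A straightforward way to structure this is to hold the trained models in a (broadcast) collection and apply each one to every record of the stream. A minimal sketch of the fan-out logic; the `ThresholdModel` stand-in is invented for illustration and only mimics the `predict` interface of a real model:

```python
def score_all(models, record):
    """Apply several already-trained models to one record and collect
    their scores keyed by model name. Each model only needs a
    predict(record) -> score method, so the same loop works whether the
    models came from spark-iforest or elsewhere."""
    return {name: m.predict(record) for name, m in models.items()}

# Illustrative stand-ins for trained anomaly models:
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, record):
        return 1 if record["value"] > self.threshold else 0

models = {"M1": ThresholdModel(10), "M2": ThresholdModel(100)}
print(score_all(models, {"value": 42}))  # {'M1': 1, 'M2': 0}
```

In a streaming job this would typically run inside a `map` over the micro-batch, with the models broadcast once rather than shipped per task.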
Hi!
What file system are you using: EMRFS or HDFS?
Also, what memory are you using for the reducer?
On Thu, Nov 7, 2019 at 8:37 PM abeboparebop wrote:
> I ran into the same issue processing 20TB of data, with 200k tasks on both
> the map and reduce sides. Reducing to 100k tasks each resolved the
Hello!
Maybe you can find more information on the same issue reported here:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSourceProvider.html
validateGeneralOptions makes sure that group.id has not been specified and
reports an IllegalArgumentException ot
Hello!
I received the following errors in the workerLog.log files:
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
-> [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed with
[akka.tcp://sparkExecutor@stream4:47929]] [
akka.remote.EndpointAssociationE
Hello!
I'm using spark-csv 2.10 with Java from the Maven repository:
  <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>0.1.1</version>
  </dependency>
I would like to use Spark-SQL to filter out my data. I'm using the
following code:
JavaSchemaRDD cars = new JavaCsvParser().withUseHeader(true).csvFile(
sqlContext, logFile);
cars.registerAsTabl
Hi!
I'm new to Spark. I have a case study where the data is stored in CSV
files. These files have headers with more than 1000 columns. I would like
to know the best practices for parsing them, and in particular the
following points:
1. Getting and parsing all the files from a folder
2. Wh
Hello!
I'm a newbie to Spark and I have the following case study:
1. A client sends the following data every 100 ms:
{uniqueId, timestamp, measure1, measure2}
2. Every 30 seconds I would like to correlate the data collected in the
window with some predefined double vector pattern for each given key.
Hello!
I've read the documentation about the Spark architecture and I have the
following questions:
1. How many executors can be on a single worker process (JVM)?
2. Should I think of an executor like a Java thread executor where the pool
size is equal to the number of the given cores (set up by the
SPARK
on your machine:
> ps aux | grep spark.deploy => will show master, worker (coordinator) and
> driver JVMs
> ps aux | grep spark.executor => will show the actual worker JVMs
>
> 2015-02-25 14:23 GMT+01:00 Spico Florin :
Hello!
I would like to know what is the optimal solution for getting the header
from a CSV file with Spark. My approach was:
def getHeader(data: RDD[String]): String =
  data.zipWithIndex().filter(_._2 == 0).map(_._1).take(1).mkString("")
Thanks.
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Tue, Mar 24, 2015 at 7:12 AM, Spico Florin wrote:
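A simpler alternative to the `zipWithIndex` approach above is to take the first element directly: on an RDD this is `data.first()`, which only reads the first partition instead of indexing the whole dataset. A tiny Python sketch of the same idea on any line-oriented source:

```python
def get_header(lines):
    """Return the first line of a line-oriented dataset. On a Spark RDD
    the equivalent is simply data.first(), which avoids the full
    zipWithIndex pass over the data."""
    return next(iter(lines))

print(get_header(["C1;C2;C3", "11;22;33"]))  # C1;C2;C3
```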
Hello!
I have a CSV file that has the following content:
C1;C2;C3
11;22;33
12;23;34
13;24;35
What is the best approach to use Spark (API, MLLib) for achieving the
transpose of it?
C1 11 12 13
C2 22 23 24
C3 33 34 35
I look forward to your solutions and suggestions (some Scala code will be
rea
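Since a table of this size fits comfortably on the driver, one option is to `collect()` the rows and transpose locally; the distributed alternative (flatMapping each cell to `(columnIndex, (rowIndex, value))` and grouping by column) is only needed for data that does not fit in memory. A sketch of the local version on the sample above:

```python
def transpose(rows):
    """Transpose a small row-oriented table; with data this size the
    whole file fits on the driver, so collect() plus a local transpose
    is often simpler than a distributed one."""
    return [list(col) for col in zip(*rows)]

lines = ["C1;C2;C3", "11;22;33", "12;23;34", "13;24;35"]
rows = [line.split(";") for line in lines]
for col in transpose(rows):
    print(" ".join(col))
```

This prints exactly the layout asked for in the message: `C1 11 12 13`, `C2 22 23 24`, `C3 33 34 35`.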
Hello!
The result of correlation in Spark MLlib is of type
org.apache.spark.mllib.linalg.Matrix. (see
http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations)
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the
> You can turn the Matrix into an Array with .toArray and then:
> 1- Write it using Scala/Java IO to the local disk of the driver
> 2- parallelize it and use .saveAsTextFile
>
> 2015-04-15 14:16 GMT+02:00 Spico Florin :
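One detail worth noting when following the `.toArray` advice: MLlib's dense matrices store their values in column-major order, so the flat array has to be regrouped before writing one matrix row per line. A sketch of that regrouping in plain Python (the 2x2 sample values are invented for illustration):

```python
def column_major_to_rows(values, num_rows, num_cols):
    """Rebuild rows from a flat column-major array (the layout MLlib's
    DenseMatrix uses), so each line written to disk is one matrix row."""
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

# 2x2 correlation matrix stored column-major: [m00, m10, m01, m11]
flat = [1.0, 0.5, 0.5, 1.0]
for row in column_major_to_rows(flat, 2, 2):
    print(",".join(str(v) for v in row))
```

The resulting rows can then be written with local Scala/Java IO on the driver, or parallelized and saved with `saveAsTextFile` as the reply suggests.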
Hello!
I'm a newbie in Spark and I would like to understand some basic mechanisms
of how it works behind the scenes.
I have attached the lineage of my RDD and I have the following questions:
1. Why do I have 8 stages instead of 5? From the book Learning from Spark
(Chapter 8 -http://bit.ly/1E0Hah7), I cou
Hello!
I know that HadoopRDD partitions are built based on the number of splits
in HDFS. I'm wondering if these partitions preserve the initial order of
the data in the file.
As an example, if I have an HDFS (myTextFile) file that has these splits:
split 0-> line 1, ..., line k
split 1->line k+1,..., li
I have tested the sortByKey method with the following code and I have
observed that it triggers a new job when called. I could find this
neither in the API nor in the code. Is this an intended behavior? For
example, the RDD zipWithIndex method API specifies that it will trigger a
new job. But what