Spark 2.3.0 --files vs. addFile()

2018-05-09 Thread Marius
spark.hadoop.fs.s3a.secret.key=$s3Secret \ --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ ${execJarPath} I am using Spark 2.3.0 with Scala on a standalone cluster with three workers. Cheers, Marius
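
For reference, a minimal sketch of the SparkContext.addFile route, with a made-up S3 path; --files on spark-submit ships files in much the same way, but only at submit time:

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    val sc = new SparkContext(new SparkConf().setAppName("addFile-example"))

    // Ship a file to every executor; the path below is a placeholder.
    sc.addFile("s3a://my-bucket/lookup.json")

    val firstLines = sc.parallelize(1 to 4).map { _ =>
      // SparkFiles.get resolves the local copy of the shipped file on each executor.
      scala.io.Source.fromFile(SparkFiles.get("lookup.json")).getLines().next()
    }.collect()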

Spark Kubernetes Volumes

2018-04-12 Thread Marius
and they are too large to justify copying them around using addFile. If this is not possible, I would like to know whether the community would be interested in such a feature. Cheers, Marius
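
For readers finding this later: Spark 2.4 added conf keys for mounting Kubernetes volumes directly; a sketch of what that looks like (the volume name and paths are invented, and these keys do not exist in 2.3):

    import org.apache.spark.SparkConf

    // Spark 2.4+ only: mount a hostPath volume named "data" into the executors.
    val conf = new SparkConf()
      .set("spark.kubernetes.executor.volumes.hostPath.data.mount.path", "/data")
      .set("spark.kubernetes.executor.volumes.hostPath.data.options.path", "/mnt/data")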

[Spark Core]: S3a with Openstack swift object storage not using credentials provided in sparkConf

2017-11-15 Thread Marius
the S3 handler is not using the provided credentials. Does anyone have an idea how to fix this? Cheers and thanks in advance, Marius
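
For anyone hitting the same problem, a common workaround is to set the credentials on the SparkContext's Hadoop configuration directly; the endpoint, keys, and bucket below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-credentials"))

    // Placeholder credentials and endpoint; for Swift's S3 API the endpoint
    // must point at the object store rather than AWS.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "SECRET_KEY")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://swift.example.com")
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val lines = sc.textFile("s3a://my-bucket/input.txt")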

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
it works similarly to reduceByKey. > > On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct? > >> On 11.08.2016, at 05:42, Holden Karau

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
In DataFrames (and thus in 1.5 in general) this is not possible, correct? > On 11.08.2016, at 05:42, Holden Karau wrote: > > Hi Luis, > > You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you > can do groupBy followed by a reduce on the
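
A sketch of the groupBy-followed-by-aggregate approach mentioned in the reply, assuming a sqlContext is in scope; the column names are invented:

    import org.apache.spark.sql.functions.sum

    // Roughly the DataFrame counterpart of rdd.reduceByKey(_ + _):
    // group on the key column and aggregate the values.
    val df = sqlContext.createDataFrame(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")
    val reduced = df.groupBy("key").agg(sum("value").as("value"))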

Re: Spark Web UI port 4040 not working

2016-07-27 Thread Marius Soutier
That's to be expected - the application UI is not started by the master, but by the driver. So the UI will run on the machine that submits the job. > On 26.07.2016, at 15:49, Jestin Ma wrote: > > I did netstat -apn | grep 4040 on machine 6, and I see > > tcp

Re: Substract two DStreams

2016-06-28 Thread Marius Soutier
normal join. This should be faster than joining and subtracting, then. > Anyway, thanks for the hint about the transformWith method! > > On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote: > `transformWith` accepts another stream
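
For completeness, a sketch of the transformWith approach for subtracting one DStream from another, batch by batch:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Elements present in `original` but absent from `other`, computed per batch.
    def subtractStreams[T: ClassTag](original: DStream[T], other: DStream[T]): DStream[T] =
      original.transformWith(other, (a: RDD[T], b: RDD[T]) => a.subtract(b))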

Re: Substract two DStreams

2016-06-27 Thread Marius Soutier
Can't you use `transform` instead of `foreachRDD`? > On 15.06.2016, at 15:18, Matthias Niehoff > wrote: > > Hi, > > I want to subtract 2 DStreams (based on the same Input Stream) to get all > elements that exist in the original stream, but not in the

Re: Use cases for kafka direct stream messageHandler

2016-03-08 Thread Marius Soutier
> On 04.03.2016, at 22:39, Cody Koeninger wrote: > > The only other valid use of messageHandler that I can think of is > catching serialization problems on a per-message basis. But with the > new Kafka consumer library, that doesn't seem feasible anyway, and > could be
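
For context, a sketch of the messageHandler hook in the old Kafka 0.8 direct stream API, used here to swallow per-message decoding problems; the broker, topic, and offsets are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def keyValueStream(ssc: StreamingContext) = {
      val kafkaParams = Map("metadata.broker.list" -> "broker:9092")      // placeholder broker
      val fromOffsets = Map(TopicAndPartition("events", 0) -> 0L)         // placeholder topic/offset
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Option[(String, String)]](
        ssc, kafkaParams, fromOffsets,
        // messageHandler: turn a problematic record into None instead of failing the batch
        (mmd: MessageAndMetadata[String, String]) => scala.util.Try((mmd.key, mmd.message)).toOption)
    }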

Re: spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-10 Thread Marius Soutier
Found an issue for this: https://issues.apache.org/jira/browse/SPARK-10251 > On 09.09.2015, at 18:00, Marius Soutier <mps@gmail.com> wrote: > Hi all, > > as indicated in the title, I’m using Kryo wi

spark.kryo.registrationRequired: Tuple2 is not registered

2015-09-09 Thread Marius Soutier
with Tuple2, which I cannot serialize sanely for all specialized forms. According to the documentation, this should be handled by Chill. Is this a bug or what am I missing? I’m using Spark 1.4.1. Cheers - Marius
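
The workaround usually suggested for this is to register the tuple classes by hand; a sketch (the registered classes are examples, and the specialized primitive variants from SPARK-10251 are the part that is awkward to list):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // Register Tuple2 and arrays of it explicitly; the specialized primitive
      // variants are the ones that keep tripping registrationRequired.
      .registerKryoClasses(Array(
        classOf[scala.Tuple2[_, _]],
        classOf[Array[scala.Tuple2[_, _]]]
      ))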

Re: Java 8 vs Scala

2015-07-17 Thread Marius Danciu
If you take the time to actually learn Scala, starting from its fundamental concepts, AND quite importantly get familiar with general functional programming concepts, you'd immediately realize the things you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła

DataFrame from RDD[Row]

2015-07-16 Thread Marius Danciu
Hi, This is an ugly solution because it requires pulling out a row: val rdd: RDD[Row] = ... ctx.createDataFrame(rdd, rdd.first().schema) Is there a better alternative to get a DataFrame from an RDD[Row], since toDF won't work as Row is not a Product? Thanks, Marius
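
One alternative, assuming the schema is known up front and a SparkContext sc plus sqlContext are in scope, is to build a StructType explicitly instead of pulling it off the first row (field names invented):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    val rdd: RDD[Row] = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val df = sqlContext.createDataFrame(rdd, schema)  // no need to touch rdd.first()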

Re: Optimizations

2015-07-03 Thread Marius Danciu
suspect there are other technical reasons*). If anyone knows the depths of the problem, it would be of great help. Best, Marius On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito silvio.fior...@granturing.com wrote: One thing you could do is a broadcast join. You take your smaller RDD, save
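
A sketch of the broadcast-join idea from the reply: collect the small RDD as a map, broadcast it, and join map-side without a shuffle (sc assumed in scope, types illustrative):

    val small = sc.parallelize(Seq(1 -> "a", 2 -> "b"))            // small lookup side
    val large = sc.parallelize(Seq(1 -> 1.0, 2 -> 2.0, 3 -> 3.0))  // large fact side

    val lookup = sc.broadcast(small.collectAsMap())

    val joined = large.mapPartitions { iter =>
      // Inner join against the broadcast map; keys missing on the small side are dropped.
      iter.flatMap { case (k, v) => lookup.value.get(k).map(s => (k, (v, s))) }
    }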

Optimizations

2015-07-03 Thread Marius Danciu
function, all running in the same state without any other costs. Best, Marius

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com wrote: Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned
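
For reference, a minimal sketch of that call; HashPartitioner stands in for whatever custom partitioner is actually used (sc assumed in scope):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

    // Each record lands in the partition chosen by the partitioner, and records
    // are sorted by key within each partition -- no separate sortByKey needed.
    val partitionedAndSorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))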

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
seemed a natural fit ( ... I am aware of its limitations). Thanks, Marius On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote: Hi Marius, I am also a little confused -- are you saying that myPartitions is basically something like: class MyPartitioner extends Partitioner
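
A minimal sketch of such a custom Partitioner; the routing rule here is a made-up example, not the MyPartitioner from the thread:

    import org.apache.spark.Partitioner

    class MyHashPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        // Non-negative modulo on the key's hash code.
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod
      }
    }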

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
nodes. In my case I see 2 yarn containers receiving records during a mapPartition operation applied on the sorted partition. I need to test more but it seems that applying the same partitioner again right before the last mapPartition can help. Best, Marius On Tue, Apr 28, 2015 at 4:40 PM Silvio

Spark partitioning question

2015-04-28 Thread Marius Danciu
explain and f fails. The overall behavior of this job is that sometimes it succeeds and sometimes it fails ... apparently due to inconsistent propagation of sorted records to yarn containers. If any of this makes any sense to you, please let me know what I am missing. Best, Marius

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com wrote: Hello anyone, I have a question regarding the sort shuffle. Roughly I'm doing something like: rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) The problem is that in f2 I don't see

Re: Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Marius Soutier
Same problem here... On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt

Re: Streaming problems running 24x7

2015-04-20 Thread Marius Soutier
The processing speed displayed in the UI doesn’t seem to take everything into account. I also had a low processing time but had to increase batch duration from 30 seconds to 1 minute because waiting batches kept increasing. Now it runs fine. On 17.04.2015, at 13:30, González Salgado, Miquel
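
For context, the batch duration is fixed when the StreamingContext is created, so changing it means recreating the context; a sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-24x7")
    // Batch duration raised from 30 seconds to 1 minute, as described above.
    val ssc = new StreamingContext(conf, Minutes(1))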

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. From the source code comments: // SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the // application finishes. On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote: Does

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-14 Thread Marius Soutier
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are restarting long-running jobs once in a while for cleanups and have spark.cleaner.ttl set to a lower value than the default. On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote: Right, I

Re: Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example: export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=<seconds>" On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have an answer for

actorStream woes

2015-03-30 Thread Marius Soutier
records per receiver per batch. I have 5 actor streams (one per node) with 10 total cores assigned. Driver has 3 GB RAM, each worker 4 GB. There is certainly no memory pressure, Memory Used is around 100kb, Input is around 10 MB. Thanks for any pointers, - Marius

Re: Processing of text file in large gzip archive

2015-03-16 Thread Marius Soutier
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoop file for this. Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is compute splits on gz files, so if you have a single file, you'll have a single partition.
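
In other words, something like the following reads the file fine, but a single gzipped file arrives as a single partition unless it is repartitioned afterwards (path is a placeholder, sc assumed in scope):

    // A .gz file is readable but not splittable, so it starts out as one partition.
    val lines = sc.textFile("hdfs:///data/large-file.json.gz")
    println(lines.partitions.length)   // 1 for a single gzipped file

    val spread = lines.repartition(32) // shuffle once to spread the work across the cluster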

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-12 Thread Marius Soutier
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations TD On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier mps@gmail.com wrote: Forgot

Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) Cheers - Marius

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")). On 11.03.2015, at 18:35, Marius Soutier mps@gmail.com wrote: Hi, I’ve written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added
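
The pattern usually recommended for this (see the streaming guide linked in the reply above) is to re-create the SQLContext lazily inside foreachRDD instead of capturing it in the checkpointed closure; a sketch, using the newer DataFrame write API and a placeholder output path:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    // Lazily instantiated singleton, so recovery from a checkpoint can rebuild it.
    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      def getInstance(sc: SparkContext): SQLContext = synchronized {
        if (instance == null) instance = new SQLContext(sc)
        instance
      }
    }

    def save(stream: DStream[(String, Long)]): Unit =
      stream.foreachRDD { rdd =>
        val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
        import sqlContext.implicits._
        rdd.toDF("key", "count").write.mode("append").parquet("/tmp/streaming-out")
      }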

Re: Executor lost with too many temp files

2015-02-26 Thread Marius Soutier
-n on your machine. On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote: Hi Sameer, I’m still using Spark 1.1.1, I think the default is hash shuffle. No external shuffle service. We are processing gzipped JSON files, the partitions

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers? Assuming I’m doing joins or other shuffle operations. Thanks
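
For reference, the call in question; the replicated variants live on StorageLevel (sc assumed in scope):

    import org.apache.spark.storage.StorageLevel

    val data = sc.parallelize(1 to 1000000)

    // The _2 levels keep two copies of every partition on different nodes.
    data.persist(StorageLevel.MEMORY_AND_DISK_2)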

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
for computations? yes they can. On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote: Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers

Re: Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
. Everything above that makes it very likely it will crash, even on smaller datasets (~300 files). But I’m not sure if this is related to the above issue. On 23.02.2015, at 18:15, Sameer Farooqui same...@databricks.com wrote: Hi Marius, Are you using the sort or hash shuffle? Also, do you have

Executor lost with too many temp files

2015-02-23 Thread Marius Soutier
, following jobs will struggle with completion. There are a lot of failures without any exception message, only the above mentioned lost executor. As soon as I clear out /var/run/spark/work/ and the spill disk, everything goes back to normal. Thanks for any hint, - Marius

Re: Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-10 Thread Marius Soutier
): java.io.FileNotFoundException: /tmp/spark-local-20150210030009-b4f1/3f/shuffle_4_655_49 (No space left on device) Even though there’s plenty of disk space left. On 10.02.2015, at 00:09, Muttineni, Vinay vmuttin...@ebay.com wrote: Hi Marius, Did you find a solution to this problem? I get the same error. Thanks, Vinay

Executor Lost with StorageLevel.MEMORY_AND_DISK_SER

2015-02-09 Thread Marius Soutier
, most recent failure: Lost task 10.3 in stage 2.0 (TID 20, xxx.compute.internal): ExecutorLostFailure (executor lost) Driver stacktrace: Is there any way to understand what’s going on? The logs don’t show anything. I’m using Spark 1.1.1. Thanks - Marius

Re: Intermittent test failures

2014-12-17 Thread Marius Soutier
) at scala.Option.foreach(Option.scala:236) On 15.12.2014, at 22:36, Marius Soutier mps@gmail.com wrote: Ok, maybe these test versions will help me then. I’ll check it out. On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote: Using a single SparkContext should not cause this problem

Migrating Parquet inputs

2014-12-15 Thread Marius Soutier
Hi, is there an easy way to “migrate” parquet files or indicate optional values in SQL statements? I added a couple of new fields that I also use in a schemaRDD.sql() which obviously fails for input files that don’t have the new fields. Thanks - Marius
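
One way this is handled in later Spark versions is Parquet schema merging on read, so older files simply return null for the newly added columns; a sketch assuming the DataFrame reader API (Spark 1.3+) and a placeholder path:

    // Merge schemas across Parquet files written with different column sets.
    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/events")

    merged.registerTempTable("events")  // new columns come back as null for old files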

Intermittent test failures

2014-12-15 Thread Marius Soutier
of HiveContext. It does not seem to have anything to do with the actual files that I also create during the test run with SQLContext.saveAsParquetFile. Cheers - Marius PS The full trace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 6.0 failed 1 times

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
of our unit testing. On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier mps@gmail.com wrote: Possible, yes, although I’m trying everything I can to prevent it, i.e. fork in Test := true and isolated. Can you confirm that reusing a single SparkContext for multiple tests poses a problem as well

Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). You just have to import sqlContext._ On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote: Hello, I'm writing a process that ingests json files and saves them a parquet files. The process is as such:
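
A sketch of that suggestion, assuming a Spark 1.1-era SchemaRDD, an existing table named "events", and a placeholder input path:

    import sqlContext._  // as mentioned above, brings the SQL implicits into scope

    val json = sqlContext.jsonFile("/data/incoming/batch-001.json")
    json.insertInto("events", overwrite = true)  // the target table must already exist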

Re: Snappy temp files not cleaned up

2014-11-06 Thread Marius Soutier
Default value is infinite, so you need to enable it. Personally I’ve setup a couple of cron jobs to clean up /tmp and /var/run/spark. On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote: Hello, Spark has an internal cleanup mechanism (defined by spark.cleaner.ttl, see

Re: SparkSQL performance

2014-11-03 Thread Marius Soutier
I did some simple experiments with Impala and Spark, and Impala came out ahead. But it’s also less flexible, couldn’t handle irregular schemas, didn't support Json, and so on. On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote: I agree. My personal experience with Spark

Re: Submiting Spark application through code

2014-11-02 Thread Marius Soutier
Just a wild guess, but I had to exclude "javax.servlet:servlet-api" from my Hadoop dependencies to run a SparkContext. In your build.sbt: "org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"), "org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")

Re: use additional ebs volumes for hsdf storage with spark-ec2

2014-11-01 Thread Marius Soutier
Are these /vols formatted? You typically need to format and define a mount point in /mnt for attached EBS volumes. I’m not using the ec2 script, so I don’t know what is installed, but there’s usually an HDFS info service running on port 50070. After changing hdfs-site.xml, you have to restart

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-27 Thread Marius Soutier
So, apparently `wholeTextFiles` runs the job again, passing null as argument list, which in turn blows up my argument parsing mechanics. I never thought I had to check for null again in a pure Scala environment ;) On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote: I tried

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-26 Thread Marius Soutier
From: Marius Soutier [mps@gmail.com] Sent: Friday, October 24, 2014 6:35 AM To: user@spark.apache.org Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0 Hi, I’m running a job whose simple task it is to find files

scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-24 Thread Marius Soutier
${t.getStackTrace.head}) } } Also since 1.1.0, the printlns are no longer visible on the console, only in the Spark UI worker output. Thanks for any help - Marius

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
for any insights, - Marius

Python vs Scala performance

2014-10-22 Thread Marius Soutier
plan is quite different, I’m seeing writes to the output much later than in Scala. Is Python I/O really that slow? Thanks - Marius

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
that the execution plan is quite different, I’m seeing writes to the output much later than in Scala. Is Python I/O really that slow? Thanks - Marius

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Didn’t seem to help: conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12") sc = SparkContext(appName='app_name', conf=conf) but still taking as much time On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote: Total guess without knowing

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Yeah we’re using Python 2.7.3. On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr wrote: Wild guess maybe, but do you decode the json records in Python ? it could be much slower as the

Re: Python vs Scala performance

2014-10-22 Thread Marius Soutier
Can’t install that on our cluster, but I can try locally. Is there a pre-built binary available? On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote: In the master, you can easily profile you job, find the bottlenecks, see https://github.com/apache/spark/pull/2556 Could you try

parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive “File /file/to/path/*/input.parquet does not exist.” Is this normal or a bug? Is there a workaround? Thanks - Marius

Re: parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust mich...@databricks.com wrote: This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier
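
A sketch of that manual-listing workaround, using the Hadoop FileSystem glob API and handing parquetFile a comma-separated list, as suggested above (sc and sqlContext assumed in scope):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)

    // Expand the glob ourselves, then pass a comma-separated list of concrete paths.
    val inputs = fs.globStatus(new Path("/file/to/path/*/input.parquet"))
      .map(_.getPath.toString)
      .mkString(",")

    val parquet = sqlContext.parquetFile(inputs)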

Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However, the problem remains: how do I get that data back to a dashboard? So I guess I’ll have to use a database after all. You can batch up data store into parquet partitions as

Re: Serving data

2014-09-15 Thread Marius Soutier
another SparkSQL shell; the JDBC driver in SparkSQL is part of 1.1, I believe. -- Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote: Hi there, I’m pretty new to Spark

Re: Serving data

2014-09-15 Thread Marius Soutier
So you are living the dream of using HDFS as a database? ;) On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote: I'm using Parquet in ADAM, and I can say that it works pretty fine! Enjoy ;-) aℕdy ℙetrella about.me/noootsab On Mon, Sep 15, 2014 at 1:41 PM, Marius

Serving data

2014-09-12 Thread Marius Soutier
or databases? Thanks - Marius