spark.hadoop.fs.s3a.secret.key=$s3Secret \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
${execJarPath}
I am using Spark v2.3.0 with Scala on a standalone cluster
with three workers.
Cheers
Marius
and they
are too large to justify copying them around using addFile. If this is
not possible, I would like to know whether the community would be
interested in such a feature.
Cheers
Marius
the s3 handler is not using the provided credentials.
Does anyone have an idea how to fix this?
Cheers and thanks in advance
Marius
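A rough sketch of an alternative, assuming a SparkSession named spark and the
same s3Key/s3Secret values as above: the credentials can also be set directly
on the Hadoop configuration at runtime instead of via --conf.

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", s3Key)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", s3Secret)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl",
  "org.apache.hadoop.fs.s3a.S3AFileSystem")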
it works similarly to reduceByKey.
>
> On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
>
>> On 11.08.2016, at 05:42, Holden Karau &
In DataFrames (and thus in 1.5 in general) this is not possible, correct?
> On 11.08.2016, at 05:42, Holden Karau wrote:
>
> Hi Luis,
>
> You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you
> can do groupBy followed by a reduce on the
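A rough sketch of that idea in the Spark 2.x Dataset API (the exact 1.6.2 call
is cut off above; the case class and the SparkSession named spark are made up):

import spark.implicits._

case class Sale(shop: String, amount: Double)
val sales = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 3.0)).toDS()
// groupByKey followed by reduceGroups behaves much like reduceByKey on RDDs
val totals = sales.groupByKey(_.shop)
  .reduceGroups((x, y) => Sale(x.shop, x.amount + y.amount))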
That's to be expected - the application UI is not started by the master, but by
the driver. So the UI will run on the machine that submits the job.
> On 26.07.2016, at 15:49, Jestin Ma wrote:
>
> I did netstat -apn | grep 4040 on machine 6, and I see
>
> tcp
normal join. This should be faster than joining and subtracting, then.
> Anyway, thanks for the hint about the transformWith method!
>
> On 27 June 2016 at 14:32, Marius Soutier <mps@gmail.com> wrote:
> `transformWith` accepts another stream
Can't you use `transform` instead of `foreachRDD`?
> On 15.06.2016, at 15:18, Matthias Niehoff
> wrote:
>
> Hi,
>
> I want to subtract 2 DStreams (based on the same Input Stream) to get all
> elements that exist in the original stream, but not in the
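A rough sketch of the transformWith approach for that, assuming two DStreams
of the same element type (names are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// elements of `original` that do not appear in `filtered`, per batch
def difference(original: DStream[String], filtered: DStream[String]): DStream[String] =
  original.transformWith(filtered, (a: RDD[String], b: RDD[String]) => a.subtract(b))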
> On 04.03.2016, at 22:39, Cody Koeninger wrote:
>
> The only other valid use of messageHandler that I can think of is
> catching serialization problems on a per-message basis. But with the
> new Kafka consumer library, that doesn't seem feasible anyway, and
> could be
Found an issue for this:
https://issues.apache.org/jira/browse/SPARK-10251
> On 09.09.2015, at 18:00, Marius Soutier <mps@gmail.com> wrote:
>
> Hi all,
>
> as indicated in the title, I’m using Kryo wi
with Tuple2, which I cannot serialize sanely
for all specialized forms. According to the documentation, this should be
handled by Chill. Is this a bug or what am I missing?
I’m using Spark 1.4.1.
Cheers
- Marius
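For reference, a minimal sketch of explicit Kryo registration; the specialized
Tuple2 variants can only be registered by name, and the class below (the
(Long, Long) specialization) is just an example, not necessarily the one
failing here:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register the erased Tuple2 plus one specialized variant
  .registerKryoClasses(Array(classOf[Tuple2[_, _]], Class.forName("scala.Tuple2$mcJJ$sp")))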
If you take the time to actually learn Scala, starting from its fundamental
concepts, AND quite importantly get familiar with general functional
programming concepts, you'd immediately realize the things that you'd
really miss going back to Java (8).
On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła
Hi,
This is an ugly solution because it requires pulling out a row:
val rdd: RDD[Row] = ...
ctx.createDataFrame(rdd, rdd.first().schema)
Is there a better alternative to get a DataFrame from an RDD[Row], since
toDF won't work because Row is not a Product?
Thanks,
Marius
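One way to avoid pulling out a row is to build the schema explicitly; a rough
sketch with made-up field names:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rdd: RDD[Row] = ...
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", IntegerType, nullable = true)))
val df = ctx.createDataFrame(rdd, schema)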
suspect there are other technical reasons*).
If anyone knows the depths of the problem, it would be of great help.
Best,
Marius
On Fri, Jul 3, 2015 at 6:43 PM Silvio Fiorito silvio.fior...@granturing.com
wrote:
One thing you could do is a broadcast join. You take your smaller RDD,
save
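The quoted suggestion is cut off here, but a rough sketch of the broadcast-join
idea looks like this (types and names are made up; the small side has to fit in
memory on each executor):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def broadcastJoin(sc: SparkContext,
                  large: RDD[(String, Long)],
                  small: RDD[(String, String)]): RDD[(String, (Long, String))] = {
  // ship the small side to every executor once
  val lookup = sc.broadcast(small.collectAsMap())
  large.mapPartitions { iter =>
    iter.flatMap { case (k, v) => lookup.value.get(k).map(w => (k, (v, w))) }
  }
}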
function, all running in the same state without any
other costs.
Best,
Marius
Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
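For anyone finding this later, a rough sketch of that combination; the key
shape (a (group, timestamp) pair) and all names are made up:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// partition by the group part of the key only; sorting within each partition
// then happens on the full (group, timestamp) key
class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (group: String, _) =>
      val h = group.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
  }
}

def sortedPerGroup(rdd: RDD[((String, Long), String)]): RDD[((String, Long), String)] =
  rdd.repartitionAndSortWithinPartitions(new GroupPartitioner(8))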
On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com
wrote:
Hi Imran,
Yes that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is partitioned
seemed a natural fit ( ... I am aware of its limitations).
Thanks,
Marius
On Mon, May 4, 2015 at 10:45 PM Imran Rashid iras...@cloudera.com wrote:
Hi Marius,
I am also a little confused -- are you saying that myPartitions is
basically something like:
class MyPartitioner extends Partitioner
nodes. In my case I see 2 yarn containers receiving records during a
mapPartition operation applied on the sorted partition. I need to test more
but it seems that applying the same partitioner again right before the
last mapPartition can
help.
Best,
Marius
On Tue, Apr 28, 2015 at 4:40 PM Silvio
explain and f fails.
The overall behavior of this job is that sometimes it succeeds and
sometimes it fails ... apparently due to inconsistent propagation of sorted
records to yarn containers.
If any of this makes any sense to you, please let me know what I am missing.
Best,
Marius
Anyone?
On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com
wrote:
Hello anyone,
I have a question regarding the sort shuffle. Roughly I'm doing something
like:
rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
The problem is that in f2 I don't see
Same problem here...
On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote:
Hi all,
it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on
the mirror sites. Am I missing something?
Regards,
Zsolt
The processing speed displayed in the UI doesn’t seem to take everything into
account. I also had a low processing time but had to increase batch duration
from 30 seconds to 1 minute because waiting batches kept increasing. Now it
runs fine.
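For reference, the batch interval is fixed when the StreamingContext is
created; a one-line sketch, assuming an existing SparkContext sc:

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(1))  // was Seconds(30) before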
On 17.04.2015, at 13:30, González Salgado, Miquel
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically.
From the source code comments:
// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.
On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Does
That’s true, spill dirs don’t get cleaned up when something goes wrong. We
are restarting long-running jobs once in a while for cleanups and have
spark.cleaner.ttl set to a lower value than the default.
On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote:
Right, I
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=seconds"
On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Does anybody have an answer for
records per receiver per batch. I have 5
actor streams (one per node) with 10 total cores assigned.
Driver has 3 GB RAM, each worker 4 GB.
There is certainly no memory pressure, Memory Used is around 100kb, Input
is around 10 MB.
Thanks for any pointers,
- Marius
1. I don't think textFile is capable of unpacking a .gz file. You need to use
hadoopFile or newAPIHadoopFile for this.
Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do is
compute splits on gz files, so if you have a single file, you'll have a single
partition.
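So with a single .gz file it usually helps to repartition right after reading;
a small sketch with a made-up path:

// one .gz file = one partition; redistribute before doing heavy work
val lines = sc.textFile("/data/logs/2015-04-01.json.gz")
val spread = lines.repartition(sc.defaultParallelism)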
/streaming-programming-guide.html#dataframe-and-sql-operations
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
TD
On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier mps@gmail.com wrote:
Forgot
$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
Cheers
- Marius
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile(“”)).
On 11.03.2015, at 18:35, Marius Soutier mps@gmail.com wrote:
Hi,
I’ve written a Spark Streaming job that inserts into a Parquet table, using
stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added
-n on your machine.
On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote:
Hi Sameer,
I’m still using Spark 1.1.1, I think the default is hash shuffle. No external
shuffle service.
We are processing gzipped JSON files, the partitions
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job speed
by avoiding network transfers? Assuming I’m doing joins or other shuffle
operations.
Thanks
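For completeness, the call in question, assuming an existing RDD named rdd:

import org.apache.spark.storage.StorageLevel

// the _2 levels keep two replicas of each cached partition
rdd.persist(StorageLevel.MEMORY_ONLY_2)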
for computations?
yes they can.
On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote:
Hi,
just a quick question about calling persist with the _2 option. Is the 2x
replication only useful for fault tolerance, or will it also increase job
speed by avoiding network transfers
. Everything above that makes it very likely
it will crash, even on smaller datasets (~300 files). But I’m not sure if this
is related to the above issue.
On 23.02.2015, at 18:15, Sameer Farooqui same...@databricks.com wrote:
Hi Marius,
Are you using the sort or hash shuffle?
Also, do you have
, following jobs struggle to complete. There are a lot
of failures without any exception message, only the above-mentioned lost
executor. As soon as I clear out /var/run/spark/work/ and the spill disk,
everything goes back to normal.
Thanks for any hint,
- Marius
):
java.io.FileNotFoundException:
/tmp/spark-local-20150210030009-b4f1/3f/shuffle_4_655_49 (No space left on
device)
Even though there’s plenty of disk space left.
On 10.02.2015, at 00:09, Muttineni, Vinay vmuttin...@ebay.com wrote:
Hi Marius,
Did you find a solution to this problem? I get the same error.
Thanks,
Vinay
, most
recent failure: Lost task 10.3 in stage 2.0 (TID 20, xxx.compute.internal):
ExecutorLostFailure (executor lost)
Driver stacktrace:
Is there any way to understand what’s going on? The logs don’t show anything.
I’m using Spark 1.1.1.
Thanks
- Marius
)
at scala.Option.foreach(Option.scala:236)
On 15.12.2014, at 22:36, Marius Soutier mps@gmail.com wrote:
Ok, maybe these test versions will help me then. I’ll check it out.
On 15.12.2014, at 22:33, Michael Armbrust mich...@databricks.com wrote:
Using a single SparkContext should not cause this problem
Hi,
is there an easy way to “migrate” Parquet files or indicate optional values in
SQL statements? I added a couple of new fields that I also use in a
schemaRDD.sql() which obviously fails for input files that don’t have the new
fields.
Thanks
- Marius
of HiveContext. It does not seem to have anything to do with
the actual files that I also create during the test run with
SQLContext.saveAsParquetFile.
Cheers
- Marius
PS The full trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in
stage 6.0 failed 1 times
of our
unit testing.
On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier mps@gmail.com wrote:
Possible, yes, although I’m trying everything I can to prevent it, i.e. fork
in Test := true and isolated. Can you confirm that reusing a single
SparkContext for multiple tests poses a problem as well
You can also insert into existing tables via .insertInto(tableName, overwrite).
You just have to import sqlContext._
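A rough sketch of that against the old Spark 1.x API (table and case class
names are made up, and the target table has to exist already):

import sqlContext._  // brings in the implicit RDD -> SchemaRDD conversion

case class Event(id: Long, payload: String)
val events = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
events.insertInto("events", overwrite = true)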
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello,
I'm writing a process that ingests JSON files and saves them as Parquet files.
The process is as such:
Default value is infinite, so you need to enable it. Personally I’ve set up a
couple of cron jobs to clean up /tmp and /var/run/spark.
On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote:
Hello,
Spark has an internal cleanup mechanism
(defined by spark.cleaner.ttl, see
I did some simple experiments with Impala and Spark, and Impala came out ahead.
But it’s also less flexible: it couldn’t handle irregular schemas, didn’t
support JSON, and so on.
On 01.11.2014, at 02:20, Soumya Simanta soumya.sima...@gmail.com wrote:
I agree. My personal experience with Spark
Just a wild guess, but I had to exclude “javax.servlet.servlet-api” from my
Hadoop dependencies to run a SparkContext.
In your build.sbt:
"org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"),
"org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")
Are these /vols formatted? You typically need to format and define a mount
point in /mnt for attached EBS volumes.
I’m not using the ec2 script, so I don’t know what is installed, but there’s
usually an HDFS info service running on port 50070. After changing
hdfs-site.xml, you have to restart
So, apparently `wholeTextFiles` runs the job again, passing null as argument
list, which in turn blows up my argument parsing mechanics. I never thought I
had to check for null again in a pure Scala environment ;)
On 26.10.2014, at 11:57, Marius Soutier mps@gmail.com wrote:
I tried
From: Marius Soutier [mps@gmail.com]
Sent: Friday, October 24, 2014 6:35 AM
To: user@spark.apache.org
Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since
Spark 1.1.0
Hi,
I’m running a job whose simple task is to find files
${t.getStackTrace.head})
}
}
Also since 1.1.0, the printlns are no longer visible on the console, only in
the Spark UI worker output.
Thanks for any help
- Marius
for any insights,
- Marius
plan is quite
different, I’m seeing writes to the output much later than in Scala.
Is Python I/O really that slow?
Thanks
- Marius
that the execution
plan is quite different, I’m seeing writes to the output much later than in
Scala.
Is Python I/O really that slow?
Thanks
- Marius
Didn’t seem to help:
conf = SparkConf().set("spark.shuffle.spill", "false").set("spark.default.parallelism", "12")
sc = SparkContext(appName="app_name", conf=conf)
but still taking as much time
On 22.10.2014, at 14:17, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Total guess without knowing
Yeah we’re using Python 2.7.3.
On 22.10.2014, at 20:06, Nicholas Chammas nicholas.cham...@gmail.com wrote:
On Wed, Oct 22, 2014 at 11:34 AM, Eustache DIEMERT eusta...@diemert.fr
wrote:
Wild guess maybe, but do you decode the JSON records in Python? It could be
much slower as the
Can’t install that on our cluster, but I can try locally. Is there a pre-built
binary available?
On 22.10.2014, at 19:01, Davies Liu dav...@databricks.com wrote:
In the master, you can easily profile your job and find the bottlenecks,
see https://github.com/apache/spark/pull/2556
Could you try
Hello,
sc.textFile and so on support wildcards in their path, but apparently
sqlc.parquetFile() does not. I always receive “File
/file/to/path/*/input.parquet does not exist”. Is this normal or a bug? Is
there a workaround?
Thanks
- Marius
Thank you, that works!
On 24.09.2014, at 19:01, Michael Armbrust mich...@databricks.com wrote:
This behavior is inherited from the Parquet input format that we use. You
could list the files manually and pass them as a comma-separated list.
On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier
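In case it helps anyone else, a rough sketch of listing the files manually,
assuming the same path pattern and a SQLContext named sqlc:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val inputs = fs.globStatus(new Path("/file/to/path/*/input.parquet")).map(_.getPath.toString)
val data = sqlc.parquetFile(inputs.mkString(","))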
Writing to Parquet and querying the result via SparkSQL works great (except for
some strange SQL parser errors). However, the problem remains: how do I get that
data back to a dashboard? So I guess I’ll have to use a database after all.
You can batch up data and store it into Parquet partitions as
another SparkSQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe.
--
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
On Fri, Sep 12, 2014 at 2:54 PM, Marius Soutier mps@gmail.com wrote:
Hi there,
I’m pretty new to Spark
So you are living the dream of using HDFS as a database? ;)
On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote:
I'm using Parquet in ADAM, and I can say that it works pretty fine!
Enjoy ;-)
aℕdy ℙetrella
about.me/noootsab
On Mon, Sep 15, 2014 at 1:41 PM, Marius
or databases?
Thanks
- Marius