Re: restarting ranger kms causes spark thrift server to stop

2018-06-24 Thread Rick Moritz
Hi, From what I can tell, that's an error in Ranger, not in Spark, as you can see from the package where the exception is thrown. The Spark Thrift Server in this instance is merely trying to call a Hadoop API, which then gets hijacked by Ranger. Your best bet is to look at the case in question, try to

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-26 Thread Rick Moritz
Hi, We solved this the ugly way, when parsing external column definitions: private def columnTypeToFieldType(columnType: String): DataType = { columnType match { case "IntegerType" => IntegerType case "StringType" => StringType case "DateType" => DateType case "FloatType" => Flo
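The truncated snippet appears to continue along these lines; a minimal sketch assuming Spark 2.2.x on the classpath, with the list of cases illustrative rather than exhaustive:

```scala
import org.apache.spark.sql.types._

// Map a type's string name back to its DataType singleton.
def columnTypeToFieldType(columnType: String): DataType =
  columnType match {
    case "IntegerType" => IntegerType
    case "StringType"  => StringType
    case "DateType"    => DateType
    case "FloatType"   => FloatType
    case "DoubleType"  => DoubleType
    case "LongType"    => LongType
    case "BooleanType" => BooleanType
    case other =>
      throw new IllegalArgumentException(s"Unsupported type name: $other")
  }
```

For nested or parameterized types, `DataType.fromJson` on a schema's JSON representation may be an alternative worth checking against your Spark version.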

Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-11 Thread Rick Moritz
Hi Gerard, hi List, I think what this would entail is for Source.commit to change its functionality. You would need to track all streams' offsets there. Especially in the socket source, you already have a cache (haven't looked at Kafka's implementation too closely yet), so that shouldn't be the is

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Rick Moritz
Put your jobs into a parallel collection using .par -- then you can submit them very easily to Spark, using .foreach. The jobs will then run using the FIFO scheduler in Spark. The advantage over the prior approaches is that you won't have to deal with Threads, and that you can leave parallelism
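The pattern described here can be sketched with a parallel collection; the Spark call is stood in for by a pure function so the sketch is self-contained, and the table names are assumptions for illustration:

```scala
// A .par collection runs its closures on multiple threads, so each
// iteration can submit an independent Spark job under FIFO scheduling.
val tables = Seq("table_a", "table_b", "table_c")

// In a real job this body would be, e.g., spark.table(name).count()
val results = tables.par.map(name => name.length).toList
```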

Re: "Sharing" dataframes...

2017-06-21 Thread Rick Moritz
Keeping it inside the same program/SparkContext is the most performant solution, since you can avoid serialization and deserialization. In-memory persistence between jobs involves a memory copy, uses a lot of RAM, and invokes serialization and deserialization. Technologies that can help you do that easi

Re: Spark consumes more memory

2017-05-11 Thread Rick Moritz
I would try to track down the "no space left on device" - find out where that originates from, since you should be able to allocate 10 executors with 4 cores and 15GB RAM each quite easily. In that case, you may want to increase overhead, so YARN doesn't kill your executors. Check that no local driv
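A hedged sketch of the overhead bump suggested here, for Spark 1.x/2.x on YARN; all values are hypothetical and should be sized to your workload:

```shell
# Ten executors at 4 cores / 15g each, with extra off-heap overhead
# so YARN's container limit isn't exceeded by native allocations.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 15g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my-app.jar
```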

Re: Create multiple columns in pyspak with one shot

2017-05-04 Thread Rick Moritz
In Scala you can first define your columns, and then use the list-to-vararg expander :_* in a select call, something like this: val cols = colnames.map(col).map(column => { lit(0) }) dF.select(cols: _*) I assume something similar should be possible in Java as well, from your snippet it's unc
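A cleaned-up sketch of the snippet in this reply; the DataFrame `df` and the names in `colnames` are assumptions for illustration, and Spark is assumed on the classpath:

```scala
import org.apache.spark.sql.functions.lit

// Build one zero-literal Column per name, then expand the whole list
// into a single select call with the ": _*" vararg splat.
val colnames = Seq("col_a", "col_b", "col_c")
val cols = colnames.map(name => lit(0).as(name))
val result = df.select(cols: _*)
```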

Concurrent DataFrame.saveAsTable into non-existant tables fails the second job despite Mode.APPEND

2017-04-20 Thread Rick Moritz
Hi List, I'm wondering if the following behaviour should be considered a bug, or whether it "works as designed": I'm starting multiple concurrent (FIFO-scheduled) jobs in a single SparkContext, some of which write into the same tables. When these tables already exist, it appears as though both jo

Re: Yarn containers getting killed, error 52, multiple joins

2017-04-14 Thread Rick Moritz
Potentially, with joins, you run out of memory on a single executor, because a small skew in your data is being amplified. You could try to increase the default number of partitions, reduce the number of simultaneous tasks in execution (spark.executor.cores), or add a repartitioning operation before/
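A hedged sketch of the mitigations named in this reply; the DataFrames `left` and `right`, the column `join_key`, the partition count, and the `spark` session are all assumptions for illustration:

```scala
import org.apache.spark.sql.functions.col

// Raise the shuffle partition count (default 200) and repartition on
// the join key before joining, so individual tasks stay smaller.
spark.conf.set("spark.sql.shuffle.partitions", "400")

val joined = left
  .repartition(400, col("join_key"))
  .join(right, "join_key")
```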

Feasability limits of joins in SparkSQL (Why does my driver explode with a large number of joins?)

2017-04-11 Thread Rick Moritz
Hi List, I'm currently trying to naively implement a Data-Vault-type Data-Warehouse using SparkSQL, and was wondering whether there's an inherent practical limit to query complexity, beyond which SparkSQL will stop functioning, even for relatively small amounts of data. I'm currently looking at a

Re: RE: Fast write datastore...

2017-03-16 Thread Rick Moritz
If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet might also be an option. Of course, management-wise it has much more overhead than using ES, since you need to manually define partitions and buckets, which is suboptimal. On the other hand, for querying, you can probably

Re: Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
n around two years or so". So now it's about finding out why that's the case, how to actually get to the point where these features could work in two years, and whether they should work at all. On Tue, Jan 17, 2017 at 6:38 PM, Sean Owen wrote: > On Tue, Jan 17, 2017 at 4:49 PM Rick

Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
Hi List, I've been following several projects with quite some interest over the past few years, and I've continued to wonder why they're not moving towards being supported by mainstream Spark distributions, and why they're not more frequently mentioned when it comes to enterprise adoption of Spark.

Re: Found Data Quality check package for Spark

2016-05-07 Thread Rick Moritz
Hi Divya, I haven't actually used the package yet, but maybe you should check out the gitter-room where the creator is quite active. You can find it on https://gitter.im/FRosner/drunken-data-quality . There you should be able to get the information you need. Best, Rick On 6 May 2016 12:34, "Div

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread Rick Moritz
Something to check (just in case): Are you getting identical results each time? On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote: > Hi sparkers, > > I am using dataframe to do some large ETL jobs. > More precisely, I create dataframe from HIVE table and do some operations. > And then I save it as

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-29 Thread Rick Moritz
and may be then do > an analysis. > > Best, > Kartik > > On Mon, Sep 28, 2015 at 11:38 AM, Rick Moritz wrote: > >> Hi Kartik, >> >> Thanks for the input! >> >> Sadly, that's not it - I'm using YARN - the configuration looks >> iden

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-28 Thread Rick Moritz
ell and were running much > faster using submit (which reads conf correctly) or zeppelin for that > matter. > > Thanks, > Kartik > > On Sun, Sep 27, 2015 at 11:45 PM, Rick Moritz wrote: > >> I've finally been able to pick this up again, after upgrading to Spark >>

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-27 Thread Rick Moritz
l generate more shuffled data for the same number of shuffled tuples? An analysis would be much appreciated. Best, Rick On Wed, Aug 19, 2015 at 2:47 PM, Rick Moritz wrote: > oops, forgot to reply-all on this thread. > > -- Forwarded message -- > From: Rick Morit

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
used JDK 7. Or some later repackaging process ran on the > artifacts and used Java 6. I do see "Build-Jdk: 1.6.0_45" in the > manifest, but I don't think 1.4.x can compile with Java 6. > > On Tue, Aug 25, 2015 at 9:59 PM, Rick Moritz wrote: > > A quick question r

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
A quick question regarding this: how come the artifacts (spark-core in particular) on Maven Central are built with JDK 1.6 (according to the manifest), if Java 7 is required? On Aug 21, 2015 5:32 PM, "Sean Owen" wrote: > Spark 1.4 requires Java 7. > > On Fri, Aug 21, 2015, 3:12 PM Chen Song wrot

Fwd: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
oops, forgot to reply-all on this thread. -- Forwarded message -- From: Rick Moritz Date: Wed, Aug 19, 2015 at 2:46 PM Subject: Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell To: Igor Berman Those values are not explicitly set, and attempting to read

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
19 August 2015 at 09:49, Rick Moritz wrote: > >> Dear list, >> >> I am observing a very strange difference in behaviour between a Spark >> 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin >> interpreter (compiled with Java 6 and sourced from

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
any effect on shuffling. On Wed, Aug 19, 2015 at 8:49 AM, Rick Moritz wrote: > Dear list, > > I am observing a very strange difference in behaviour between a Spark > 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin > interpreter (compiled with Java 6 and sou

Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-18 Thread Rick Moritz
spark-submit it using different spark-binaries to further explore the issue. Best Regards, Rick Moritz PS: I already tried to send this mail yesterday, but it never made it onto the list, as far as I can tell -- I apologize should anyone receive this as a second copy.

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-11 Thread Rick Moritz
Consider the spark.cores.max configuration option -- it should do what you require. On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula < aharipriy...@gmail.com> wrote: > Hello all, > > As a quick follow up for this, I have been using Spark on Yarn till now > and am currently exploring Me
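A hedged sketch of how that option is passed; the cap applies cluster-wide in standalone and coarse-grained Mesos modes, and the application name, jar, and value here are hypothetical:

```shell
# Cap the total number of cores this application may acquire.
spark-submit \
  --conf spark.cores.max=16 \
  --class com.example.MyApp \
  my-app.jar
```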

Re: Spark Maven Test error

2015-06-10 Thread Rick Moritz
Dear List, I'm trying to reference a lonely message to this list from March 25th,( http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html ), but I'm unsure this will thread properly. Sorry if that didn't work out. Anyway, using Spark 1.4.0-RC4 I run into the same issu