Re: restarting ranger kms causes spark thrift server to stop

2018-06-24 Thread Rick Moritz
Hi, From what I can tell, that's an error in Ranger, not in Spark, as you can see from the package where the exception is thrown. The Spark Thrift server in this instance is merely trying to call a Hadoop API, which then gets hijacked by Ranger. Your best bet is to look at the case in question, try

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-26 Thread Rick Moritz
Hi, We solved this the ugly way, when parsing external column definitions: private def columnTypeToFieldType(columnType: String): DataType = { columnType match { case "IntegerType" => IntegerType case "StringType" => StringType case "DateType" => DateType case "FloatType" =>
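A completed version of that pattern might look like the sketch below; the set of handled types and the fallback are assumptions, so extend the match to whatever type names your column definitions can contain.

import org.apache.spark.sql.types._

// Minimal sketch: map a string representation back to a Catalyst DataType.
// The handled cases and the IllegalArgumentException fallback are assumptions.
def columnTypeToFieldType(columnType: String): DataType = columnType match {
  case "IntegerType" => IntegerType
  case "StringType"  => StringType
  case "DateType"    => DateType
  case "FloatType"   => FloatType
  case "DoubleType"  => DoubleType
  case "LongType"    => LongType
  case other         => throw new IllegalArgumentException(s"Unknown column type: $other")
}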

Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-11 Thread Rick Moritz
Hi Gerard, hi List, I think what this would entail is for Source.commit to change its functionality. You would need to track all streams' offsets there. Especially in the socket source, you already have a cache (I haven't looked at Kafka's implementation too closely yet), so that shouldn't be the
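For reference, the setup under discussion looks roughly like the sketch below (host, port and the console sinks are placeholders): two queries started from a single socket-source DataFrame, of which only one was observed to receive data.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("socket-two-queries").getOrCreate()

// One socket source, two queries started from it.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder host
  .option("port", "9999")      // placeholder port
  .load()

val rawQuery = lines.writeStream
  .format("console")
  .start()

val countQuery = lines.groupBy("value")
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

spark.streams.awaitAnyTermination()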

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Rick Moritz
Put your jobs into a parallel collection using .par -- then you can submit them very easily to Spark, using .foreach. The jobs will then run using the FIFO scheduler in Spark. The advantages over the prior approaches are that you won't have to deal with Threads, and that you can leave parallelism
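A minimal sketch of that approach, assuming a SparkSession named spark is in scope and using placeholder table names and a placeholder per-table action:

// One Spark job per table, submitted from a parallel collection; the jobs
// are then scheduled concurrently by Spark's FIFO scheduler.
val tables = Seq("db.table_a", "db.table_b", "db.table_c") // placeholder names

tables.par.foreach { tableName =>
  spark.table(tableName)
    .write
    .mode("overwrite")
    .parquet(s"/tmp/export/$tableName") // placeholder per-table action
}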

Re: "Sharing" dataframes...

2017-06-21 Thread Rick Moritz
Keeping it inside the same program/SparkContext is the most performant solution, since you can avoid serialization and deserialization. In-memory persistence between jobs involves a memcopy, uses a lot of RAM, and invokes serialization and deserialization. Technologies that can help you do that
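The in-application variant amounts to caching once and reusing the same reference, roughly as in this sketch (the input path and the follow-up queries are placeholders):

import org.apache.spark.sql.functions.col

// Cache the DataFrame once inside the application and reuse the reference
// for all subsequent jobs -- no cross-process serialization involved.
val shared = spark.read.parquet("/data/events").cache() // placeholder path
shared.count() // materializes the cache

val daily = shared.groupBy("event_date").count()
val top10 = shared.orderBy(col("score").desc).limit(10)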

Re: Spark consumes more memory

2017-05-11 Thread Rick Moritz
I would try to track down the "no space left on device" error - find out where it originates from, since you should be able to allocate 10 executors with 4 cores and 15GB RAM each quite easily. In that case, you may want to increase the memory overhead, so YARN doesn't kill your executors. Check that no local
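For reference, the sizing under discussion corresponds to settings along these lines; the values are illustrative, and the overhead key shown is the pre-2.3, YARN-specific name:

import org.apache.spark.sql.SparkSession

// Illustrative sizing only: 10 executors x 4 cores x 15 GB each, plus extra
// off-heap overhead so that YARN does not kill the containers.
val spark = SparkSession.builder()
  .appName("sized-app") // placeholder name
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "15g")
  .config("spark.yarn.executor.memoryOverhead", "2048") // in MB
  .getOrCreate()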

Re: Create multiple columns in pyspark with one shot

2017-05-04 Thread Rick Moritz
In Scala you can first define your columns, and then use the list-to-vararg expander :_* in a select call, something like this: val cols = colnames.map(col).map(column => { lit(0) }) dF.select(cols: _*) I assume something similar should be possible in Java as well, from your snippet it's
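Spelled out, the Scala version might look like the sketch below; the DataFrame df, the new column names and the lit(0) default are assumptions taken from the snippet.

import org.apache.spark.sql.functions.{col, lit}

val newColumns = Seq("a", "b", "c") // placeholder names for the new columns

// Build the Column objects first, then expand the list into select's varargs,
// keeping the existing columns and adding the new ones in a single select.
val cols = newColumns.map(name => lit(0).as(name))
val result = df.select(df.columns.map(col) ++ cols: _*)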

Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite Mode.APPEND

2017-04-20 Thread Rick Moritz
Hi List, I'm wondering if the following behaviour should be considered a bug, or whether it "works as designed": I'm starting multiple concurrent (FIFO-scheduled) jobs in a single SparkContext, some of which write into the same tables. When these tables already exist, it appears as though both
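A minimal reproduction of the pattern could look like this sketch (the two DataFrames and the table name are placeholders): concurrently scheduled jobs appending to the same table, which does not exist when the jobs start.

import org.apache.spark.sql.SaveMode

// Two concurrent, FIFO-scheduled write jobs appending to the same table.
// When the table does not exist yet, both writers may race to create it.
val parts = Seq(df1, df2) // placeholder DataFrames sharing one schema

parts.par.foreach { df =>
  df.write
    .mode(SaveMode.Append)
    .saveAsTable("db.shared_target") // placeholder table name
}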

Re: Yarn containers getting killed, error 52, multiple joins

2017-04-14 Thread Rick Moritz
Potentially, with joins, you run out of memory on a single executor, because a small skew in your data is being amplified. You could try to increase the default number of partitions, reduce the number of simultaneous tasks in execution (spark.executor.cores), or add a repartitioning operation
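In configuration terms those suggestions translate roughly to the sketch below; the partition count, the join column and the left/right DataFrames are illustrative, and spark.executor.cores itself has to be set at submission time.

import org.apache.spark.sql.functions.col

// Illustrative mitigation for a skew-amplified join: more shuffle partitions
// plus an explicit repartition on the join key before joining.
spark.conf.set("spark.sql.shuffle.partitions", "800")

val joined = left
  .repartition(800, col("join_key"))
  .join(right, "join_key")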

Feasibility limits of joins in SparkSQL (Why does my driver explode with a large number of joins?)

2017-04-11 Thread Rick Moritz
Hi List, I'm currently trying to naively implement a Data-Vault-type Data-Warehouse using SparkSQL, and was wondering whether there's an inherent practical limit to query complexity, beyond which SparkSQL will stop functioning, even for relatively small amounts of data. I'm currently looking at

Re: RE: Fast write datastore...

2017-03-16 Thread Rick Moritz
If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet might also be an option. Of course, management-wise it has much more overhead than using ES, since you need to manually define partitions and buckets, which is suboptimal. On the other hand, for querying, you can probably
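The "manually define partitions and buckets" part corresponds roughly to a write along these lines; the column names, bucket count and table name are assumptions:

// Illustrative Parquet layout with explicit partitioning and bucketing;
// bucketBy requires writing to a metastore-backed table via saveAsTable.
df.write
  .partitionBy("event_date")
  .bucketBy(32, "user_id")
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("fast_lookup_table")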

Re: Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
ars or so". So now it's a matter of finding out why that's the case, how to actually get to the point where these features could work in 2 years, and whether they should work at all. On Tue, Jan 17, 2017 at 6:38 PM, Sean Owen <so...@cloudera.com> wrote: > On Tue, Jan 17, 2017 at 4:49 PM

Middleware-wrappers for Spark

2017-01-17 Thread Rick Moritz
Hi List, I've been following several projects with quite some interest over the past few years, and I've continued to wonder why they're not moving towards being supported by mainstream Spark distributions, and why they aren't mentioned more frequently when it comes to enterprise adoption of Spark.

Re: Found Data Quality check package for Spark

2016-05-07 Thread Rick Moritz
Hi Divya, I haven't actually used the package yet, but maybe you should check out the Gitter room, where the creator is quite active. You can find it at https://gitter.im/FRosner/drunken-data-quality . There you should be able to get the information you need. Best, Rick On 6 May 2016 12:34,

Re: dataframe slow down with tungsten turn on

2015-11-04 Thread Rick Moritz
Something to check (just in case): Are you getting identical results each time? On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote: > Hi sparkers, > > I am using dataframe to do some large ETL jobs. > More precisely, I create dataframe from HIVE table and do some operations. >

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-29 Thread Rick Moritz
lizers are used and may be then do > an analysis. > > Best, > Kartik > > On Mon, Sep 28, 2015 at 11:38 AM, Rick Moritz <rah...@gmail.com> wrote: > >> Hi Kartik, >> >> Thanks for the input! >> >> Sadly, that's not it - I'm using YARN - the c

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-09-28 Thread Rick Moritz
more shuffled data for the same number of shuffled tuples? An analysis would be much appreciated. Best, Rick On Wed, Aug 19, 2015 at 2:47 PM, Rick Moritz <rah...@gmail.com> wrote: > oops, forgot to reply-all on this thread. > > -- Forwarded message -- > From

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
A quick question regarding this: how come the artifacts (spark-core in particular) on Maven Central are built with JDK 1.6 (according to the manifest), if Java 7 is required? On Aug 21, 2015 5:32 PM, Sean Owen so...@cloudera.com wrote: Spark 1.4 requires Java 7. On Fri, Aug 21, 2015, 3:12 PM

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
7. Or some later repackaging process ran on the artifacts and used Java 6. I do see Build-Jdk: 1.6.0_45 in the manifest, but I don't think 1.4.x can compile with Java 6. On Tue, Aug 25, 2015 at 9:59 PM, Rick Moritz rah...@gmail.com wrote: A quick question regarding this: how come the artifacts

Fwd: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
oops, forgot to reply-all on this thread. -- Forwarded message -- From: Rick Moritz rah...@gmail.com Date: Wed, Aug 19, 2015 at 2:46 PM Subject: Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell To: Igor Berman igor.ber...@gmail.com Those values

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
? On 19 August 2015 at 09:49, Rick Moritz rah...@gmail.com wrote: Dear list, I am observing a very strange difference in behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin interpreter (compiled with Java 6 and sourced from maven central

Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
-submit it using different spark-binaries to further explore the issue. Best Regards, Rick Moritz PS: I already tried to send this mail yesterday, but it never made it onto the list, as far as I can tell -- I apologize should anyone receive this as a second copy.

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-08-11 Thread Rick Moritz
Consider the spark.cores.max configuration option -- it should do what you require. On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Hello all, As a quick follow up for this, I have been using Spark on Yarn till now and am currently exploring Mesos
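For reference, a minimal sketch of setting that option; the master URL, application name and cap value are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Cap the total number of cores the application may acquire on the cluster,
// and hence (indirectly) the number of concurrently running tasks.
val conf = new SparkConf()
  .setMaster("mesos://zk://host:2181/mesos") // placeholder master URL
  .setAppName("bounded-app")                 // placeholder name
  .set("spark.cores.max", "32")              // placeholder cap
val sc = new SparkContext(conf)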

Re: Spark Maven Test error

2015-06-10 Thread Rick Moritz
Dear List, I'm trying to reference a lonely message to this list from March 25th (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html), but I'm unsure this will thread properly. Sorry if it didn't work out. Anyway, using Spark 1.4.0-RC4 I run into the same