Hi,
From what I can tell, that's an error in Ranger, not in Spark, as you can
see from the package where the exception is thrown.
Spark Thrift server in this instance is merely trying to call a Hadoop API,
which then gets hijacked by Ranger.
Your best bet is to look at the case in question, try to
Hi,
We solved this the ugly way, when parsing external column definitions:
private def columnTypeToFieldType(columnType: String): DataType = {
  columnType match {
    case "IntegerType" => IntegerType
    case "StringType"  => StringType
    case "DateType"    => DateType
    case "FloatType"   => FloatType
    // ... further cases elided in the original message
    case other => throw new IllegalArgumentException(s"Unsupported column type: $other")
  }
}
Hi Gerard, hi List,
I think what this would entail is for Source.commit to change its
functionality. You would need to track all streams' offsets there.
Especially in the socket source, you already have a cache (I haven't looked
at Kafka's implementation too closely yet), so that shouldn't be the is
Put your jobs into a parallel collection using .par -- then you can submit
them very easily to Spark, using .foreach. The jobs will then run using the
FIFO scheduler in Spark.
The advantage over the prior approaches is that you won't have to deal
with threads, and that you can leave parallelism
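A minimal sketch of the idea in plain Scala (no Spark dependency shown; `jobIds` and `runJob` are hypothetical stand-ins for your actual job list and a Spark action such as a count or a save). Note that on Scala 2.13+ the parallel collections live in the separate scala-parallel-collections module:

```scala
// Hypothetical list of job identifiers; each element drives one Spark job.
val jobIds = Seq(1, 2, 3, 4)

// Stand-in for a real Spark action, e.g. df.count() or df.write.save(...).
def runJob(id: Int): String = s"job-$id done"

// .par turns the Seq into a parallel collection; map/foreach then execute
// concurrently on a fork-join pool, so several Spark jobs are submitted at
// once and the FIFO scheduler interleaves them.
val results = jobIds.par.map(runJob).toList
```

Parallel collections preserve element order for `map`, so `results` comes back in the same order as `jobIds` even though the closures ran concurrently.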
Keeping it inside the same program/SparkContext is the most performant
solution, since you can avoid serialization and deserialization.
In-memory persistence between jobs involves a memory copy, uses a lot of
RAM, and invokes serialization and deserialization. Technologies that can help
you do that easi
I would try to track down the "no space left on device" - find out where
that originates from, since you should be able to allocate 10 executors
with 4 cores and 15GB RAM each quite easily. In that case, you may want to
increase the memory overhead, so YARN doesn't kill your executors.
Check that no local driv
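For the overhead setting mentioned above, a hedged example (the value is a placeholder; this property name applies to Spark on YARN up to 2.2, after which it was renamed spark.executor.memoryOverhead):

```properties
# spark-defaults.conf -- adjust the value to your container sizes
spark.yarn.executor.memoryOverhead   2048
```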
In Scala you can first define your columns, and then use the
list-to-vararg-expander :_* in a select call, something like this:
val cols = colnames.map(col).map(column => {
  lit(0)  // placeholder: substitute the expression you want per column
})
dF.select(cols: _*)
I assume something similar should be possible in Java as well, from
your snippet it's unc
Hi List,
I'm wondering if the following behaviour should be considered a bug, or
whether it "works as designed":
I'm starting multiple concurrent (FIFO-scheduled) jobs in a single
SparkContext, some of which write into the same tables.
When these tables already exist, it appears as though both jo
Potentially, with joins, you run out of memory on a single executor,
because a small skew in your data is being amplified. You could try to
increase the default number of partitions, reduce the number of
simultaneous tasks in execution (spark.executor.cores), or add a
repartitioning operation before/
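A hedged illustration of the first two knobs at submit time (the values and the jar name are placeholders; spark.sql.shuffle.partitions defaults to 200):

```shell
spark-submit \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.cores=2 \
  your-app.jar
```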
Hi List,
I'm currently trying to naively implement a Data-Vault-type Data-Warehouse
using SparkSQL, and was wondering whether there's an inherent practical
limit to query complexity, beyond which SparkSQL will stop functioning,
even for relatively small amounts of data.
I'm currently looking at a
If you have enough RAM/SSDs available, maybe tiered HDFS storage and
Parquet might also be an option. Of course, management-wise it has much
more overhead than using ES, since you need to manually define partitions
and buckets, which is suboptimal. On the other hand, for querying, you can
probably
n around two years or so".
So the task now is to find out why that's the case, how to actually get to
the point where these features could work in two years, and whether they
should work at all
On Tue, Jan 17, 2017 at 6:38 PM, Sean Owen wrote:
> On Tue, Jan 17, 2017 at 4:49 PM Rick
Hi List,
I've been following several projects with quite some interest over the past
few years, and I've continued to wonder why they're not moving towards a
degree of being supported by mainstream Spark-distributions, and more
frequently mentioned when it comes to enterprise adoption of Spark.
Hi Divya,
I haven't actually used the package yet, but maybe you should check out the
Gitter room, where the creator is quite active. You can find it at
https://gitter.im/FRosner/drunken-data-quality .
There you should be able to get the information you need.
Best,
Rick
On 6 May 2016 12:34, "Div
Something to check (just in case):
Are you getting identical results each time?
On Wed, Nov 4, 2015 at 8:54 AM, gen tang wrote:
> Hi sparkers,
>
> I am using dataframe to do some large ETL jobs.
> More precisely, I create dataframe from HIVE table and do some operations.
> And then I save it as
and may be then do
> an analysis.
>
> Best,
> Kartik
>
> On Mon, Sep 28, 2015 at 11:38 AM, Rick Moritz wrote:
>
>> Hi Kartik,
>>
>> Thanks for the input!
>>
>> Sadly, that's not it - I'm using YARN - the configuration looks
>> iden
ell and were running much
> faster using submit (which reads conf correctly) or zeppelin for that
> matter.
>
> Thanks,
> Kartik
>
> On Sun, Sep 27, 2015 at 11:45 PM, Rick Moritz wrote:
>
>> I've finally been able to pick this up again, after upgrading to Spark
>>
l generate more shuffled data for the same number of shuffled
tuples?
An analysis would be much appreciated.
Best,
Rick
On Wed, Aug 19, 2015 at 2:47 PM, Rick Moritz wrote:
> oops, forgot to reply-all on this thread.
>
> -- Forwarded message --
> From: Rick Morit
used JDK 7. Or some later repackaging process ran on the
> artifacts and used Java 6. I do see "Build-Jdk: 1.6.0_45" in the
> manifest, but I don't think 1.4.x can compile with Java 6.
>
> On Tue, Aug 25, 2015 at 9:59 PM, Rick Moritz wrote:
> > A quick question r
A quick question regarding this: how come the artifacts (spark-core in
particular) on Maven Central are built with JDK 1.6 (according to the
manifest), if Java 7 is required?
On Aug 21, 2015 5:32 PM, "Sean Owen" wrote:
> Spark 1.4 requires Java 7.
>
> On Fri, Aug 21, 2015, 3:12 PM Chen Song wrot
oops, forgot to reply-all on this thread.
-- Forwarded message --
From: Rick Moritz
Date: Wed, Aug 19, 2015 at 2:46 PM
Subject: Re: Strange shuffle behaviour difference between Zeppelin and
Spark-shell
To: Igor Berman
Those values are not explicitly set, and attempting to read
19 August 2015 at 09:49, Rick Moritz wrote:
>
>> Dear list,
>>
>> I am observing a very strange difference in behaviour between a Spark
>> 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin
>> interpreter (compiled with Java 6 and sourced from
any effect on shuffling.
On Wed, Aug 19, 2015 at 8:49 AM, Rick Moritz wrote:
> Dear list,
>
> I am observing a very strange difference in behaviour between a Spark
> 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin
> interpreter (compiled with Java 6 and sou
spark-submit it
using different spark-binaries to further explore the issue.
Best Regards,
Rick Moritz
PS: I already tried to send this mail yesterday, but it never made it onto
the list, as far as I can tell -- I apologize should anyone receive this as
a second copy.
Consider the spark.cores.max configuration option -- it should do what you
require.
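A hedged example of where to set it (this option caps total cores in standalone and coarse-grained Mesos mode; the value is a placeholder):

```properties
# spark-defaults.conf
spark.cores.max   16
# or at submit time: spark-submit --conf spark.cores.max=16 ...
```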
On Tue, Aug 11, 2015 at 8:26 AM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:
> Hello all,
>
> As a quick follow up for this, I have been using Spark on Yarn till now
> and am currently exploring Me
Dear List,
I'm trying to reference a lonely message to this list from March 25th,(
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html
), but I'm unsure this will thread properly. Sorry if it didn't work out.
Anyway, using Spark 1.4.0-RC4 I run into the same issu