by col1,col2,col3,col4,col5").cache
>
> df_base.registerTempTable("df_base")
>
> val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group
> by col1, col2")
>
> val df2 = // similar logic
>
> Yong
> --
> *From
Hi,
I have read 5 columns from Parquet into a data frame. My queries on the
Parquet table are of the below type:
val df1 = sqlContext.sql("select col1, col2, count(*) from table group by col1, col2")
val df2 = sqlContext.sql("select col1, col3, count(*) from table group by col1, col3")
val df3 =
Hi,
Does groupByKey have any intelligence associated with it, such that if all the
keys reside in the same partition, it does not do the shuffle?
Or should the user write mapPartitions (Scala groupBy code) instead?
Which would be more efficient, and what are the memory considerations?
Thanks
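(For concreteness, a rough sketch of the two approaches being compared; "pairs" is a made-up RDD[(String, Int)]:)

    // groupByKey shuffles unless the RDD is already partitioned by the same
    // partitioner, in which case Spark can skip the shuffle.
    val grouped = pairs.groupByKey()

    // mapPartitions + Scala groupBy only groups within each partition (no
    // shuffle), but an entire partition's groups must fit in memory at once.
    val groupedLocally = pairs.mapPartitions { iter =>
      iter.toSeq.groupBy(_._1).iterator
    }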
Hi,
An update on the above question: in local[*] mode the code is working fine. The
broadcast size is 200MB, but on YARN the broadcast join is giving an empty
result. The SQL query in the UI does show BroadcastHint, though.
Thanks
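(For comparison, a minimal sketch of how a broadcast hint is usually applied; the table and column names are made up:)

    import org.apache.spark.sql.functions.broadcast

    // Explicitly mark the ~200MB side for broadcast and check the plan.
    val joined = largeDf.join(broadcast(smallDf), Seq("id"))
    joined.explain()   // should show a broadcast join in the physical plan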
On Fri, Dec 30, 2016 at 9:15 PM, titli batali wrote:
Hi,
I would like to know whether Spark has support for projection pushdown
and predicate pushdown in Parquet for nested columns.
I can see two JIRA tasks with PR.
https://issues.apache.org/jira/browse/SPARK-17636
https://issues.apache.org/jira/browse/SPARK-4502
If not, are we seeing these
Hi,
We need to query a deeply nested JSON structure. However, each query is on a single
field at a nested level, such as mean, median, or mode.
I am aware of the SQL explode function:
from pyspark.sql.functions import explode
df = df_nested.withColumn('exploded', explode('top'))
But this is too slow.
Is there any other strategy that could give us
Hi,
I would appreciate some suggestions on how to achieve top-level-struct
treatment for nested JSON when stored in Parquet format, or any other
solutions for best performance using Spark 2.1.
Thanks in advance
On Mon, Jul 24, 2017 at 4:11 PM, Patrick <titlibat...@gmail.com> wrote:
>
Hi ,
I am having the same issue. Has anyone found a solution to this?
When I convert the nested JSON to Parquet, I don't see the projection
working correctly.
It still reads all the nested structure columns, even though Parquet does support
nested column projection.
Does Spark 2 SQL provide the column
Hi,
On reading a complex JSON, Spark infers schema as following:
root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |
To avoid confusion, the query I am referring to above is over some numeric
element inside a struct (nullable = true) field.
On Mon, Jul 24, 2017 at 4:04 PM, Patrick <titlibat...@gmail.com> wrote:
> Hi,
>
> On reading a complex JSON, Spark infers schema as following:
>
> root
Hi,
A lot of the Spark code base is based on the Builder pattern, so I was wondering
what benefits the Builder pattern brings to Spark.
Some things that come to my mind: it is easy on garbage collection and
also gives user-friendly APIs.
Are there any other advantages with code running on
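(For what it's worth, the pattern also shows up in user-facing APIs such as SparkConf; a small illustrative sketch:)

    import org.apache.spark.SparkConf

    // Each setter returns the conf itself, so configuration reads as one
    // fluent chain instead of a series of separate mutations.
    val conf = new SparkConf()
      .setAppName("builder-example")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")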
Hi
I have two lists:
- List one: contains names of columns on which I want to do aggregate
operations.
- List two: contains the aggregate operations which I want to perform
on each column, e.g. (min, max, mean).
I am trying to use the Spark 2.0 Dataset API to achieve this. Spark provides
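(A rough sketch of one way to wire this up with the DataFrame API; the lists, column names and "df" below are made up:)

    import org.apache.spark.sql.functions.expr

    val columns = Seq("price", "quantity")      // list one
    val aggFuncs = Seq("min", "max", "mean")    // list two

    // Build one aggregate expression per (column, function) pair and run them
    // all in a single job.
    val exprs = for (c <- columns; f <- aggFuncs)
      yield expr(s"$f($c)").alias(s"${f}_$c")

    val result = df.agg(exprs.head, exprs.tail: _*)
    result.show()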
Ah, does it work with the Dataset API, or do I need to convert it to an RDD first?
On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler <georg.kf.hei...@gmail.com>
wrote:
> What about the rdd stat counter?
> https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html
>
>
on the
particular column.
I was thinking that if we write some custom code which does this in one
action (job), that would work for me.
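(If going through the RDD is acceptable, a minimal sketch of the StatCounter approach Georg pointed at; the column name is made up and assumed to hold doubles:)

    // Count, mean, stdev, min and max for one numeric column in a single pass.
    val stats = df.select("price").rdd.map(_.getDouble(0)).stats()
    println(s"mean=${stats.mean} stdev=${stats.stdev} min=${stats.min} max=${stats.max}")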
On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler <georg.kf.hei...@gmail.com>
wrote:
> Rdd only
> Patrick <titlibat...@gmail.com> schrieb am Mo. 28.
Hi,
We were getting an OOM error when accumulating the results of each
worker. We were trying to avoid collecting data to the driver node, and instead used
an accumulator as per the below code snippet.
Is there any Spark config to set the accumulator settings, or am I going about it the
wrong way to collect the huge
Hi Spark Users,
I am trying to solve a class imbalance problem. I figured out that Spark
supports setting a weight in its API, but I get an IllegalArgumentException that the
weight column does not exist, though it does exist in the dataset. Any
recommendation on how to go about this problem? I am using the Pipeline API with
- Patrick
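(A minimal sketch of how the weight column is usually wired in; the column names and weights are made up. The thing to double-check is that the DataFrame that actually carries the weight column is the one passed to fit(), since fitting on a frame without it raises exactly this IllegalArgumentException:)

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.functions.{col, when}

    // Derive a per-row weight from the label to counter the imbalance.
    val weighted = df.withColumn("classWeight",
      when(col("label") === 1.0, 10.0).otherwise(1.0))

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setWeightCol("classWeight")

    val model = lr.fit(weighted)   // fit on the frame that has classWeight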
On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko sparhome...@gmail.com wrote:
Hi Patrick,
Thanks for the patch. I tried building a patched version of
spark-core_2.10-0.9.0-incubating.jar but the Maven build fails:
[ERROR]
/home/das/Work/thx/incubator-spark/core/src/main/scala/org
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey requires using many machines. It seems like maybe python
didn't get upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder
on the worker machines. If you see stderr but not stdout
that's a bit of a puzzler since they both go through the same
mechanism.
- Patrick
On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote:
Hi
I have some System.out.println in my Java code that is working ok in a local
environment
Hey Sen,
Suarav is right, and I think all of your print statements are inside of the
driver program rather than inside of a closure. How are you running your
program (i.e. what do you run that starts this job)? Where you run the
driver you should expect to see the output.
- Patrick
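(A tiny sketch of the distinction, assuming a plain SparkContext sc:)

    // This println runs in the driver, so it shows up where you launched the job.
    println("hello from the driver")

    // This one runs inside a closure on the executors; its output ends up in the
    // executors' stdout (the work/ directories or the web UI), not the driver console.
    sc.parallelize(1 to 4).foreach(x => println(s"hello from a task: $x"))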
On Mon, Mar
change so it won't help the ulimit problem.
This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.
- Patrick
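(For example, with a made-up pair RDD called counts, the reducer count can be passed directly; fewer reducers means fewer shuffle files open at once:)

    val reduced = counts.reduceByKey(_ + _, 64)   // 64 reduce tasks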
On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
matthew.c.ch...@gmail.com wrote:
Hi everyone,
My team (cc'ed
itself and override getPreferredLocations.
Keep in mind this is tricky because the set of executors might change
during the lifetime of a Spark job.
- Patrick
On Thu, Mar 13, 2014 at 11:50 AM, David Thomas dt5434...@gmail.com wrote:
Is it possible to partition the RDD elements in a round robin
This is not released yet but we're planning to cut a 0.9.1 release
very soon (most likely this week). In the meantime you'll have to
check out branch-0.9 of Spark and publish it locally, then depend on the
snapshot version. Or just wait it out...
On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu
... but that's not quite
released yet :)
- Patrick
On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote:
i currently typically do something like this:
scala> val rdd = sc.parallelize(1 to 10)
scala> import com.twitter.algebird.Operators._
scala> import com.twitter.algebird.{Max, Min
if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.
- Patrick
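(A made-up example of that case: after such a selective filter, shrink the partition count and then check it:)

    // A year of data reduced to one day leaves most partitions nearly empty,
    // so coalesce them down before doing further work.
    val oneDay = yearRdd.filter(line => line.startsWith("2014-03-23"))
    val compact = oneDay.coalesce(16)
    println(compact.partitions.size)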
On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote:
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a No space left on device error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
There is no direct way to get this in pyspark, but you can get it from the
underlying java
Starting with Spark 0.9 the protobuf dependency we use is shaded and
cannot interfere with other protobuf libraries including those in
Hadoop. Not sure what's going on in this case. Would someone who is
having this problem post exactly how they are building spark?
- Patrick
On Fri, Mar 21, 2014
I'm not sure exactly how your cluster is configured. But as far as I can
tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find
the exact CDH version you have and link against the `mr1` version of their
published dependencies in that version.
So I think you want
to
the respective cassandra columns. I think all of this would be fairly easy
to implement on SchemaRDD and likely will make it into Spark 1.1
- Patrick
On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote:
Great work guys! Have been looking forward to this . . .
In the blog it mentions
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark
applications can persist their state so that the UI can be reloaded after
they have completed.
- Patrick
On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote:
Is there a way to see 'Application
Also in NYC, definitely interested in a spark meetup!
Sent from my iPhone
On Mar 31, 2014, at 3:07 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Happy to help with an NYC meet up (just emailed Andy). I recently moved to
VA, but am back in NYC quite often, and have been turning several
dependencies including the exact Spark version and other libraries.
- Patrick
On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote:
I'm using ScalaBuff (which depends on protobuf 2.5) and facing the same
issue. Any word on this one?
On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal
Do you get the same problem if you build with maven?
On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly
That's all I do.
On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote:
Vidal - could you show
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies:
No assembly descriptors found. - [Help 1]
upon running
mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly
On Apr 1, 2014, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote:
Do you get the same
For textFile I believe we overload it and let you set a codec directly:
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
For saveAsSequenceFile yep, I think Mark is right, you need an option.
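(A small sketch of the codec overload in question; the output path is made up:)

    import org.apache.hadoop.io.compress.GzipCodec

    // The optional second argument compresses the saved text output.
    rdd.saveAsTextFile("/tmp/words-gzipped", classOf[GzipCodec])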
On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
The driver stores the meta-data associated with the partition, but the
re-computation will occur on an executor. So if several partitions are
lost, e.g. due to a few machines failing, the re-computation can be striped
across the cluster making it fast.
On Wed, Apr 2, 2014 at 11:27 AM, David
of functionality and something we might, e.g.
want to change the API of over time.
- Patrick
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote:
What I'd like is a way to capture the information provided on the stages
page (i.e. cluster:4040/stages via IndexPage). Looking
and on jobs that crunch hundreds of
terabytes (uncompressed) of data.
- Patrick
On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote:
Spark community,
What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes
in
the community has feedback from trying this.
- Patrick
On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.com wrote:
Hi Christophe,
Thanks for your reply and the spec file. I have solved my issue for now.
I didn't want to rely on building Spark using the spec file (%build
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
I am running the latest version of PySpark branch-0.9 and having some
trouble with join.
One RDD is about 100G (25GB compressed and serialized in memory) with
130K records, the other RDD is about 10G (2.5G
:
Hey Patrick,
I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to
track this request, in case the team/community wants to implement it in the
future.
Nick
On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
No use case at the moment
Pierre - I'm not sure that would work. I just opened a Spark shell and did
this:
scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25
It looks like this is the JVM version.
- Patrick
On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans
pierre.borckm
I've actually done it using PySpark and python libraries which call cuda code,
though I've never done it from scala directly. The only major challenge I've
hit is assigning tasks to gpus on multiple gpu machines.
Sent from my iPhone
On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa
To reiterate what Tom was saying - the code that runs inside of Spark on
YARN is exactly the same code that runs in any deployment mode. There
shouldn't be any performance difference once your application starts
(assuming you are comparing apples-to-apples in terms of hardware).
The differences
I put some notes in this doc:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan
sinchronized.a...@gmail.com wrote:
I would like to run some of the tests selectively. I am in branch-1.0
Tried the following two
For a HadoopRDD, first the spark scheduler calculates the number of tasks
based on input splits. Usually people use this with HDFS data so in that
case it's based on HDFS blocks. If the HDFS datanodes are co-located with
the Spark cluster then it will try to run the tasks on the data node that
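(A quick way to see the split-based partitioning in practice; the path is made up:)

    // One partition (and hence one task) per HDFS block by default; the optional
    // second argument only asks for at least that many splits, it can't lower it.
    val logs = sc.textFile("hdfs:///data/logs", 200)
    println(logs.partitions.size)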
Try running sbt/sbt clean and re-compiling. Any luck?
On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote:
An exception occurs when compiling Spark 0.9.1 using sbt; env: Hadoop 2.3
1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
2.found Exception:
the error first before the reader knows what is
going on.
Anyways maybe if you have a simpler solution you could sketch it out in the
JIRA and we could talk over there. The current proposal in the JIRA is
somewhat complicated...
- Patrick
On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl
What about if you run ./bin/spark-shell
--driver-class-path=/path/to/your/jar.jar
I think either this or the --jars flag should work, but it's possible there
is a bug with the --jars flag when calling the Repl.
On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.com wrote:
A
You can also accomplish this by just having a separate service that submits
multiple jobs to a cluster where those jobs e.g. use different jars.
- Patrick
On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash and...@andrewash.com wrote:
For the second question, you can submit multiple jobs through
Is this the serialization throughput per task or the serialization
throughput for all the tasks?
On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote:
Hi
I am running a WordCount program which count words from HDFS, and I
noticed that the serializer part of code
This class was made to be java friendly so that we wouldn't have to
use two versions. The class itself is simple. But I agree adding java
setters would be nice.
On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote:
There is a JavaSparkContext, but no JavaSparkConf object. I
You are right, once you sort() the RDD, then yes it has a well defined ordering.
But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote:
Hi Patrick,
I'm a little confused
This is a consequence of the way the Hadoop files API works. However,
you can (fairly easily) add code to just rename the file because it
will always produce the same filename.
(heavy use of pseudo code)
val dir = "/some/dir"
rdd.coalesce(1).saveAsTextFile(dir)
val f = new File(dir + "/part-00000")  // then rename f to whatever you want
with many
partitions, since often there are bottlenecks at the granularity of a
file.
Is there a reason you need this to be exactly one file?
- Patrick
On Sat, May 3, 2014 at 4:14 PM, Chris Fregly ch...@fregly.com wrote:
not sure if this directly addresses your issue, peter, but it's worth
mentioning
your spark-ec2.py script to check out spark-ec2 from a forked version.
- Patrick
On Thu, May 1, 2014 at 2:14 PM, Ian Ferreira ianferre...@hotmail.com wrote:
Is this possible? It is very annoying to have such a great script but still
have to manually update stuff afterwards.
Broadcast variables need to fit entirely in memory - so that's a
pretty good litmus test for whether or not to broadcast a smaller
dataset or turn it into an RDD.
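(A made-up example of that litmus test:)

    // A small lookup table that comfortably fits in memory: broadcast it once
    // and read it from every task instead of shipping it inside each closure.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2, "c" -> 3))
    val resolved = rdd.map(code => lookup.value.getOrElse(code, 0))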
On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma scrapco...@gmail.com wrote:
I'd like to be corrected on this, but I am just trying
Hey Jeremy,
This is actually a big problem - thanks for reporting it, I'm going to
revert this change until we can make sure it is backwards compatible.
- Patrick
On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi all,
A heads up in case others hit
PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Jeremy,
This is actually a big problem - thanks for reporting it, I'm going to
revert this change until we can make sure it is backwards compatible.
- Patrick
On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote
)
Is this the best way to go?
Best regards,
Patrick
to be almost identical to the final release.
- Patrick
On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote:
Can anyone comment on the anticipated date or worse case timeframe for when
Spark 1.0.0 will be released?
Hello,
I'm trying to write a python function that does something like:
def foo(line):
    try:
        return stuff(line)
    except Exception:
        raise MoreInformativeException(line)
and then use it in a map like so:
rdd.map(foo)
and have my MoreInformativeException make it back if/when
)
- Patrick
On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote:
i have some settings that i think are relevant for my application. they are
spark.akka settings so i assume they are relevant for both executors and my
driver program.
i used to do:
SPARK_JAVA_OPTS
Note that since release artifacts were posted recently, certain
mirrors may not have working downloads for a few hours.
- Patrick
, Patrick Wendell pwend...@gmail.com wrote:
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
is a milestone release as the first in the 1.0 line of releases,
providing API stability for Spark's core interfaces.
Spark 1.0.0 is Spark's
to make them
compatible with 2.6 we should do that.
For r3.large, we can add that to the script. It's a newer type. Any
interest in contributing this?
- Patrick
On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:
Hi there! I'm relatively new to the list, so sorry
Can you look at the logs from the executor or in the UI? They should
give an exception with the reason for the task failure. Also in the
future, for this type of e-mail please only e-mail the user@ list
and not both lists.
- Patrick
On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma
.
- Patrick
On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim
taeyun@innowireless.co.kr wrote:
Hi,
How can I dispose an Accumulator?
It has no method like 'unpersist()' which Broadcast provides.
Thanks.
Currently, an executor is always run in its own JVM, so it should be
possible to just use some static initialization to e.g. launch a
sub-process and set up a bridge with which to communicate.
This would be a fairly advanced use case, however.
- Patrick
On Thu, May 29, 2014 at 8:39 PM
the change.
- Patrick
1) Is there a guarantee that a partition will only be processed on a node
which is in the getPreferredLocations set of nodes returned by the RDD ?
No there isn't; by default Spark may schedule in a non-preferred
location after `spark.locality.wait` has expired.
this (this is pseudo-code):
files = fs.listStatus("s3n://bucket/stuff/*.gz")
files = files.filter(/* not the bad file */)
fileStr = files.map(f => f.getPath.toString).mkString(",")
sc.textFile(fileStr)...
- Patrick
On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
YES, your
Hey just to clarify this - my understanding is that the poster
(Jeremey) was using a custom AMI to *launch* spark-ec2. I normally
launch spark-ec2 from my laptop. And he was looking for an AMI that
had a high enough version of python.
Spark-ec2 itself has a flag -a that allows you to give a
One potential issue here is that mesos is using classifiers now to
publish their jars. It might be that sbt-pack has trouble with
dependencies that are published using classifiers. I'm pretty sure
mesos is the only dependency in Spark that is using classifiers, so
that's why I mention it.
On Sun,
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350
On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote:
One potential issue here is that mesos is using classifiers now to
publish their jars. It might be that sbt-pack has trouble with
dependencies
..
-Simon
On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote:
I would agree with your guess, it looks like the yarn library isn't
correctly finding your yarn-site.xml file. If you look in
yarn-site.xml, do you definitely have the resource manager
address/addresses?
Also, you
.
-Simon
On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com
wrote:
As a debugging step, does it work if you use a single resource manager
with the key yarn.resourcemanager.address instead of using two named
resource managers? I wonder if somehow the YARN client can't
/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1
However, it would be very easy to add an option that allows preserving
the old behavior. Is anyone here interested in contributing that? I
created a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-1993
- Patrick
On Mon, Jun 2
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think
I accidentally assigned myself way back when I created it). This
should be an easy fix.
On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, Patrick,
I think https://issues.apache.org/jira/browse/SPARK
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
Zip format and Java 7 uses Zip64. I think we've tried to add some
build warnings if Java 7 is used, for this reason:
https://github.com/apache/spark/blob/master/make-distribution.sh#L102
Any luck if you use JDK 6 to compile?
data by mistake if they don't understand
the exact semantics.
2. It would introduce a third set of semantics here for saveAsXX...
3. It's trivial for users to implement this with two lines of code (if
output dir exists, delete it) before calling saveAsHadoopFile.
- Patrick
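(Roughly what those two lines look like; the path is made up and saveAsTextFile stands in for any of the saveAs* calls:)

    import org.apache.hadoop.fs.Path

    val out = new Path("/tmp/output")
    val fs = out.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(out)) fs.delete(out, true)   // recursive delete, then save as usual
    rdd.saveAsTextFile(out.toString)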
On Mon, Jun 2, 2014 at 2
/clobber an existing destination directory if it
exists, then fully overwrite it with new data.
I'm fine to add a flag that allows (B) for backwards-compatibility
reasons, but my point was I'd prefer not to have (C) even though I see
some cases where it would be useful.
- Patrick
On Mon, Jun 2
. The standard installation guide didn't say
anything about java 7 and suggested to do -DskipTests for the build..
http://spark.apache.org/docs/latest/building-with-maven.html
So, I didn't see the warning message...
On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote
, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com
wrote:
(B) Semantics in Spark 1.0 and earlier:
Do you mean 1.0 and later?
Option (B) with the exception-on-clobber sounds fine to me, btw. My use
pattern is probably common but not universal, and deleting user files is
indeed
You can set an arbitrary properties file by adding --properties-file
argument to spark-submit. It would be nice to have spark-submit also
look in SPARK_CONF_DIR as well by default. If you opened a JIRA for
that I'm sure someone would pick it up.
On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi
Hey, thanks a lot for reporting this. Do you mind making a JIRA with
the details so we can track it?
- Patrick
On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka
marek.wiewio...@gmail.com wrote:
Exactly the same story - it used to work with 0.9.1 and does not work
anymore with 1.0.0.
I ran tests
Hey There,
This is only possible in Scala right now. However, this is almost
never needed since the core API is fairly flexible. I have the same
question as Andrew... what are you trying to do with your RDD?
- Patrick
On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote:
Just
Hey Chirag,
Those init scripts are part of the Cloudera Spark package (they are
not in the Spark project itself) so you might try e-mailing their
support lists directly.
- Patrick
On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote:
I recently spun up an AWS cluster
):
https://github.com/pwendell/kafka-spark-example
You'll want to make an uber jar that includes these packages (run sbt
assembly) and then submit that jar to spark-submit. Also, I'd try
running it locally first (if you aren't already) just to make the
debugging simpler.
- Patrick
On Wed, Jun 4, 2014
If that's still an issue, one thing to try is just changing the name
of the cluster. We create groups that are identified with the cluster
name, and there might be something that just got screwed up with the
original group creation and AWS isn't happy.
- Patrick
On Wed, Jun 4, 2014 at 12:55 PM, Sam
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell.
On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov
oleg.proudni...@gmail.com wrote:
Thank you, Hassan!
On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote:
just use -Dspark.executor.memory=
it work. I think it's being tracked by this JIRA:
https://issues.apache.org/jira/browse/HIVE-5733
- Patrick
On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Is there a repo somewhere with the code for the Hive dependencies
(hive-exec, hive-serde, hive-metastore
are not in the jar
because they go beyond the extended zip boundary, `jar tvf` won't list
them.
- Patrick
On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote:
Moving over to the dev list, as this isn't a user-scope issue.
I just ran into this issue with the missing saveAsTextFile
Also I should add - thanks for taking time to help narrow this down!
On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote:
Paul,
Could you give the version of Java that you are building with and the
version of Java you are running with? Are they the same?
Just off
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA:
https://issues.apache.org/jira/browse/SPARK-2075
On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote:
Hi, Patrick --
Java 7 on the development machines:
» java -version
java version "1.7.0_51"
If you run locally then Spark doesn't launch remote executors. However,
in this case you can set the memory with the --driver-memory flag to
spark-submit. Does that work?
- Patrick
On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui cuihengg...@gmail.com wrote:
Hi,
I'm trying to run the SimpleApp
Hey Jeremy,
This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
make a 1.0.1 release soon (this patch being one of the main reasons),
but if you are itching for this sooner, you can just checkout the head
of branch-1.0 and you will be able to use r3.XXX instances.
- Patrick
By the way, in case it's not clear, I mean our maintenance branches:
https://github.com/apache/spark/tree/branch-1.0
On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey Jeremy,
This is patched in the 1.0 and 0.9 branches of Spark. We're likely to
make a 1.0.1
will be present in the 1.0 branch of Spark.
- Patrick
On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
I am about to spin up some new clusters, so I may give that a go... any
special instructions for making them work? I assume I use the
--spark-git-repo= option