I will. That would be great if a simple UDF could return a complex type.
Thanks!
On Fri, Jun 5, 2015 at 12:17 AM, Cheng, Hao hao.ch...@intel.com wrote:
Confirmed: with the latest master, we don't support complex data types for Simple
Hive UDFs. Do you mind filing an issue in JIRA?
-Original
I figured it out. It needed a block-style map and a check for null. The case
class is just to name the transformed columns.
case class Component(name: String, loadTimeMs: Long)
avroFile.filter($"lazyComponents.components".isNotNull)
.explode($"lazyComponents.components")
{ case
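For anyone finding this later, here is a self-contained sketch of the block-style explode described above. The nested schema and the field order inside the struct are assumptions on my part, not taken from the thread; `avroFile` is the DataFrame being discussed.
```
import org.apache.spark.sql.Row
import sqlContext.implicits._   // brings the $"..." column syntax into scope

case class Component(name: String, loadTimeMs: Long)

// the struct field order (name, loadTimeMs) is assumed
val exploded = avroFile
  .filter($"lazyComponents.components".isNotNull)
  .explode($"lazyComponents.components") {
    case Row(components: Seq[Row] @unchecked) =>
      components.map(c => Component(c.getString(0), c.getLong(1)))
  }
```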
Please file a bug here: https://issues.apache.org/jira/browse/SPARK/
Could you also provide a way to reproduce this bug (including some datasets)?
On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote:
I've changed the SIFT feature extraction to SURF feature extraction and
Hi, I am using Spark 1.3.1 and I am trying to submit a Spark job to the REST
URL of Spark (spark://host-name:6066), using a standalone Spark cluster
with deploy mode set to cluster. Both the driver and the application get
submitted, but after doing their work (output created) both end up with KILLED
status.
It is Spark 1.3.1.e (it is an AWS release; I think it is close to Spark
1.3.1 with some bug fixes).
My report about GenericUDF not working in SparkSQL is wrong. I tested
with an open-source GenericUDF and it worked fine. Only my GenericUDF,
which returns a Map type, didn't work. Sorry about false
You could try adding the configuration in the spark-defaults.conf file. Once
you run the application you can check the Environment tab of the driver UI (runs
on port 4040) to see whether the configuration is set properly.
Thanks
Best Regards
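For example, a couple of hypothetical entries in conf/spark-defaults.conf would look like this; after you submit, they should show up under the Environment tab of the driver UI on port 4040:
```
# hypothetical example entries in conf/spark-defaults.conf
spark.executor.memory        4g
spark.serializer             org.apache.spark.serializer.KryoSerializer
```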
On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel
I have a stage that spawns 174 tasks when I run repartition on Avro data.
Tasks read 512/317/316/214/173 MB of data. Even if I increase the
number of executors or the number of partitions (when calling repartition), the
number of tasks launched remains fixed at 174.
1) I want to speed up this
I've changed the SIFT feature extraction to SURF feature extraction and it
works...
The following line was changed:
sift = cv2.xfeatures2d.SIFT_create()
to
sift = cv2.xfeatures2d.SURF_create()
Where should I file this as a bug? When not running on Spark it works fine,
so I'm saying it's a Spark
Confirmed: with the latest master, we don't support complex data types for Simple
Hive UDFs. Do you mind filing an issue in JIRA?
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Friday, June 5, 2015 12:35 PM
To: ogoh; user@spark.apache.org
Subject: RE: SparkSQL : using
Hi,
I had this problem before, and in my case it was because the
executor/container was killed by YARN when it used more memory than
allocated. You can check whether your case is the same by checking the YARN node
manager log.
Best,
Patcharee
On 5 June 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
I see this
Hi list!
My problem is quite simple.
I need to access several S3 buckets, using different credentials:
```
val c1 =
sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket/file1.csv").count
val c2 =
sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket/file1.csv").count
val c3 =
Hi list!
My problem is quite simple.
I need to access several S3 buckets, using different credentials:
```
val c1 =
sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count
val c2 =
sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket2/file.csv").count
val c3 =
Yea should have emphasized that. I'm running the same code on the same VM.
It's a VM with spark in standalone mode and I run the unit test directly on
that same VM. So OpenCV is working correctly on that same machine but when
moving the exact same OpenCV code to spark it just crashes.
On Tue, Jun
We currently have data in avro format and we do joins between avro and
sequence file data.
Will storing these datasets in Parquet make joins any faster?
The dataset sizes are between 500 and 1000 GB.
--
Deepak
Hi,
I am looking for some articles/blogs about how Spark handles
various failures, such as Driver, Worker, Executor, Task, etc.
Are there some articles/blogs on this topic? Details down to the source code would be
best.
Thanks very much!
bit1...@163.com
Did you change the value of 'spark.default.parallelism'? Try a
bigger number.
2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com:
It may be that your system runs out of resources (ie 174 is the ceiling)
due to the following
1. RDD Partition = (Spark) Task
2.
Would Tachyon be appropriate here?
On Friday, June 5, 2015, Evo Eftimov evo.efti...@isecc.com wrote:
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark
Batch Jobs (besides anyone can put something like that in 5 min), while I
am under the impression that Dmytiy is
Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
batch jobs (besides, anyone can put something like that together in 5 minutes), while I am
under the impression that Dmytiy is working on a Spark Streaming app.
Besides, the Job Server is essentially for sharing the Spark Context.
Another option is to merge the part files after your app ends.
On 5 Jun 2015 20:37, Akhil Das ak...@sigmoidanalytics.com wrote:
you can simply do rdd.repartition(1).saveAsTextFile(...), it might not be
efficient if your output data is huge since one task will be doing the
whole writing.
Thanks
Best
I did not change spark.default.parallelism.
What is the recommended value for it?
On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote:
Did you change the value of 'spark.default.parallelism'? Try a
bigger number.
2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com:
It
On 5 Jun 2015, at 08:03, Pierre B pierre.borckm...@realimpactanalytics.com
wrote:
Hi list!
My problem is quite simple.
I need to access several S3 buckets, using different credentials:
```
val c1 =
sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count
val c2
On 2 Jun 2015, at 00:14, Dean Wampler
deanwamp...@gmail.commailto:deanwamp...@gmail.com wrote:
It would be nice to see the code for MapR FS Java API, but my google foo failed
me (assuming it's open source)...
I know that MapRFS is closed source, don't know about the java JAR. Why not ask
Just multiply 2-4 by the number of CPU cores of the node.
2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
I did not change spark.default.parallelism.
What is the recommended value for it?
On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote:
Did you have a change of the
Hi all
I'm running Spark on a single local machine, no Hadoop, just reading and
writing on the local disk.
I need to have a single file as the output of my calculation.
If I do rdd.saveAsTextFile(...) everything runs OK but I get a lot of files.
Since I need a single file I was considering doing something
Just repartition to 1 partition before writing.
On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote:
Hi all
I'm running spark in a single local machine, no hadoop, just reading and
writing in local disk.
I need to have a single file as output of my calculation.
if I do
you can simply do rdd.repartition(1).saveAsTextFile(...), it might not be
efficient if your output data is huge since one task will be doing the
whole writing.
Thanks
Best Regards
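A minimal sketch of that (paths are hypothetical, and an existing SparkContext `sc` is assumed). coalesce(1) is worth considering too, since it avoids the full shuffle that repartition(1) triggers, although either way a single task writes the whole output:
```
val rdd = sc.textFile("/tmp/input")
rdd.coalesce(1).saveAsTextFile("/tmp/single-file-output")
```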
On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote:
Hi all
I'm running spark in a single local
On 4 Jun 2015, at 15:59, Chao Chen kandy...@gmail.com wrote:
But when I try to run the PageRank from HiBench, it always causes a node to
reboot during the middle of the work for all of the Scala, Java, and Python
versions. It works fine
with the MapReduce version from the same benchmark.
do
That project is for reading in data from Redshift table exports stored in S3, produced by
running commands in Redshift like this:
unload ('select * from venue')
to 's3://mybucket/tickit/unload/'
http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
The path in the parameters below is
And RDD.lookup() cannot be invoked from transformations, e.g. maps.
lookup() is an action which can be invoked only from the driver – if you want
functionality like that from within transformations executed on the cluster
nodes, try IndexedRDD.
Other options are to load a batch / static RDD
Hi,
I am using cassandraDB in my project. I had that error *Exception in thread
main java.io.IOException: Failed to open native connection to Cassandra
at {127.0.1.1}:9042*
I think I have to modify the submit line. What should I add or remove when
I submit my project?
Best,
yasemin
--
hiç
Have you tried putting this file on the local disk of each of the executor
nodes? That worked for me.
On 05.06.2015 16:56, nib...@free.fr wrote:
Hello,
I want to override the log4j configuration when I start my Spark job.
I tried:
.../bin/spark-submit --class --conf
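One commonly suggested way to do this is to ship a log4j.properties with the job and point the JVMs at it; the class name, jar name, and file paths below are hypothetical, and executor-side path handling can differ per cluster manager:
```
bin/spark-submit \
  --class com.example.MyJob \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my-job.jar
```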
Would the IndexedRDD feature provide what the Lookup RDD does?
I've been using a broadcast-variable map for a similar kind of thing -- it is
probably within 1GB, but I'm interested to know whether the lookup (or indexed)
RDD might be better.
C
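A minimal sketch of the broadcast-map approach mentioned above (the data is hypothetical, and an existing SparkContext `sc` is assumed):
```
val lookup: Map[String, Long] = Map("a" -> 1L, "b" -> 2L)
val bcLookup = sc.broadcast(lookup)           // shipped once per executor, read-only

val enriched = sc.parallelize(Seq("a", "b", "c"))
  .map(key => (key, bcLookup.value.getOrElse(key, -1L)))
```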
On Friday, June 5, 2015, Dmitry Goldenberg dgoldenberg...@gmail.com
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like
Thanks Davies. I will file a bug later with code and single image as
dataset. Next to that I can give anybody access to my vagrant VM that
already has spark with OpenCV and the dataset available.
Or you can set up the same Vagrant machine at your place. Everything is automated ^^
git clone
Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I
can't seem to locate it exactly on Github. (Yes, to your point, our project
is Spark streaming based). Thank you.
On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Oops, @Yiannis, sorry to be a
Hi All,
I want to read and write data to AWS Redshift. I found the spark-redshift
project at the following address:
https://github.com/databricks/spark-redshift
In its documentation the following code is given:
import com.databricks.spark.redshift.RedshiftInputFormat
val records =
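The snippet above is cut off; as a hedged guess at how that input format is wired up via the generic Hadoop-file API (the path and the key/value classes are my assumptions, not copied from the project's docs):
```
val records = sc.newAPIHadoopFile(
  "s3n://mybucket/tickit/unload/venue_",   // hypothetical UNLOAD output prefix
  classOf[RedshiftInputFormat],
  classOf[java.lang.Long],                 // assumed key: byte offset of the record
  classOf[Array[String]])                  // assumed value: fields of one unloaded row
```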
Hi all,
What is the best way to perform Spark SQL queries and obtain the resulting
tuples in a streaming way? In particular, I want to aggregate data and
obtain the first, incomplete results quickly, and have them updated until the
aggregation is completed.
Best Regards.
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the functionality in my client
code without patching Spark? I tried making my own saveFunc function and
calling DStream.foreachRDD but ran into trouble with invoking rddToFileName
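A sketch of one workaround, assuming a DStream[String] named `stream` and a hypothetical naming scheme that mimics what saveAsTextFiles produces:
```
import org.apache.hadoop.io.compress.GzipCodec

stream.foreachRDD { (rdd, time) =>
  val path = s"/tmp/output/events-${time.milliseconds}"   // hypothetical prefix
  rdd.saveAsTextFile(path, classOf[GzipCodec])            // RDD.saveAsTextFile accepts a codec
}
```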
Hi Heather,
Please check this issue https://issues.apache.org/jira/browse/SPARK-4672. I
think you can solve this problem by checkpointing your data every several
iterations.
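A minimal sketch of the periodic-checkpoint idea (the directory, the stand-in dataset, and the 10-iteration interval are hypothetical):
```
sc.setCheckpointDir("/tmp/spark-checkpoints")

var current = sc.parallelize(1 to 1000)     // stand-in for the real iterative dataset
for (i <- 1 to 50) {
  current = current.map(_ + 1).cache()      // stand-in for one iteration of the algorithm
  if (i % 10 == 0) {
    current.checkpoint()                    // truncates the lineage so it cannot grow unboundedly
    current.count()                         // forces evaluation so the checkpoint is actually written
  }
}
```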
Hope that helps.
Best regards,
Baoxu(Dash) Shi
Computer Science and Engineering Department
University of Notre Dame
Hi all,
I'm having some issues finding any kind of best practices when attempting
to create Spark applications which launch jobs from a thread pool.
Initially I had issues passing the SparkContext to other threads as it is
not serializable. Eventually I found that adding the @transient
On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden splee...@gmail.com wrote:
Initially I had issues passing the SparkContext to other threads as it is
not serializable. Eventually I found that adding the @transient annotation
prevents a NotSerializableException.
This is really puzzling. How are
You can see an example of the constructor for the class which executes a
job in my opening post.
I'm attempting to instantiate and run the class using the code below:
```
val conf = new SparkConf()
.setAppName(appNameBase.format("Test"))
val connector = CassandraConnector(conf)
Hello!
I am working with a column of Maps in DataFrames, and I was wondering if there
is a good way of removing a set of keys and their associated values from
that column. I've been using a UDF, but if there is some built-in function
that I'm missing, I'd love to know.
Thanks!
+1 to the question about serialization. The SparkContext is still in the driver
process (even if it has several threads from which you submit jobs).
As for the problem, check your classpath, Scala version, Spark version, etc.
Such errors usually happen when there is some conflict in the classpath. Maybe
you
I am using Spark 1.3.1, so I don't have the 1.4.0 isEmpty. I guess I am
curious about the right approach here. Like I said in my original post,
perhaps this isn't bad, but the exceptions bother me at a programmer
level... is that wrong? :)
On Fri, Jun 5, 2015 at 11:07 AM, Ted Yu
John:
Which Spark release are you using ?
As of 1.4.0, RDD has this method:
def isEmpty(): Boolean = withScope {
FYI
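For pre-1.4 versions, an equivalent check can be written in user code; a sketch:
```
import org.apache.spark.rdd.RDD

def rddIsEmpty[T](rdd: RDD[T]): Boolean =
  rdd.partitions.length == 0 || rdd.take(1).length == 0
```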
On Fri, Jun 5, 2015 at 9:01 AM, Evo Eftimov evo.efti...@isecc.com wrote:
The foreachPartition callback is provided with an Iterator by the Spark Framework
– while
Solved: it is SPARK_PID_DIR from spark-env.sh. Changing this directory from /tmp
to something different actually changed the error that I got, now showing where
the actual error was coming from (a null pointer in my program). The first
error was not helpful at all though.
Please ignore this whole thread. It's working out of nowhere. I'm not sure
what was the root cause. After I restarted the VM the previous SIFT code
also started working.
On Fri, Jun 5, 2015 at 10:40 PM, Sam Stoelinga sammiest...@gmail.com
wrote:
Thanks Davies. I will file a bug later with code
Hi all,
I'm using spark 1.3.1 and ran the following code:
sc.textFile(path)
.map(line => (getEntId(line), line))
.persist(StorageLevel.MEMORY_AND_DISK)
.groupByKey
.flatMap(x => func(x))
.reduceByKey((a, b) => (a + b).toShort)
I get the following error in
The foreachPartition callback is provided with an Iterator by the Spark Framework –
while iterator.hasNext() ……
Also check whether this is not some sort of Python Spark API bug – Python seems
to be the foster child here – Scala and Java are the darlings
From: John Omernik
It may be that your system runs out of resources (i.e. 174 is the ceiling) due to
the following:
1. RDD Partition = (Spark) Task
2. RDD Partition != (Spark) Executor
3. (Spark) Task != (Spark) Executor
4. (Spark) Task = JVM Thread
5. (Spark) Executor = JVM
The param is for “Default number of partitions in RDDs returned by
transformations like join, reduceByKey, and parallelize when NOT set by user.”
While Deepak is setting the number of partitions EXPLICITLY.
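A sketch of the distinction (the RDDs and the numbers are hypothetical):
```
// assumes spark.default.parallelism was set, e.g. --conf spark.default.parallelism=400 at submit time
val rddA = sc.parallelize(1 to 1000).map(x => (x % 10, x))
val rddB = sc.parallelize(1 to 1000).map(x => (x % 10, x * 2))

val joined        = rddA.join(rddB)        // no count given: spark.default.parallelism applies
val repartitioned = rddA.repartition(348)  // explicit count: the conf default is ignored
```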
From: 李铖 [mailto:lidali...@gmail.com]
Sent: Friday, June 5, 2015 11:08 AM
To:
Spark uses Tachyon internally, i.e. all SERIALIZED IN-MEMORY RDDs are kept there –
so if you have a BATCH RDD which is SERIALIZED IN-MEMORY then you are using
Tachyon implicitly – the only difference is that if you are using Tachyon
explicitly, i.e. as a distributed, in-memory file system, you can
If your problem is that stopping/starting the cluster resets configs, then
you may be running into this issue:
https://issues.apache.org/jira/browse/SPARK-4977
Nick
On Thu, Jun 4, 2015 at 2:46 PM barmaley o...@solver.com wrote:
Hi - I'm having a similar problem with switching from ephemeral to
Lee, what cluster do you use? Standalone, yarn-cluster, yarn-client, Mesos?
In yarn-cluster mode the driver program is executed inside one of the nodes of the
cluster, so it might be that the driver code needs to be serialized to be sent to
some node.
On 5 June 2015 at 22:55, Lee McFadden splee...@gmail.com wrote:
Ignoring the serialization thing (seems like a red herring):
On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden splee...@gmail.com wrote:
15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
localhost): java.lang.NoSuchMethodError:
I'm running PageRank on datasets with different sizes (from 1GB to 100GB).
Sometime my job is aborted showing this error:
Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times,
most recent failure: Lost task 0.3 in stage 4.1 (TID 2051,
9.12.247.250): java.io.FileNotFoundException:
On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin van...@cloudera.com wrote:
Ignoring the serialization thing (seems like a red herring):
People seem surprised that I'm getting the Serialization exception at all -
I'm not convinced it's a red herring per se, but on to the blocking issue...
On Fri, Jun 5, 2015 at 12:55 PM, Lee McFadden splee...@gmail.com wrote:
Regarding serialization, I'm still confused as to why I was getting a
serialization error in the first place as I'm executing these Runnable
classes from a java thread pool. I'm fairly new to Scala/JVM world and
there
Thanks for letting us know.
On Fri, Jun 5, 2015 at 8:34 AM, Sam Stoelinga sammiest...@gmail.com wrote:
Please ignore this whole thread. It's working out of nowhere. I'm not sure
what was the root cause. After I restarted the VM the previous SIFT code
also started working.
On Fri, Jun 5, 2015 at
Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all variables in
a closure used on an RDD are Serializable - I hate linking to Quora, but
there's a good explanation here:
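As an illustration of that capture behaviour, a minimal sketch with hypothetical class and field names (loosely modelled on the SecondRollup class from this thread):
```
import org.apache.spark.SparkContext

class SecondRollup(sc: SparkContext, prefix: String) extends Runnable {
  override def run(): Unit = {
    val data = sc.parallelize(1 to 10)

    // Problematic: using `prefix` here references `this`, dragging the whole object --
    // including the SparkContext it holds -- into the task closure:
    //   data.map(x => s"$prefix-$x")

    // Safer: copy what the closure needs into a local val, so only that value is captured.
    val localPrefix = prefix
    data.map(x => s"$localPrefix-$x").collect().foreach(println)
  }
}
```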
Hi Yin,
Thanks for the suggestion.
I’m not happy about this, and I don’t agree with your position that since it
wasn’t an “officially” supported feature
no harm was done breaking it in the course of implementing SPARK-6908. I would
still argue that it changed
and therefore broke .table()’s
Thanks all. The answers post is mine too, and I do multi-thread. Ted is
aware of that too, and MapR is helping me with it. I shall report the outcome of that
investigation when we have it.
As to reproduction, I've installed the MapR file system and tried both versions
4.0.2 and 4.1.0, and have Mesos running along
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman igor.ber...@gmail.com wrote:
Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos?
Spark standalone, v1.2.1.
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin van...@cloudera.com wrote:
You didn't show the error so the only thing we can do is speculate. You're
probably sending the object that's holding the SparkContext reference over
the network at some point (e.g. it's used by a task run in an
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote:
Your lambda expressions on the RDDs in the SecondRollup class are closing
around the context, and Spark has special logic to ensure that all
variables in a closure used on an RDD are Serializable - I hate linking to
Quora,
Hi Doug,
For now, I think you can use sqlContext.sql("USE databaseName") to change
the current database.
Thanks,
Yin
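A sketch of that (database and table names are hypothetical):
```
sqlContext.sql("USE my_database")
val df = sqlContext.table("my_table")   // now resolved against my_database
```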
On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai yh...@databricks.com wrote:
Hi Doug,
sqlContext.table does not officially support database name. It only
supports table name as the
It seems like there is another thread going on:
http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html
I'm not particularly sure why; it seems like the problem is that getting the
current context class loader returns null in this instance.
Do you have some
I'm running PageRank on datasets with different sizes (from 1GB to 100GB).
Sometime my job is aborted showing this error:
Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most
recent failure: Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250):
java.io.FileNotFoundException:
There used to be a project, StreamSQL (
https://github.com/thunderain-project/StreamSQL), but it appears a bit
dated and I do not see it in the Spark repo, but may have missed it.
@TD Is this project still active?
I'm not sure what the status is but it may provide some insights on how to
achieve
You could take a look at the RDD *Async operations and their source code. Maybe that can
help with getting some early results.
TD
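A sketch of those async actions (the data is hypothetical; on Spark 1.3+ the async implicits should already be in scope):
```
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

val rdd = sc.parallelize(1 to 1000000)
val futureCount = rdd.countAsync()       // returns a FutureAction instead of blocking the caller

futureCount.onComplete {
  case Success(n)  => println(s"count finished: $n")
  case Failure(ex) => ex.printStackTrace()
}
```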
On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile
pietro.gentile89.develo...@gmail.com wrote:
Hi all,
what is the best way to perform Spark SQL queries and obtain the result
Maybe you could try to implement your own Partitioner. As I remember, by
default Spark uses HashPartitioner.
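A minimal sketch of a custom Partitioner (the bucketing rule is hypothetical):
```
import org.apache.spark.Partitioner

class ModPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case k: Int => ((k % numPartitions) + numPartitions) % numPartitions  // keep the result non-negative
    case _      => 0
  }
}

// usage on a pair RDD: pairRdd.partitionBy(new ModPartitioner(8))
```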
Check your spark.cassandra.connection.host setting. It should be pointing to
one of your Cassandra nodes.
Mohammed
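A sketch of setting it programmatically (the address is hypothetical); the same key can also be passed with --conf on the submit line:
```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-example")
  .set("spark.cassandra.connection.host", "10.0.0.5")   // a reachable Cassandra node, not 127.0.1.1
val sc = new SparkContext(conf)
```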
From: Yasemin Kaya [mailto:godo...@gmail.com]
Sent: Friday, June 5, 2015 7:31 AM
To: user@spark.apache.org
Subject: Cassandra Submit
Hi,
I am using cassandraDB in my project. I
I was able to run my application by just using a Hadoop/YARN cluster with
1 machine. Today I tried to extend the cluster to use one more machine, but
I got some problems on the YARN node manager of that newly added machine:
Node Manager Log:
«
2015-06-06 01:41:33,379 INFO
Why not let Spark SQL deal with parallelism? When using Spark SQL data
sources you can control parallelism by specifying mapred.min.split.size
and mapred.max.split.size in your Hadoop configuration. You can then
repartition your data as you wish and save it as Parquet.
--Hossein
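A sketch of that approach (the split sizes, path, and the input DataFrame are assumptions):
```
sc.hadoopConfiguration.set("mapred.min.split.size", (64L * 1024 * 1024).toString)    // 64 MB
sc.hadoopConfiguration.set("mapred.max.split.size", (256L * 1024 * 1024).toString)   // 256 MB

// `df` stands in for a DataFrame loaded through a Spark SQL data source (e.g. spark-avro)
df.repartition(200).saveAsParquetFile("/data/output.parquet")
```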
On Thu, May
I am not sure. Saisai may be able to say more about it.
TD
On Fri, Jun 5, 2015 at 5:35 PM, Todd Nist tsind...@gmail.com wrote:
There use to be a project, StreamSQL (
https://github.com/thunderain-project/StreamSQL), but it appears a bit
dated and I do not see it in the Spark repo, but may
I use a simple Python script to launch the cluster. I just did it for fun, so of course
it is not the best and a lot of modifications can be done. But I think you are looking
for something similar?
import subprocess as s
from time import sleep
cmd =
Thanks Ignor,
I managed to find a fairly simple solution. It seems that the shell scripts
(e.g. start-master.sh, start-slave.sh) end up executing /bin/spark-class,
which is always run in the foreground.
Here is a solution I provided on stackoverflow:
-
Is there a pythonic/sparkonic way to test for an empty RDD before using
foreachRDD? Basically I am using the Python example at
https://spark.apache.org/docs/latest/streaming-programming-guide.html to
put records somewhere. When I have data, it works fine; when I don't, I
get an exception. I am not