If you are using a pair RDD, then you can use the partitionBy method to provide
your partitioner.
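For reference, a rough sketch of that idea (the input path, key extraction, and
partition count below are illustrative assumptions, not from the original thread):

import org.apache.spark.HashPartitioner

// Hypothetical pair RDD keyed by the first CSV field; swap HashPartitioner
// for your own Partitioner subclass if you need custom placement.
val pairs = sc.textFile("data.txt").map(line => (line.split(",")(0), line))
val partitioned = pairs.partitionBy(new HashPartitioner(8))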
On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
What is re-partition ?
On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote:
In my understanding you need to create a
You can simply use a custom InputFormat (AccumuloInputFormat) with the
Hadoop RDDs (sc.newAPIHadoopFile etc.); all you need to do is
pass the job confs. Here's a pretty clean discussion:
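The linked discussion is not included here. As a rough, unverified sketch of the
idea (the Accumulo class names, setters, and credentials below are assumptions and
may differ by Accumulo version):

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job

// Build the job conf that AccumuloInputFormat expects, then hand it to Spark.
val job = Job.getInstance()
AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("pass"))
AccumuloInputFormat.setZooKeeperInstance(job, "instanceName", "zk1:2181")
AccumuloInputFormat.setInputTableName(job, "myTable")

val accumuloRDD = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])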
Hi,
When running an experimental KMeans job, the cached RDD is the
original points data.
I saw poor locality in the task details on the Web UI. Almost half of the
task input comes from the network instead of memory.
And a task with network input takes almost the same time as a
task
Actually if you only have one machine, just use the Spark local mode.
Just download the Spark tarball, untar it, set master to local[N], where N
= number of cores. You are good to go. There is no setup of job tracker or
Hadoop.
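For example (assuming a 4-core machine; the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// local[4] runs the driver and 4 worker threads in a single JVM -- no cluster needed.
val conf = new SparkConf().setAppName("local-test").setMaster("local[4]")
val sc = new SparkContext(conf)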
On Mon, Apr 20, 2015 at 3:21 PM, haihar nahak harihar1...@gmail.com
With Maven you could do something like:
mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core
Thanks
Best Regards
On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma i...@introo.me wrote:
Hi.
My usage only involves Spark Core and HDFS, so no Spark SQL,
MLlib, or other components are involved.
I saw
These are pair RDDs (itemId, item) (itemId, listing).
What do you mean by re-partitioning of these RDDS ?
Now what do you mean by your partitioner?
Can you elaborate ?
On Tue, Apr 21, 2015 at 11:18 AM, ayan guha guha.a...@gmail.com wrote:
If you are using a pair RDD, then you can use partitionBy
It could be a similar issue as
https://issues.apache.org/jira/browse/SPARK-4300
Thanks
Best Regards
On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h 165612...@qq.com wrote:
Hi,
I am studying the RDD caching function and wrote a small program to verify
it. I ran the program on Spark 1.3.0
I think DStream.transform is the one that you are looking for.
Thanks
Best Regards
On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov evo.efti...@isecc.com wrote:
Is the only way to implement custom partitioning of a DStream via the
foreach approach, so as to gain access to the actual RDDs comprising
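A minimal sketch of the transform-based alternative mentioned above (pairDStream
and the partition count are assumed names/values):

import org.apache.spark.HashPartitioner

// transform exposes each batch's RDD, so a pair DStream can be repartitioned
// with any Partitioner without dropping down to foreachRDD.
val repartitioned = pairDStream.transform { rdd =>
  rdd.partitionBy(new HashPartitioner(16))
}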
Hi,
I would like to report on trying the first option proposed by Lan - putting
the log4j.properties file under the root of my application jar.
It doesn't look like it is working in my case: I am submitting the
application to Spark from the application code (not with spark-submit).
It seems that in
Without the rest of your code it's hard to make sense of errors. Why do you
need to use reflection?
Make sure you use the same Scala version throughout; 2.10.4 is
recommended. That's still the official version for Spark, even though
provisional support for 2.11 exists.
Dean Wampler, Ph.D.
Solved
Looks like it's some incompatibility in the build when using -Phadoop-2.4;
I made the distribution with -Phadoop-provided and that fixed the issue.
On Tue, Apr 21, 2015 at 2:03 PM, Fernando O. fot...@gmail.com wrote:
Hi all,
I'm wondering if SparkPi works with hadoop HA (I guess
Here is an example using rows directly:
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema
Avro or parquet input would likely give you the best performance.
On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com
While running my Spark application on 1.3.0 with Scala 2.10.0, I
encountered
15/04/21 09:13:21 ERROR executor.Executor: Exception in task 7.0 in stage
2.0 (TID 28)
java.lang.UnsupportedOperationException: tail of empty list
at scala.collection.immutable.Nil$.tail(List.scala:339)
at
You can use
df.withColumn("a", df.b)
to make column a have the same value as column b.
On Mon, Apr 20, 2015 at 3:38 PM, ARose ashley.r...@telarix.com wrote:
In my Java application, I want to update the values of a Column in a given
DataFrame. However, I realize DataFrames are immutable, and
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
1.3. We should allow DataFrames in ALS.train. I will submit a patch.
You can use `ALS.train(training.rdd, ...)` for now as a workaround.
-Xiangrui
On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley jos...@databricks.com wrote:
This is https://issues.apache.org/jira/browse/SPARK-6231
Unfortunately this is pretty hard to fix, as it's hard for us to
differentiate these without aliases. However, you can add an alias as
follows:
from pyspark.sql.functions import *
df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
On
You can use the more verbose syntax:
d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2"))
On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I would like to rename a column after aggregation. In the following code, the
column name is SUM(_1#179); is
Hi Ayan,
If you want to use DataFrame, then you should use the Pipelines API
(org.apache.spark.ml.*) which will take DataFrames:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS
In the examples/ directory for ml/, you can find a MovieLensALS
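A minimal sketch of the DataFrame-based API (the column names and parameters are
illustrative, and ratingsDF is an assumed DataFrame with userId, movieId and
rating columns):

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setRank(10)
  .setMaxIter(10)
// fit() takes a DataFrame directly, unlike mllib's ALS.train which wants an RDD of Ratings.
val model = als.fit(ratingsDF)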
On 21 Apr 2015, at 17:34, Richard Marscher
rmarsc...@localytics.commailto:rmarsc...@localytics.com wrote:
- There are System.exit calls built into Spark as of now that could kill your
running JVM. We have shadowed some of the most offensive bits within our own
application to work around this.
Hi Emre, thanks for the help will have a look. Cheers!
On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello James,
Did you check the following resources:
-
https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming
-
Hello,
I would like to rename a column after aggregation. In the following code, the
column name is SUM(_1#179), is there a way to rename it to a more
friendly name?
scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
scala> d.groupBy("_1").sum().printSchema
root
|-- _1: integer
Some more info:
I'm putting the compression values in hive-site.xml and running the Spark job.
hc.sql("set ...") returns the expected (compression) configuration, but
looking at the logs, it creates the tables without compression:
15/04/21 13:14:19 INFO metastore.HiveMetaStore: 0: create_table:
We already do have a cron job in place to clean just the shuffle files.
However, what I would really like to know is whether there is a proper
way of telling Spark to clean up these files once it's done with them?
Thanks
NB
On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele gangele...@gmail.com
On Tuesday 21 April 2015 12:12 PM, Akhil Das wrote:
Your spark master should be spark://swetha:7077 :)
Thanks
Best Regards
On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com
mailto:madhvi.gu...@orkash.com wrote:
PFA screenshot of my cluster UI
Thanks
On Monday 20
We have a similar issue with massive Parquet files. Cheng Lian, could you
have a look?
2015-04-08 15:47 GMT+08:00 Zheng, Xudong dong...@gmail.com:
Hi Cheng,
I tried both of these patches, and they still do not seem to resolve my issue.
I also found that most of the time is spent on this line in
Hi!
I want to normalize features before training logistic regression. I set up the scaler:
scaler2 = StandardScaler(withMean=True, withStd=True).fit(features)
and apply it to a dataset:
scaledData = dataset.map(lambda x: LabeledPoint(x.label,
scaler2.transform(Vectors.dense(x.features.toArray()))))
You need to call sc.stop() to wait for the notifications to be processed.
Best Regards,
Shixiong(Ryan) Zhu
2015-04-21 4:18 GMT+08:00 Praveen Balaji secondorderpolynom...@gmail.com:
Thanks Shixiong. I tried it out and it works.
If you're looking at this post, here are a few points you may be
Hello Madvi,
Some work has been done by @pomadchin using the spark notebook, maybe you
should come on https://gitter.im/andypetrella/spark-notebook and poke him?
There are some discoveries he made that might be helpful to know.
Also you can poke @lossyrob from Azavea, he did that for geotrellis
Lately we upgraded our Spark to 1.3.
Not surprisingly, along the way I found a few incompatibilities between the
versions, which was quite expected.
I found one change whose origin I'm interested in understanding.
env: Amazon EMR, Spark 1.3, Hive 0.13, Hadoop 2.4
In Spark 1.2.1 I ran from the code a query such as:
Your spark master should be spark://swetha:7077 :)
Thanks
Best Regards
On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote:
PFA screenshot of my cluster UI
Thanks
On Monday 20 April 2015 02:27 PM, Akhil Das wrote:
Are you seeing your task being submitted to the UI?
BTW
This:
hc.sql("show tables").collect
Works great!
On Tue, Apr 21, 2015 at 10:49 AM, Ophir Cohen oph...@gmail.com wrote:
Lately we upgraded our Spark to 1.3.
Not surprisingly, along the way I found a few incompatibilities between the
versions, which was quite expected.
I found one change that I'm
I think maybe you need more partitions in your input, which might make
for smaller tasks?
On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone
christian.per...@gmail.com wrote:
I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large
*I am new to the Spark world and Job Server.
My Code:*
package spark.jobserver
import java.nio.ByteBuffer
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
import scala.collection.immutable.Map
import org.apache.cassandra.hadoop.ConfigHelper
import
Hi Archit,
Thanks a lot for your reply. I am using rdd.partitions.length to check
the number of partitions; rdd.partitions returns the array of partitions.
I would like to add one more question: do you have any idea how to get
the objects in each partition? Further, is there any way to figure
Have you tried the following ?
import sqlContext._
import sqlContext.implicits._
Cheers
On Tue, Apr 21, 2015 at 7:54 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I tried to convert an RDD to a data frame using the example codes on
spark website
case class
I tried to convert an RDD to a data frame using the example code on the Spark
website
case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val people =
Sorry, my code actually was
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
But in Spark 1.4.0 this does not seem to make any difference anyway and
the problem is the same with both versions.
On 2015-04-21 17:04, ayan guha wrote:
your code should be
df_one =
The problem with k-means is that we have to define the number of clusters,
which I don't want in this case.
So I am thinking of something like hierarchical clustering; any ideas or
suggestions?
On 21 April 2015 at 20:51, Jeetendra Gangele gangele...@gmail.com wrote:
I have a requirement in which I want to
your code should be
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple, and of course df_1 and df_2 are
different, so the join is yielding a cartesian product.
Best
Ayan
On Wed, Apr 22, 2015 at 12:42 AM, Karlson ksonsp...@siberie.de wrote:
Are you looking for these?
*mapPartitions*(*func*): Similar to map, but runs separately on each
partition (block) of the RDD, so *func* must be of type Iterator<T> =>
Iterator<U> when running on an RDD of type T.
*mapPartitionsWithIndex*(*func*): Similar to mapPartitions, but also provides *func* with an
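To answer the earlier question about inspecting the objects in each partition, a
small sketch (the sample data is made up):

// Tag every element with the index of the partition it lives in.
val rdd = sc.parallelize(1 to 10, 3)
val byPartition = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x))
}
byPartition.collect().foreach(println)   // e.g. (0,1) (0,2) (0,3) (1,4) ...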
Does line 27 correspond to brdcst.value ?
Cheers
On Apr 21, 2015, at 3:19 AM, donhoff_h 165612...@qq.com wrote:
Hi, experts.
I wrote a very small program to learn how to use broadcast variables, but
hit an exception. The program and the exception are listed below.
Could
Hi,
I've written an application that performs some machine learning on some
data. I've validated that the data _should_ give a good output with a decent
RMSE by using Lib-SVM:
Mean squared error = 0.00922063 (regression)
Squared correlation coefficient = 0.9987 (regression)
When I try to use
Hi,
We are building a Spark Streaming application which reads from Kafka, does
updateStateByKey based on the received message type, and finally stores into
Redis.
After running for a few seconds the executor process gets killed with an
OutOfMemory error.
The code snippet is below:
Hi
There are two ways of doing it.
1. Using SQL - this method directly creates another DataFrame object.
2. Using methods of the DF object, but in that case you have to provide the
schema through a Row object (see the sketch below). In this case you need to
explicitly call createDataFrame again, which will infer the
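A minimal sketch of the second approach (the field names and input path are
assumptions):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build Rows from raw text and describe them with an explicit schema.
val rowRDD = sc.textFile("people.txt")
  .map(_.split(","))
  .map(parts => Row(parts(0), parts(1).trim.toInt))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val peopleDF = sqlContext.createDataFrame(rowRDD, schema)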
Hi,
This should work. How are you checking the number of partitions?
Thanks and Regards,
Archit Thakur.
On Mon, Apr 20, 2015 at 7:26 PM, mas mas.ha...@gmail.com wrote:
Hi,
I aim to do custom partitioning on a text file. I first convert it into a
pair RDD and then try to use my custom
Hi
I am getting an error
Also, I am getting an error in the MLlib ALS.train function when passing a
DataFrame (do I need to convert the DF to an RDD?)
Code:
training = ssc.sql("select userId, movieId, rating from ratings where
partitionKey < 6").cache()
print type(training)
model =
I have a similar issue (I failed on the spark core project but with the same
exception as you). Then I followed the steps below (I am working on Windows):
delete the Maven repository and re-download all dependencies. The issue sounds
like a corrupted jar that can't be opened by Maven.
Other than this,
Other than this,
Hi Shiyao,
From the same page you referred to: Maven is the official recommendation for
packaging Spark, and is the “build of reference”. But SBT is supported for
day-to-day development since it can provide much faster iterative compilation.
More advanced developers may wish to use SBT.
For maven,
As a workaround, can you call getConf first before any setConf?
On Tue, Apr 21, 2015 at 1:58 AM, Ophir Cohen oph...@gmail.com wrote:
I think I've encountered the same problem; I'm trying to turn on
compression in Hive.
I have the following lines:
def initHiveContext(sc: SparkContext):
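If it helps, the suggested ordering would look roughly like this (the property key
is just an example, and whether this actually works around the issue is untested
here):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Read a config value first, then set the compression properties.
hc.getConf("hive.exec.compress.output", "false")
hc.setConf("hive.exec.compress.output", "true")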
At this point I am assuming that nobody has an idea... I am still going to
give it a last shot just in case it was missed by some people :)
Thanks,
On Mon, Apr 20, 2015 at 2:20 PM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Hey, so I start the context at the very end when all the piping is
Thanks Michael!
I have tried applying my schema programmatically, but I didn't get any
improvement in performance :(
Could you point me to some code examples using Avro please?
Many thanks again!
Renato M.
2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:
Here is an example
Hi Denys,
I don't see any issue in your Python code, so maybe there is a bug in the
Python wrapper. If it's in Scala, I think it should work. BTW,
LogisticRegressionWithLBFGS does the standardization internally, so you
don't need to do it yourself. It's worth giving it a try!
Sincerely,
DB Tsai
Try --executor-memory 5g, because you have 8 GB of RAM in each machine.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22603.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Sourav,
Can you post your updateFunc as well please ?
Regards,
Olivier.
Le mar. 21 avr. 2015 à 12:48, Sourav Chandra sourav.chan...@livestream.com
a écrit :
Hi,
We are building a Spark Streaming application which reads from Kafka, does
updateStateByKey based on the received message type
Thank you all.
On 22 Apr 2015 04:29, Xiangrui Meng men...@gmail.com wrote:
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in
1.3. We should allow DataFrames in ALS.train. I will submit a patch.
You can use `ALS.train(training.rdd, ...)` for now as a workaround.
-Xiangrui
I did a performance check on soc-LiveJournal PageRank between my local
machine (8 cores, 16 GB) in local mode and my small cluster (4 nodes, 12
cores, 40 GB), and I found cluster mode is way faster than local mode, so I
am confused.
no. of iterations --- Local mode(in mins) -- cluster mode(in mins)
1
Sure. But in general, I am assuming this "Graph is unexpectedly null when
DStream is being serialized" must mean something. Under which
circumstances would such an exception trigger?
On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote:
Yeah, I am not sure what is going on.
It is kind of unexpected; I can't imagine a real scenario under which it
should trigger. But obviously I am missing something :)
TD
On Tue, Apr 21, 2015 at 4:23 PM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Sure. But in general, I am assuming this Graph is unexpectedly null
when DStream is
Akhil, Thanks for the suggestions.
I tried out sc.addJar, --jars, --conf spark.executor.extraClassPath and
none of them helped. I added stuff into compute-classpath.sh. That did not
change anything. I checked the classpath of the running executor and made
sure that the hive jars are in that dir.
I am having a strange problem writing to s3 that I have distilled to this
minimal example:
def jsonRaw = s"${outprefix}-json-raw"
def jsonClean = s"${outprefix}-json-clean"
val txt = sc.textFile(inpath)//.coalesce(shards, false)
txt.count
val res = txt.saveAsTextFile(jsonRaw)
val txt2 =
Is there an efficient way to save an RDD with saveAsTextFile in such a way
that the data gets shuffled into separate directories according to a key?
(My end goal is to wrap the result in a multi-partitioned Hive table)
Suppose you have:
case class MyData(val0: Int, val1: String, directory_name:
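One approach often suggested for this kind of directory-per-key output is a custom
MultipleTextOutputFormat with saveAsHadoopFile. A sketch, not tested here; the
MyData fields, the records RDD, and the output path are assumed:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a sub-directory named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

// records is an assumed RDD[MyData]; key by the target directory name.
val keyed = records.map(d => (d.directory_name, d.toString))
keyed.saveAsHadoopFile("/out/path", classOf[String], classOf[String], classOf[KeyBasedOutput])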
Yeah, I am not sure what is going on. The only way to figure it out is to
take a look at the disassembled bytecode using javap.
TD
On Tue, Apr 21, 2015 at 1:53 PM, Jean-Pascal Billaud j...@tellapart.com
wrote:
At this point I am assuming that nobody has an idea... I am still going to
give it a last
Hi,
It should generate the same number of partitions as the number of splits.
How did you check the number of partitions? Also, please paste your file
size and your hdfs-site.xml and mapred-site.xml here.
Thanks and Regards,
Archit Thakur.
On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wenlei@gmail.com wrote:
Hi,
Hi, experts.
I wrote a very small program to learn how to use broadcast variables, but hit
an exception. The program and the exception are listed below. Could
anyone help me to solve this problem? Thanks!
**My Program is as follows**
object TestBroadcast02 {
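The original program is truncated above; as a generic reference, a minimal
broadcast usage sketch looks like this (the data and names are made up):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSketch"))
    // Ship a small read-only lookup table to every executor once.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value.getOrElse(k, 0))   // always access via .value inside tasks
      .sum()
    println(total)
    sc.stop()
  }
}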
Hello James,
Did you check the following resources:
-
https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming
-
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
--
Emre Sevinç
Thanks for the hints guys, much appreciated!
Even if I just do something like:
Select * from tableX where attribute1 < 5
I see similar behaviour.
@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something
I'm trying to write some unit tests for my Spark code.
I need to pass a JavaPairDStream<String, String> to my Spark class.
Is there a way to create a JavaPairDStream using Java API?
Also, is there a good resource that covers an approach (or approaches) for
unit testing using Java?
Regards
jk
Hi all,
I'm wondering if SparkPi works with Hadoop HA (I guess it should).
Hadoop's pi example works great on my cluster, so after having that
done I installed Spark, and in the worker log I'm seeing two problems
that might be related.
Versions: Hadoop 2.6.0
Spark 1.3.1
I'm
You are correct; I just found the same thing. You are better off with SQL,
then.
userSchemaDF = ssc.createDataFrame(userRDD)
userSchemaDF.registerTempTable("users")
#print userSchemaDF.take(10)
#SQL API works as expected
sortedDF = ssc.sql("SELECT userId, age, gender, work from users order
Could you possibly describe what you are trying to learn how to do in more
detail? Some basics of submitting programmatically:
- Create a SparkContext instance and use that to build your RDDs (see the sketch after this list)
- You can only have 1 SparkContext per JVM you are running, so if you need
to satisfy concurrent job
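A rough sketch of that pattern (names and paths are illustrative): the single
SparkContext is created once and reused for every job the application submits.

import org.apache.spark.{SparkConf, SparkContext}

// One SparkContext per JVM: create it once and share it across jobs.
val sc = new SparkContext(new SparkConf().setAppName("embedded-driver").setMaster("local[2]"))

val firstCount = sc.textFile("/data/input-a").count()    // job 1
val secondCount = sc.textFile("/data/input-b").count()   // job 2
println(s"$firstCount / $secondCount")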