https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Doesn't look like Spark SQL supports nested complex types right now.
Hi,
I have 2 files which come from csv import of 2 Oracle tables.
F1 has 46730613 rows
F2 has 3386740 rows
I build 2 tables with spark.
Table F1 join with table F2 on c1=d1.
All keys in F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows. But
it returns only 3437 rows.
Hi,
I have the classic word count example:
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect()
From the Job UI, I can only see 2 stages: 0-collect and 1-map.
What happened to the ShuffledRDD in reduceByKey? And why are both the flatMap and
map operations collapsed into a
Made progress but still blocked.
After recompiling the code on cmd instead of PowerShell, now I can see all 5
classes as you mentioned.
However I am still seeing the same error as before. Anything else I can check
for?
Exactly, that seems to be the problem; we will have to wait for the next release.
Thanks a lot, both solutions work.
best,
/Shahab
On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to
increment them like so:
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)
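A quick end-to-end sketch with toy data (the values are illustrative, not from the
thread):

val r1 = sc.parallelize(Seq(("a", 10), ("b", 20), ("a", 30)))
val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1)
// r2.collect() gives something like Array((a,1), (b,2)); which id goes to which key is not guaranteed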
Yes, and I prepared a basic talk on this exact topic. Slides here:
http://www.slideshare.net/srowen/anomaly-detection-with-apache-spark-41975155
This is elaborated in a chapter of an upcoming book that's available
in early release; you can look at the accompanying source code to get
some ideas
The main reason for the alpha tag is actually that APIs might still be
evolving, but we'd like to freeze the API as soon as possible. Hopefully it
will happen in either 1.3 or 1.4. In Spark 1.2, we're adding an external data
source API that we'd like to get experience with before freezing it.
Computation will be triggered by new files added to the directory.
If you place new files in the directory, it will start training the
model.
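For reference, a minimal sketch of file-based streaming training (the path, batch
interval and feature count are illustrative, not taken from this thread):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Every new file dropped into the directory becomes a batch of labeled points.
val training = ssc.textFileStream("hdfs:///training/dir").map(LabeledPoint.parse)
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(Array.fill(3)(0.0))) // assume 3 features
model.trainOn(training)
ssc.start()
ssc.awaitTermination()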
2014-11-11 5:03 GMT+08:00 Bui, Tri tri@verizonwireless.com.invalid:
Hi,
The model weight is not updating for streaming linear regression.
Spark SQL supports complex types, but casting doesn't work for complex
types right now.
On 11/25/14 4:04 PM, critikaled wrote:
https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Doesn't
Which version are you using? Or if you are using the most recent master
or branch-1.2, which commit are you using?
On 11/25/14 4:08 PM, david wrote:
Hi,
I have 2 files which come from csv import of 2 Oracle tables.
F1 has 46730613 rows
F2 has 3386740 rows
I build 2 tables with
The case runs correctly in my environment.
14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Model
updated at time 141690890 ms
14/11/25 17:48:20 INFO regression.StreamingLinearRegressionWithSGD: Current
model: weights, [0.8588]
Can you provide more detail
The problem was that I didn't use the correct class name; it should
be org.apache.spark.serializer.KryoSerializer.
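For anyone hitting the same thing, a minimal sketch of the setting (assuming you
configure it on a SparkConf rather than on the shell command line):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Fully qualified class name; note the "serializer" package segment.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")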
On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I want to test Kryo serialization but when starting spark-shell I'm
hitting the following error:
I have generated a sparse matrix by python, which has the size of
4000*174000 (.pkl), the following is a small part of this matrix :
(0, 45) 1 (0, 413) 1 (0, 445) 1 (0, 107) 4 (0, 80) 2 (0, 352) 1 (0, 157)
1 (0, 191) 1 (0, 315) 1 (0, 395) 4 (0, 282) 3 (0, 184) 1 (0, 403) 1 (0,
Hello,
I have a key value pair, whose value is an ArrayList and I would like to
move one value of the ArrayList to the key position and the key back into
the ArrayList. Is it possible to do this with a Java lambda expression?
This works in Python:
newMap = sourceMap.map(lambda (key,((value1,
Hey Experts,
I wanted to understand in detail the lifecycle of RDDs in a
streaming app.
From my current understanding:
- An RDD gets created from the realtime input stream.
- Transformations are applied lazily to the RDD to
produce other RDDs.
- Actions are
Hi,
While submitting your Spark job, specify --executor-cores 2 --num-executors 24;
it will divide the dataset into 24*2 Parquet files.
Or set the spark.default.parallelism value (e.g. 50) on the SparkConf object. It will
divide the dataset into 50 files in your HDFS.
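A minimal sketch of the second option (the value and app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-writer")
  .set("spark.default.parallelism", "50") // as suggested above
val sc = new SparkContext(conf)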
-Naveen
Thank you for answering, this is all very helpful!
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs
Java 8, but so far I haven't been able to get the Spark processes to use
the right JVM at start up.
Here's the command I use for launching the cluster. Note I'm using the
user-data feature to install Java 8:
./spark-ec2
Hi,
I am getting the following error
val model = ALS.train(ratings, rank, numIterations, 0.01)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 103.0 failed 1 times, most recent failure: Lost task 1.0 in stage 103.0
(TID 3, localhost): scala.MatchError:
Hi,
I'm selecting columns from a json file, transform some of them and would
like to store the result as a parquet file but I'm failing.
This is what I'm doing:
val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")
val clean_jRequests = sqlContext.sql("select
Any idea how to resolve this?
Regards,
Venkat
From: Venkat, Ankam
Sent: Sunday, November 23, 2014 12:05 PM
To: 'user@spark.apache.org'
Subject: Spark Streaming with Python
I am trying to run network_wordcount.py example mentioned at
Thanks Liang!
It was my bad; I fat-fingered one of the data points. After correcting it, the result
matches yours.
I am still not able to get the intercept. I am getting [error]
/data/project/LinearRegression/src/main/scala/StreamingLinearRegression.scala:47:
value setIntercept
I'm not (yet!) an active Spark user, but I saw this thread on Twitter and am
involved with Stanford CoreNLP.
Could someone explain what would need to change for CoreNLP to work better with
Spark, since that would be a useful goal?
That is, while Stanford CoreNLP is not quite uniform (being developed by
I am running a 3 node(32 core, 60gb) Yarn cluster for Spark jobs.
1) Below are my Yarn memory settings
yarn.nodemanager.resource.memory-mb = 52224
yarn.scheduler.minimum-allocation-mb = 40960
yarn.scheduler.maximum-allocation-mb = 52224
Apache Spark Memory Settings
export
Hi all,
seems that all the mllib models are declared accessible in the package,
except MatrixFactorizationModel, which is declared private to mllib. Any
reason why?
thanks,
Chris,
Thanks for stopping by! Here's a simple example. Imagine I've got a corpus
of data, which is an RDD[String], and I want to do some POS tagging on it.
In naive spark, that might look like this:
val props = new Properties()
props.setProperty("annotators", "pos")
val proc = new StanfordCoreNLP(props)
val data =
Any comments?
Hi!
I started playing with Spark a few days ago and now I'm configuring a little
cluster to play with during development. For this task, I'm using Apache
Mesos running in a Linux container managed by Docker. The Mesos master and
slave are running. I can see the web UI and everything looks fine.
I am
Probably the easiest/closest way to do this would be with a UDF, something
like:
registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")
Although this does not modify the column; it appends a new one instead.
Another more
repartition and coalesce should both allow you to achieve what you
describe. Can you maybe share the code that is not working?
On Mon, Nov 24, 2014 at 8:24 PM, tridib tridib.sama...@live.com wrote:
Hello,
I am reading around 1000 input files from disk in an RDD and generating
parquet. It
You'll need to be running a very recent version of Spark SQL as this
feature was just added.
On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
Thanks for your reply.. I'm trying to do what you suggested but I get:
scala> sqlContext.sql("CREATE TEMPORARY TABLE data
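For reference, a sketch of the data source syntax being discussed (the path and
data source are hypothetical; exact options depend on the source):

sqlContext.sql(
  """CREATE TEMPORARY TABLE data
    |USING org.apache.spark.sql.json
    |OPTIONS (path "/data/input.json")""".stripMargin)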
Hi
I am noticing that the RDDs that are persisted get cleaned up very quickly.
This usually happens in a matter of a few minutes. I tried setting a value
of 20 hours for the spark.cleaner.ttl property and still get the same
behavior.
In my use-case, I have to persist about 20 RDDs each of size
I am experimenting with two files and trying to generate 1 parquet file.
public class CompactParquetGenerator implements Serializable {
public void generateParquet(JavaSparkContext sc, String jsonFilePath,
String parquetPath) {
//int MB_128 = 128*1024*1024;
In 1.2, we added streaming k-means:
https://github.com/apache/spark/pull/2942 . -Xiangrui
On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact joannenetw...@gmail.com wrote:
Thank you Tobias!
On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Tue, Nov 25, 2014 at
I have a JavaPairRDD<KeyType, Tuple2<Type1, Type2>> originalPairs. There are
on the order of 100 million elements.
I call a function to rearrange the tuples:
JavaPairRDD<String, Tuple2<Type1, Type2>> newPairs =
originalPairs.values().mapToPair(new PairFunction<Tuple2<Type1, Type2>,
String, Tuple2<Type1, Type2>>
public void generateParquet(JavaSparkContext sc, String jsonFilePath,
String parquetPath) {
//int MB_128 = 128*1024*1024;
//sc.hadoopConfiguration().setInt(dfs.blocksize, MB_128);
//sc.hadoopConfiguration().setInt(parquet.block.size, MB_128);
JavaSQLContext
There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py
. You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
-Xiangrui
On Tue, Nov 25, 2014 at 2:39
Hi,
I am looking for some resources/tutorials that will help me achieve this:
My JavaSchemaRDD is from JSON objects like below.
How do I go about writing a UDF aggregate function let's say 'vectorAgg' which
I can call from sql that returns one result array that is a positional
aggregate across
It is data-dependent, and hence needs hyper-parameter tuning, e.g.,
grid search. The first batch is certainly expensive. But after you
figure out a small range for each parameter that fits your data,
subsequent batches should not be that expensive. There is an example
from AMPCamp:
Thank you.
How can I address more complex columns like maps and structs?
Thanks again!
Daniel
On 25 Nov 2014, at 19:43, Michael Armbrust mich...@databricks.com wrote:
Probably the easiest/closest way to do this would be with a UDF, something
like:
registerFunction(makeString, (s:
Besides API stability concerns, models constructed directly by users
rather than returned by ALS may not work well. The userFeatures and
productFeatures RDDs both have partitioners, so we can perform quick
lookups for prediction. If you save userFeatures and productFeatures
and load them back, it
hi Xiangrui,
thanks. that is a very useful feature.
any suggestion on saving/loading the model in the meantime?
Maps should just be Scala maps, and structs are rows inside of rows. If you
want to return a struct from a UDF you can do that with a case class.
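A minimal sketch of the case-class approach (the column and function names are
made up for illustration):

case class NameParts(first: String, last: String)

// A UDF that returns a case class produces a struct-typed column.
sqlContext.registerFunction("splitName", (s: String) => {
  val parts = s.split(" ", 2)
  NameParts(parts(0), if (parts.length > 1) parts(1) else "")
})
// e.g. sql("SELECT splitName(fullName) AS name FROM people") yields a struct column "name"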
On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com
wrote:
Thank you.
How can I address more complex columns like maps and structs?
We don't support native UDAs at the moment in Spark SQL. You can write a
UDA using Hive's API and use that within Spark SQL
On Tue, Nov 25, 2014 at 10:10 AM, Barua, Seemanto
seemanto.ba...@jpmchase.com.invalid wrote:
Hi,
I am looking for some resources/tutorials that will help me achieve
RDDs are immutable, so calling coalesce doesn't actually change the RDD but
instead returns a new RDD that has fewer partitions. You need to save that
to a variable and call saveAsParquetFile on the new RDD.
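A minimal sketch (paths are hypothetical; this assumes SchemaRDD's coalesce
override, which returns another SchemaRDD):

val jsonFiles = sqlContext.jsonFile("/input/json")
val compacted = jsonFiles.coalesce(1) // returns a new RDD; jsonFiles itself is unchanged
compacted.saveAsParquetFile("/output/parquet")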
On Tue, Nov 25, 2014 at 10:07 AM, tridib tridib.sama...@live.com wrote:
public
Ohh...how can I miss that. :(. Thanks!
Hi,
How can I implement a custom MultipleOutputFormat and specify it as the
output of my Spark job so that I can ensure that there is a unique output
file per key (instead of a unique output file per reducer)?
Thanks
Arpan
Fantastic!!! Exactly what I was looking for.
Thanks,
Natu
On Tue, Nov 25, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote:
Yes, and I prepared a basic talk on this exact topic. Slides here:
http://www.slideshare.net/srowen/anomaly-detection-with-apache-spark-41975155
This is
Thanks Michael,
It worked like a charm! I have a few more queries:
1. Is there a way to control the size of parquet file?
2. Which method do you recommend coalesce(n, true), coalesce(n, false) or
repartition(n)?
Thanks Regards
Tridib
Problem was solved by having the admins put this file on the edge nodes.
Thanks,
Arun
On Wed, Nov 19, 2014 at 12:27 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Your Hadoop configuration is set to look for this file to determine racks.
Is the file present on cluster nodes? If not, look at
Hello
I just stumbled on exactly the same issue as you are discussing in this
thread. Here are my dependencies:
<dependencies>
  <dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.1.0</version>
Hi,
Arpan Ghosh wrote:
Hi,
How can I implement a custom MultipleOutputFormat and specify it as
the output of my Spark job so that I can ensure that there is a unique
output file per key (instead of a unique output file per reducer)?
I use something like this:
class KeyBasedOutput[T :
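The class above is truncated in the archive, so here is a minimal sketch of the
general pattern rather than the original code (names are illustrative):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[K, V] extends MultipleTextOutputFormat[K, V] {
  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: K, value: V, name: String): String =
    key.toString
  // Drop the key so only the value is written to the file.
  override def generateActualKey(key: K, value: V): K =
    null.asInstanceOf[K]
}
// Usage sketch:
// pairs.saveAsHadoopFile("/out", classOf[String], classOf[String],
//   classOf[KeyBasedOutput[String, String]])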
Hi,
I have written a few data structures as classes, like the following.
So, here is my code structure:
project/foo/foo.py, __init__.py
project/bar/bar.py, __init__.py      (bar.py imports foo via "from foo.foo import *")
project/execute/execute.py           (execute.py imports bar via "from bar.bar import *")
Ultimately I am
I guess you want to use split("\\|") instead of split("|").
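The reason, in a quick sketch: String.split takes a regular expression, and a bare
"|" is regex alternation rather than a literal pipe.

"a|b|c".split("|")   // splits between every character instead of on the pipes
"a|b|c".split("\\|") // Array(a, b, c)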
On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian lian.cs@gmail.com wrote:
Which version are you using? Or if you are using the most recent master or
branch-1.2, which commit are you using?
On 11/25/14 4:08 PM, david wrote:
Hi,
I
I am running into the following NullPointerException:
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (scala.collection.convert.Wrappers$JListWrapper)
myArrayField (MyCaseClass)
at
A quick thought on this: I think this is distro-dependent also, right?
We ran into a similar issue in
https://issues.apache.org/jira/browse/BIGTOP-1546 where it looked like the
python libraries might be overwritten on launch.
On Tue, Nov 25, 2014 at 3:09 PM, Chengi Liu chengi.liu...@gmail.com
Hi Steve,
You changed the first value in a Tuple2, which is the one that Spark uses
to hash and determine where in the cluster to place the value. By changing
the first part of the PairRDD, you've implicitly asked Spark to reshuffle
the data according to the new keys. I'd guess that you would
Hi,
I'm trying to make a custom input format for CSV files. If you can share a little
bit more about what you read as input and what you have implemented, I'll
try to replicate the same things. If I find something interesting at my end
I'll let you know.
Thanks,
Harihar
How are you creating the object in your Scala shell? Maybe you can write a
function that directly returns the RDD, without assigning the object to a
temporary variable.
Matei
On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote:
The closer I look @ the stack trace in the Scala
I am using Spark SQL from Hive table with Parquet SerDe. Most queries are
executed from Spark's JDBC Thrift server. Is there a more efficient way to
access/query data? For example, using saveAsParquetFile() and parquetFile()
to save/load Parquet data and run queries directly?
Thanks,
Ken
I was wiring up my job in the shell while i was learning Spark/Scala. I'm
getting more comfortable with them both now so I've been mostly testing
through Intellij with mock data as inputs.
I think the problem lies more with Hadoop than Spark, as the Job object seems
to check its state and throw an
Leon,
I solved the problem by creating a workaround for it, so I didn't have a need to
upgrade to 1.1.2-SNAPSHOT.
Mohammed
Hi All,
I have Spark deployed to an EC2 cluster and was able to run jobs successfully
when the driver resides within the cluster. However, the job was killed when I tried
to submit it from my local machine. My guess is the Spark cluster can't open a
connection back to the driver since it is on my machine.
I’m
Hi,
I am trying to launch a spark 1.2 cluster with SparkSQL and custom
authentication. After launching the cluster using the ec2 scripts, I copied
the following hive-site.xml file into spark/conf dir:
<configuration>
  <property>
    <name>hive.server2.authentication</name>
    <value>CUSTOM</value>
  </property>
Two options that I can think of:
1) Use the Spark SQL Thrift/JDBC server.
2) Develop a web app using some framework such as Play and expose a set of
REST APIs for sending queries. Inside your web app backend, you initialize the
Spark SQL context only once when your app initializes.
Thanks, Cheng.
As an FYI for others trying to integrate Spark SQL JDBC server with Cassandra -
I ended up using CalliopeServer2, which extends the Thrift Server and it was
really straightforward.
Mohammed
I traced the code and used the following to call:
Spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal
--hiveconf hive.server2.thrift.port=1
The issue ended up being much more fundamental, however. Spark doesn't
If I combineByKey in the next step, I suppose I am paying for a shuffle I
need anyway - right?
Also, if I supply a custom partitioner rather than hash partitioning, can I control
where and how data is shuffled? Overriding equals and hashCode could be a bad
thing, but a custom partitioner is less dangerous.
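A minimal sketch of a custom Partitioner (the partitioning rule here is purely
illustrative):

import org.apache.spark.Partitioner

class FirstCharPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.isEmpty) 0 else math.abs(k.charAt(0).toInt) % parts
  }
}
// Usage sketch: pairs.partitionBy(new FirstCharPartitioner(24))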
On
To determine if this is a Windows vs. other configuration, can you just try
to call the Spark-class.cmd SparkSubmit without actually referencing the
Hadoop or Thrift server classes?
On Tue Nov 25 2014 at 5:42:09 PM Judy Nash judyn...@exchange.microsoft.com
wrote:
I traced the code and used
I believe coalesce(..., true) and repartition are the same. If the input
files are of similar sizes, then coalesce will be cheaper as it introduces a
narrow dependency
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf,
meaning there won't be a shuffle. However, if there
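A minimal sketch of the difference (the path and partition counts are illustrative):

val rdd = sc.textFile("/input")   // say 1000 small partitions
val narrow = rdd.coalesce(50)     // shuffle = false by default: narrow dependency, no shuffle
val wide = rdd.repartition(50)    // same as coalesce(50, shuffle = true): full shuffle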
Yeah, unfortunately that will be up to them to fix, though it wouldn't hurt to
send them a JIRA mentioning this.
Matei
On Nov 25, 2014, at 2:58 PM, Corey Nolet cjno...@gmail.com wrote:
I was wiring up my job in the shell while i was learning Spark/Scala. I'm
getting more comfortable with
Hello Spark fans,
I am trying to use the IDF model available in Spark MLlib to create a
tf-idf representation of an RDD[Vector]. Below I have attached my MWE.
I get the following error
java.lang.IndexOutOfBoundsException: 7 not in [-4,4)
at
Hello forum,
We are using a Spark distro built from the source of the latest 1.2.0 tag.
And we are facing the below issue, while trying to act upon the JavaRDD
instance, the stacktrace is given below.
Can anyone please let me know, what can be wrong here?
java.lang.ClassCastException: [B cannot be
Hi,
In the Spark on YARN, the AM (driver) will ask the RM for resources. Once
the resources are allocated by the RM, the AM will start the executors
through the NM. This is my understanding.
But, according to the Spark documentation (1), the
`spark.yarn.applicationMaster.waitTries` properties
Hi Tri,
setIntercept() is not a member function
of StreamingLinearRegressionWithSGD; it's a member function
of LinearRegressionWithSGD (GeneralizedLinearAlgorithm), which is a member
variable (named algorithm) of StreamingLinearRegressionWithSGD.
So you need to change your code to:
val model = new
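For reference, a sketch of where setIntercept lives in the batch API (illustrative
only; how the streaming wrapper exposes its inner algorithm can vary by version):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val lr = new LinearRegressionWithSGD()
lr.setIntercept(true) // defined on GeneralizedLinearAlgorithm, which LinearRegressionWithSGD extends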
Hi Shivani,
You misunderstand the parameter of SparseVector.
class SparseVector(
override val size: Int,
val indices: Array[Int],
val values: Array[Double]) extends Vector {
}
The first parameter is the total length of the Vector rather than the
number of non-zero elements.
So it
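A minimal sketch with toy values: a vector of total length 10 with non-zeros at
positions 0, 3 and 7.

import org.apache.spark.mllib.linalg.Vectors

val sv = Vectors.sparse(10, Array(0, 3, 7), Array(1.0, 4.0, 2.0))
// Dense equivalent: [1.0, 0, 0, 4.0, 0, 0, 0, 2.0, 0, 0]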
Hi,
The Spark assembly is time-costly. If I only need
spark-assembly-1.1.0-hadoop2.3.0.jar and do not need
spark-examples-1.1.0-hadoop2.3.0.jar, how do I configure Spark to
avoid assembling the examples jar? I know the export
SPARK_PREPEND_CLASSES=true method
can reduce the assembly, but
You can do sbt/sbt assembly/assembly to assemble only the main package.
Matei
On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote:
Hi,
The spark assembly is time costly. If I only need the
spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the
BTW as another tip, it helps to keep the SBT console open as you make source
changes (by just running sbt/sbt with no args). It's a lot faster the second
time it builds something.
Matei
On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
You can do sbt/sbt
Hi,
Thanks for your help!
Sandy, I had a bit of trouble finding the spark.executor.cores property.
(It wasn't there although its value should have been 2.)
I ended up throwing regular expressions
on scala.util.Properties.propOrElse(sun.java.command, ), which worked
surprisingly well ;-)
Thanks
Thanks Yanbo.
My issue was 1). I had the Spark Thrift server set up, but it was running against
Hive instead of Spark SQL due to a local change.
After I fixed this, beeline automatically caches rerun queries and accepts CACHE
TABLE.
Hi everyone,
I deployed Spark 1.1.0 and I'm trying to use it with spark-jobserver 0.4.0
(https://github.com/ooyala/spark-jobserver).
I previously used Spark 1.0.2 and had no problems with it. I want to use the
newer version of Spark (and Spark SQL) to create the SchemaRDD programmatically.
Hi,
I am trying to access the posterior probability of a Naive Bayes prediction
with MLlib using Java. As the member variables brzPi and brzTheta are
private, I applied a hack to access the values through reflection.
I am using Java and couldn't find a way to use the breeze library with Java.
If
Hi,
Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet support.
I got the following exceptions:
org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must
implement HiveOutputFormat, otherwise it should be either
IgnoreKeyTextOutputFormat or SequenceFileOutputFormat
Oh, I found an explanation at
http://cmenguy.github.io/blog/2013/10/30/using-hive-with-parquet-format-in-cdh-4-dot-3/
The error here is a bit misleading; what it really means is that the class
parquet.hive.DeprecatedParquetOutputFormat isn't in the classpath for Hive.
Sure enough, doing an ls
Pre-processing is a major workload before training a model.
MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are
essential for preprocessing and would be of great help for model training.
Take a look at this:
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
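A minimal sketch of the TF-IDF part (the input path is hypothetical):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] = sc.textFile("/data/docs.txt").map(_.split(" ").toSeq)
val tf: RDD[Vector] = new HashingTF().transform(documents)
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)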
Looks like a config issue. I ran the SparkPi job and it is still failing with the same
Guava error.
Command ran:
.\bin\spark-class.cmd org.apache.spark.deploy.SparkSubmit --class
org.apache.spark.examples.SparkPi --master spark://headnodehost:7077
--executor-memory 1G --num-executors 1
Hi All,
I just installed Spark on my laptop and am trying to get spark-shell to
work. Here is the error I see:
C:\spark\bin> spark-shell
Exception in thread "main" java.util.NoSuchElementException: key not found: CLASSPATH
at scala.collection.MapLike$class.default(MapLike.scala:228)
I'm running a spark-ec2 cluster.
I have a map task that calls a specialized C++ external app. The app doesn't
fully utilize the core as it needs to download/upload data as part of the
task. Looking at the worker nodes, it appears that there is one task with my
app running per core.
I'd like to
Any pointers guys?
On Tue, Nov 25, 2014 at 5:32 PM, Mukesh Jha me.mukesh@gmail.com wrote:
Hey Experts,
I wanted to understand in detail about the lifecycle of rdd(s) in a
streaming app.
From my current understanding
- rdd gets created out of the realtime input stream.
- Transform(s)
Yes, it is possible to submit jobs to a remote Spark cluster. Just make
sure you follow the steps below (see the sketch after the list):
1. Set spark.driver.host to your local IP (where you run your code; it
should be accessible from the cluster).
2. Make sure no firewall/router configurations are blocking/filtering the
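A minimal sketch of step 1 (the addresses are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://ec2-master-host:7077")
  .setAppName("remote-submit-test")
  // Must be an address the cluster can reach, i.e. your machine's routable IP.
  .set("spark.driver.host", "203.0.113.10")
val sc = new SparkContext(conf)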
You could try following these guidelines:
http://docs.sigmoidanalytics.com/index.php/How_to_build_SPARK_on_Windows
Thanks
Best Regards
On Wed, Nov 26, 2014 at 12:24 PM, Sunita Arvind sunitarv...@gmail.com
wrote:
Hi All,
I just installed a spark on my laptop and trying to get spark-shell to
Mater, thank you very much!
After taking your advice, the time for assembly went from about 20 min down to 6 min
on my computer. That's a very big improvement.
On Wed, Nov 26, 2014 at 12:32 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
BTW as another tip, it helps to keep the SBT console open as you
Matei, sorry for the typo in my last message. And the tip saves about another 30 s on
my computer.
On Wed, Nov 26, 2014 at 3:34 PM, lihu lihu...@gmail.com wrote:
Mater, thank you very much!
After take your advice, the time for assembly from about 20min down to
6min in my computer. that's a very big
Hi Sunita,
This gitbook may also be useful for you to get Spark running in local mode
on your Windows machine:
http://blueplastic.gitbooks.io/how-to-light-your-spark-on-a-stick/content/
On Tue, Nov 25, 2014 at 11:09 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You could try following this