Could you provide all the pieces of code that reproduce the bug? Here is
my test code:
import org.apache.spark._
import org.apache.spark.SparkContext._
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect that if you change it to a while
loop like
var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}
then the problem may go away. I am not
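For reference, a minimal self-contained sketch of what such a reproduction might look like, assuming a local master, an app name, and a simple integer accumulator (all placeholders, not from the original mail):

import org.apache.spark.{SparkConf, SparkContext}

object AccumRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumRepro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0) // pre-2.0 accumulator API
    var i = 0
    while (i < 10) {
      sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
      i += 1
    }
    // expected: 10 * (1 + 2 + 3 + 4) = 100
    println("accum = " + accum.value)
    sc.stop()
  }
}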
Michael, any idea on this?
From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD
When we enable checkpointing and use JsonRDD, we get the following error. Is this
a bug?
I am trying to group by a calculated field. Is this supported in Spark SQL? I
am running it on a nested JSON structure.
Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c
group by YEAR(c.Patient.DOB)
Spark version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4.
Hi everyone,
I have come across a problem writing data to MongoDB in mapPartitions;
my code is as below:
val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
// some transformations
val rdd = sourceRDD.map(mapFunc).filter(filterFunc) val
Why not saveAsNewAPIHadoopFile?
// Define your MongoDB confs
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")
// Write everything to Mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any],
classOf[Any],
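For what it's worth, here is a rough sketch of how that call usually looks with the mongo-hadoop connector; the document field name and the dummy output path are placeholders (the path is ignored by MongoOutputFormat, but the API still requires one):

import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoOutputFormat
import org.bson.BasicBSONObject

val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Turn each record into a (key, BSON document) pair; a null key is fine here
val mongoRDD = rdd.map { value =>
  val doc = new BasicBSONObject()
  doc.put("value", value.toString) // "value" is a placeholder field name
  (null, doc)
}

mongoRDD.saveAsNewAPIHadoopFile(
  "file:///some/random", // ignored by MongoOutputFormat
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[Any, Any]],
  config)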
My bad, I just fired up a spark-shell and created a new sparkContext and it
was working fine. I basically did a parallelize and collect with both
sparkContexts.
Thanks
Best Regards
On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri, Nov 7, 2014 at 4:58 PM,
Dear List,
Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction
that processes all partition entries using a C++ library. This approach
works well, but there are some tradeoffs:
* Shipping the native
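As a rough illustration of that pattern, here is a sketch using JNA in mapPartitions; the library name, its score function, and the input RDD are invented for the example and are not the original poster's code:

import com.sun.jna.{Library, Native}

// Hypothetical binding to a native library "mylib" exposing: double score(double)
trait MyLib extends Library {
  def score(x: Double): Double
}

val input = sc.parallelize(1 to 1000).map(_.toDouble)

val scored = input.mapPartitions { iter =>
  // Load the native library once per partition, on the executor.
  // The shared library itself still has to be shipped to every worker node.
  val lib = Native.loadLibrary("mylib", classOf[MyLib]).asInstanceOf[MyLib]
  iter.map(lib.score)
}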
It doesn't currently support such a query. I can easily reproduce it. Created a
JIRA here: https://issues.apache.org/jira/browse/SPARK-4296
Best Regards,
Shixiong Zhu
2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com:
I am trying to group by on a calculated field. Is it supported on
@rogthefrog
Were you able to figure out how to fix this issue?
I have tried all the possible combinations, but no luck yet.
Thanks,
Harsha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html
Hi guys
Do you know how to handle the following case -
= From MESOS log file =
Slave asked to shut down by master@:5050 because 'health
check timed out'
I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework
===
Any configurations
Hi Ted and Silvio, thanks for your responses.
Hive has a new API for streaming (
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
that takes care of compaction and doesn't require any downtime for the
table. The data is immediately available and Hive will combine files in
You're right, serialization works.
What is your suggestion on saving a distributed model? Part of the model is
in one cluster, and other parts of the model are in other clusters. During
runtime, these sub-models run independently in their own clusters (load,
train, save), and at some
I'm getting this error when importing hive context
from pyspark.sql import HiveContext
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
from pyspark.context import SparkContext
File
Thanks Reza. I'm not familiar with block matrix multiplication, but is it a
good fit for a matrix with very large dimensions but extreme sparsity?
If not, what is your recommendation for implementing matrix multiplication
in Spark on such a matrix?
On Thu, Nov
Currently I see the word2vec model is collected onto the master, so the model
itself is not distributed.
I guess the question is why do you need a distributed model? Is the vocab size
so large that it's necessary? For model serving in general, unless the model is
truly massive (ie cannot
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.
You can save these RDDs out to disk by doing
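As a sketch of that approach (paths, rank, and the ratings RDD are placeholders; note that rebuilding a MatrixFactorizationModel from the loaded RDDs needs a public constructor, which may not be available in your Spark version, in which case you can use the factor RDDs directly):

import org.apache.spark.mllib.recommendation.ALS

val model = ALS.train(ratings, 10, 10, 0.01) // ratings: RDD[Rating]

// Persist the two factor RDDs
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// Later, in another job
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")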
I'm trying to compile some of the Spark code directly from the source
(https://github.com/apache/spark). It complains about the missing package
org.apache.spark.util. It doesn't look like this package is part of the
source code on GitHub.
Where can I find this package?
For ALS, if you want real-time recs (and usually this is on the order of 10s to a few 100s
of ms response), then Spark is not the way to go; a serving layer like Oryx or
prediction.io is what you want.
(At Graphflow we've built our own.)
You hold the factor matrices in memory and do the dot product in
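For example, scoring a single (user, item) pair from factor vectors already held in memory is just a dot product; a minimal sketch, assuming array-backed vectors of equal length:

def score(userVec: Array[Double], itemVec: Array[Double]): Double = {
  require(userVec.length == itemVec.length)
  var s = 0.0
  var i = 0
  while (i < userVec.length) {
    s += userVec(i) * itemVec(i)
    i += 1
  }
  s
}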
I found the util package under the Spark core package, but now I get this error:
Symbol Utils is inaccessible from this place.
What does this error mean?
org.apache.spark.util and org.apache.spark.util.Utils are there now.
Thanks.
Thanks for letting me know about this, it looks pretty interesting. From
reading the documentation it seems that the server must be built on a Spark
cluster, is that correct? Is it possible to deploy it on a Java
server? That is how we are currently running our web app.
On Tue, Nov 4,
Hi Nick, sorry about the confusion. Originally I had a question
specifically about word2vec, but my follow-up question on distributed models
is a more general question about saving different types of models.
On the distributed model, I was hoping to implement model parallelism, so
that different
Yep, but that's only if they are already represented as RDDs, which is much
more convenient for saving and loading.
My question is about the use case where they are not represented as RDDs yet.
Do you think it makes sense to convert them into RDDs, just for the
convenience of saving and
Thanks Nick. I'll take a look at Oryx and prediction.io.
Re: private val model in word2vec ;) yes, I couldn't wait, so I just changed
it in the word2vec source code. But I'm running into some compilation
issues now. Hopefully I can fix them soon to get things going.
On Fri, Nov 7, 2014
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that
I have not fully solved yet. I am able to run with Hadoop1 and AVRO in
standalone mode but not with Hadoop2 (even after trying to fix the
dependencies).
Anyway, I am now trying to write to AVRO, using a very similar
If you have a very large and very sparse matrix represented as (i, j,
value) entries, then you can try the algorithms mentioned in the post
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought
up earlier.
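In case it helps, entries in that (i, j, value) form map directly onto MLlib's CoordinateMatrix; a small sketch with made-up entries:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 1L, 2.0),
  MatrixEntry(3L, 7L, 0.5)
))
val mat = new CoordinateMatrix(entries)
// Convert to a row-oriented form if you want to feed the row-based
// similarity/multiplication routines discussed in the linked thread
val rowMat = mat.toRowMatrix()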
Reza
On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com
Just want to elaborate more on Duy's suggestion on using PredictionIO.
PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:
def train(data: PreparedData): PersistentMatrixFactorizationModel = {
val m = ALS.train(data.ratings,
Perhaps if you can describe what you are trying to accomplish at high level
it'll be easier to help.
On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu
madhu.jahagir...@philips.com wrote:
Any idea on this?
From: Jahagirdar, Madhu
Sent: Thursday,
Hi All,
I'm using Spark/Shark as the foundation for some reporting that I'm doing
and have a customers table with approximately 3 million rows that I've
cached in memory.
I've also created a partitioned table that I've also cached in memory on a
per day basis
FROM
customers_cached
INSERT
We are unable to run more than one application at a time using Spark 1.0.0 on
CDH5. We submit two applications using two different SparkContexts on the
same Spark Master. The Spark Master was started using the following command
and parameters and is running in standalone mode:
I finally came to realize that there is a special maven target to build the
scaladocs, although arguably a very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
I'd like some of the json fields to be in a MapType rather than a sub
StructType, as the keys will be very sparse.
For example:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
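Presumably the attempt looks something like the following sketch (the field names and sample JSON are invented); as Yin notes below, a MapType in the schema passed to jsonRDD is not supported at the time of this thread (see SPARK-4302):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
// "attributes" has very sparse keys, so a MapType is preferred over a sub StructType
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("attributes", MapType(StringType, StringType), nullable = true)
))
val json = sc.parallelize(Seq("""{"id": "a", "attributes": {"k1": "v1"}}"""))
val schemaRDD = sqlContext.jsonRDD(json, schema)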
I believe the web docs need to be built separately according to the
instructions here
https://github.com/apache/spark/blob/master/docs/README.md.
Did you give those a shot?
It's annoying to have a separate thing with new dependencies in order to
build the web docs, but that's how it is at the
I am not aware of any obvious existing pattern that does exactly this.
Generally this sort of computation (subsetting, denormalization) sounds
generic, but it actually has very specific requirements, so it is hard to
point to a design pattern without more information about those requirements.
If you
Hello Brian,
Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.
Thanks,
Yin
On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote:
I'm
Hi Chirag,
Could you please provide more information on your Java server environment?
Regards,
Donald
On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani chirag.lakh...@gmail.com
wrote:
Thanks for letting me know about this, it looks pretty interesting. From
reading the documentation it seems
Naveen,
Don't be worried - you're not the only one to be bitten by this. A little
inspection of the Javadoc told me you have this other option:
JavaRDD<Integer> distData = sc.parallelize(data, 100);
-- Now the RDD is split into 100 partitions.
We are running spark streaming jobs (version 1.1.0). After a sufficient
amount of time, the stderr file grows until the disk is full at 100% and
crashes the cluster. I've read this
https://github.com/apache/spark/pull/895
and also read this
Hi ,
I have been working on Spark SQL and want to expose this functionality to
other applications. The idea is to let other applications send SQL to be
executed on the Spark cluster and get the results back. I looked at the Spark job
server (https://github.com/ooyala/spark-jobserver) but it provides a
Hi,
I'm a committer on that spring-hadoop project and I'm also interested in
integrating Spark with other Java applications. I would love to see some
guidance from the Spark community for the best way to accomplish this. We
have plans to add features to work with Spark Apps in similar ways we now
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to
set it up yourself.
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
I’m getting this error when importing hive context
from
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and
hive 0.9.0.
When using python 2.7
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
I'm getting 'sc not defined'
On the other hand, I can see 'sc' from pyspark CLI.
Is there a way to fix it?
I am trying to persist MatrixFactorizationModel (Collaborative Filtering
example) and use it in another script to evaluate/apply it.
This is the exception I get when I try to use a deserialized model instance:
Exception in thread main java.lang.NullPointerException
at
Serializable like a Java object? no, it's an RDD. A factored matrix
model is huge, unlike most models, and is not a local object. You can
of course persist the RDDs to storage manually and read them back.
On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz
darek.kobyl...@gmail.com wrote:
I am
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output
after submitting the SparkPi example in yarn cluster mode
(http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html)
using:
spark-submit --class
Sounds like there are no free YARN workers, i.e. try running:
hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1
We have some smoke tests which you might find particularly useful for YARN
clusters as well in https://github.com/apache/bigtop, underneath
bigtop-tests/smoke-tests, which are generally good to
Could you tell us how large the data set is? It will help us debug this issue.
On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
the same bug running the 'sort.py' example. On a smaller data set, it
I first saw this using SparkSQL but the result is the same with plain
Spark.
14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at
Here is the code I run in spark-shell:
val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
val tuple = x.split('|')
histMap.put(tuple(0).toInt, 1)
}
Why is histMap still null?
Is there something wrong with my code?
Thanks,
Hi. I did use local[8] as below, but it still ran on only one core.
val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))
Any advice is much appreciated.
To set the number of Spark cores used, you must set two parameters in the actual
spark-submit script: num-executors (the number of executors to launch)
and executor-cores (the number of cores per executor). Please see the Spark
configuration and tuning pages for more details.
It doesn't work that way.
The following is the correct way:
val table = sc.textFile(args(1))
val histMap = table.map(x => {
  (x.split('|')(0).toInt, 1)
})
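If the goal is actually a small histogram of the first field collected back on the driver, a possible follow-up to the above (assuming the set of distinct keys is small enough to collect locally):

val table = sc.textFile(args(1))
val hist: scala.collection.Map[Int, Long] =
  table.map(x => (x.split('|')(0).toInt, 1L)).countByKey()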
-
Lalit Yadav
la...@sigmoidanalytics.com
We are running our applications through YARN and are only sometimes seeing
them in the History Server. Most do not seem to have the
APPLICATION_COMPLETE file. Specifically, any job that ends because of yarn
application -kill does not show up. For the other ones, what would be a reason
for them not
- dev list + user list
Shark is not officially supported anymore so you are better off moving to
Spark SQL.
Shark doesn't support Hive partitioning logic anyway; it has its own version of
partitioning into in-memory blocks, but that is independent of whether you
partition your data in Hive or not.
Mayur