Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
Could you provide the complete code that reproduces the bug? Here is my test code: import org.apache.spark._ import org.apache.spark.SparkContext._ object SimpleApp { def main(args: Array[String]) { val conf = new SparkConf().setAppName("SimpleApp") val sc = new SparkContext(conf)

Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in order to execute the for loop. I would expect if you change it to a while loop like var i = 0; while (i < 10) { sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x); i += 1 } then the problem may go away. I am not
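A minimal, self-contained sketch of that while-loop variant (the app name and local master are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorWhileLoop {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("AccumulatorWhileLoop").setMaster("local[2]"))
        val accum = sc.accumulator(0)
        var i = 0
        while (i < 10) {
          // foreach runs on the executors; each task adds its elements to the accumulator
          sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
          i += 1
        }
        // Accumulator values are only reliably read on the driver
        println(accum.value) // expected: 10 * (1 + 2 + 3 + 4) = 100
        sc.stop()
      }
    }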

RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael, any idea on this? From: Jahagirdar, Madhu Sent: Thursday, November 06, 2014 2:36 PM To: mich...@databricks.com; user Subject: CheckPoint Issue with JsonRDD When we enable checkpointing and use JsonRDD we get the following error: Is this a bug?

sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
I am trying to group by on a calculated field. Is it supported in Spark SQL? I am running it on a nested JSON structure. Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c group by YEAR(c.Patient.DOB) Spark Version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4.

about write mongodb in mapPartitions

2014-11-07 Thread qinwei
Hi, everyone. I have come across a problem writing data to MongoDB in mapPartitions; my code is as below: val sourceRDD = sc.textFile("hdfs://host:port/sourcePath") // some transformations val rdd = sourceRDD.map(mapFunc).filter(filterFunc) val

Re: about write mongodb in mapPartitions

2014-11-07 Thread Akhil Das
Why not saveAsNewAPIHadoopFile? // Define your MongoDB confs val config = new Configuration() config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output") // Write everything to mongo rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any], classOf[Any],
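A fuller sketch of that approach, assuming the mongo-hadoop connector is on the classpath and an existing SparkContext named sc; the database URI and the document contents are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkContext._
    import org.bson.{BSONObject, BasicBSONObject}
    import com.mongodb.hadoop.MongoOutputFormat

    val config = new Configuration()
    config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output") // placeholder URI

    // MongoOutputFormat consumes (key, BSONObject) pairs; the file path argument is ignored.
    val docs = sc.parallelize(1 to 100).map { i =>
      val doc: BSONObject = new BasicBSONObject()
      doc.put("value", Integer.valueOf(i))
      val key: Object = null // let MongoDB assign the _id
      (key, doc)
    }
    docs.saveAsNewAPIHadoopFile(
      "file:///tmp/unused",
      classOf[Object],
      classOf[BSONObject],
      classOf[MongoOutputFormat[Object, BSONObject]],
      config)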

Re: multiple spark context in same driver program

2014-11-07 Thread Akhil Das
My bad, I just fired up a spark-shell and created a new sparkContext and it was working fine. I basically did a parallelize and collect with both sparkContexts. Thanks Best Regards On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Nov 7, 2014 at 4:58 PM,

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List, Has anybody had experience integrating C/C++ code into Spark jobs? I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs: * Shipping the native
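For reference, a rough sketch of the JNA pattern described above (the library name mynative and its process function are hypothetical; assumes JNA is on the classpath and the shared library is installed on every worker):

    import com.sun.jna.{Library, Native}
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical native interface: int process(int)
    trait MyNativeLib extends Library {
      def process(value: Int): Int
    }

    object NativeIntegrationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("NativeIntegrationSketch"))
        val results = sc.parallelize(1 to 1000).mapPartitions { iter =>
          // Bind the native library once per partition, on the executor
          val lib = Native.loadLibrary("mynative", classOf[MyNativeLib]).asInstanceOf[MyNativeLib]
          iter.map(lib.process)
        }
        println(results.take(5).mkString(", "))
        sc.stop()
      }
    }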

Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Spark SQL doesn't support such a query right now. I can easily reproduce it. Created a JIRA here: https://issues.apache.org/jira/browse/SPARK-4296 Best Regards, Shixiong Zhu 2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com: I am trying to group by on a calculated field. Is it supported on
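A possible interim workaround, sketched here but not verified against that snapshot (assumes an existing HiveContext named hiveContext and the claim table from the original query): compute the UDF in an inner query and group on its alias.

    val result = hiveContext.sql("""
      SELECT dobYear, SUM(totalPay)
      FROM (SELECT YEAR(c.Patient.DOB) AS dobYear,
                   c.ClaimPay.TotalPayAmnt AS totalPay
            FROM claim c) t
      GROUP BY dobYear""")
    result.collect().foreach(println)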

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-11-07 Thread Sree Harsha
@rogthefrog Were you able to figure out how to fix this issue? I too tried all the combinations possible, but no luck yet. Thanks, Harsha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html

MESOS slaves shut down due to 'health check timed out

2014-11-07 Thread Yangcheng Huang
Hi guys, do you know how to handle the following case? From the MESOS log file: Slave asked to shut down by master@:5050 because 'health check timed out' I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework. Any configurations

Re: Store DStreams into Hive using Hive Streaming

2014-11-07 Thread Luiz Geovani Vier
Hi Ted and Silvio, thanks for your responses. Hive has a new API for streaming ( https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) that takes care of compaction and doesn't require any downtime for the table. The data is immediately available and Hive will combine files in

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some

error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing hive context: from pyspark.sql import HiveContext Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module> from pyspark.context import SparkContext File

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza. i'm not familiar with the block matrix multiplication, but is it a good fit for very large dimension, but extremely sparse matrix? if not, what is your recommendation on implementing matrix multiplication in spark on very large dimension, but extremely sparse matrix? On Thu, Nov

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed.  I guess the question is why do you need  a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (ie cannot

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs. You can save these RDDs out to disk by doing
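A sketch of what that could look like (the HDFS paths are placeholders):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    def saveFactors(model: MatrixFactorizationModel): Unit = {
      // userFeatures and productFeatures are RDD[(Int, Array[Double])]
      model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
      model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")
    }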

where is the org.apache.spark.util package?

2014-11-07 Thread ll
i'm trying to compile some of the spark code directly from the source (https://github.com/apache/spark). it complains about the missing package org.apache.spark.util. it doesn't look like this package is part of the source code on github. where can i find this package? -- View this message

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx, or prediction.io is what you want. (At graphflow we've built our own). You hold the factor matrices in memory and do the dot product in
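The scoring step itself is just a dot product over the two factor vectors, e.g. (a sketch; how the vectors are held in memory is up to the serving layer):

    // Score one (user, product) pair from its factor vectors.
    def score(userVec: Array[Double], productVec: Array[Double]): Double = {
      var dot = 0.0
      var i = 0
      while (i < userVec.length) {
        dot += userVec(i) * productVec(i)
        i += 1
      }
      dot
    }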

Re: where is the org.apache.spark.util package?

2014-11-07 Thread ll
i found the util package under the spark core package, but now i get this error: Symbol Utils is inaccessible from this place. what does this error mean? the org.apache.spark.util package and org.apache.spark.util.Utils are there now. thanks. -- View this message in context:

Re: deploying a model built in mllib

2014-11-07 Thread chirag lakhani
Thanks for letting me know about this, it looks pretty interesting. From reading the documentation it seems that the server must be built on a Spark cluster, is that correct? Is it possible to deploy it in on a Java server? That is how we are currently running our web app. On Tue, Nov 4,

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion. originally i had a question specifically about word2vec, but my follow up question on distributed model is a more general question about saving different types of models. on distributed model, i was hoping to implement a model parallelism, so that different

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs, which is much more convenient for saving and loading. my question is for the use case where they are not represented as RDDs yet. then, do you think it makes sense to convert them into RDDs, just for the convenience of saving and

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thanks nick. i'll take a look at oryx and prediction.io. re: private val model in word2vec ;) yes, i couldn't wait so i just changed it in the word2vec source code. but i'm running into some compilation issues now. hopefully i can fix them soon, to get things going. On Fri, Nov 7, 2014

Re: AVRO specific records

2014-11-07 Thread Simone Franzini
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that I have not fully solved yet. I am able to run with Hadoop1 and AVRO in standalone mode but not with Hadoop2 (even after trying to fix the dependencies). Anyway, I am now trying to write to AVRO, using a very similar

Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j, value) entries, then you can try the algorithms mentioned in the post https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought up earlier. Reza On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com
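The join-based formulation from that earlier thread looks roughly like this (a sketch; storing the matrices as (i, k, value) / (k, j, value) triples is an assumption): join A's entries with B's entries on the shared inner index k, multiply, and sum the partial products per output cell.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // A holds (i, k, value), B holds (k, j, value); the result maps (i, j) -> value.
    def multiplySparse(a: RDD[(Int, Int, Double)],
                       b: RDD[(Int, Int, Double)]): RDD[((Int, Int), Double)] = {
      val aByCol = a.map { case (i, k, v) => (k, (i, v)) }
      val bByRow = b.map { case (k, j, w) => (k, (j, w)) }
      aByCol.join(bByRow)
        .map { case (_, ((i, v), (j, w))) => ((i, j), v * w) }
        .reduceByKey(_ + _) // sum the partial products into C(i, j)
    }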

Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Simon Chan
Just want to elaborate more on Duy's suggestion on using PredictionIO. PredictionIO will store the model automatically if you return it in the training function. An example using CF: def train(data: PreparedData): PersistentMatrixFactorizationModel = { val m = ALS.train(data.ratings,

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
Perhaps if you can describe what you are trying to accomplish at a high level it'll be easier to help. On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: Any idea on this? From: Jahagirdar, Madhu Sent: Thursday,

partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per-day basis: FROM customers_cached INSERT

Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error

2014-11-07 Thread ryaminal
We are unable to run more than one application at a time using Spark 1.0.0 on CDH5. We submit two applications using two different SparkContexts on the same Spark Master. The Spark Master was started using the following command and parameters and is running in standalone mode:

Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole

jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). I'd like some of the json fields to be in a MapType rather than a sub StructType, as the keys will be very sparse. For example: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD

Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the instructions here https://github.com/apache/spark/blob/master/docs/README.md. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the

Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this. Generally, this sort of computation (subsetting, denormalization) sounds generic but actually comes with very specific requirements, so it is hard to point to a design pattern without more requirement info. If you

Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian, Right now, MapType is not supported in the StructType provided to jsonRDD/jsonFile. We will add the support. I have created https://issues.apache.org/jira/browse/SPARK-4302 to track this issue. Thanks, Yin On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote: I'm

Re: deploying a model built in mllib

2014-11-07 Thread Donald Szeto
Hi Chirag, Could you please provide more information on your Java server environment? Regards, Donald On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Thanks for letting me know about this, it looks pretty interesting. From reading the documentation it seems

Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, Don't be worried - you're not the only one to be bitten by this. A little inspection of the Javadoc told me you have this other option: JavaRDD<Integer> distData = sc.parallelize(data, 100); -- Now the RDD is split into 100 partitions. -- View this message in context:

spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient amount of time, the stderr file grows until the disk is full at 100% and crashes the cluster. I've read this https://github.com/apache/spark/pull/895 and also read this

Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi, I have been working on Spark SQL and want to expose this functionality to other applications. The idea is to let other applications send SQL to be executed on the Spark cluster and get the result back. I looked at spark job server (https://github.com/ooyala/spark-jobserver) but it provides a

Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
Hi, I'm a committer on that spring-hadoop project and I'm also interested in integrating Spark with other Java applications. I would love to see some guidance from the Spark community for the best way to accomplish this. We have plans to add features to work with Spark Apps in similar ways we now

Re: error when importing HiveContext

2014-11-07 Thread Davies Liu
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to set it up yourself: export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I’m getting this error when importing hive context from

spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and hive 0.9.0. When using python 2.7 from pyspark.sql import HiveContext sqlContext = HiveContext(sc) I'm getting 'sc not defined' On the other hand, I can see 'sc' from pyspark CLI. Is there a way to fix it?

MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist MatrixFactorizationModel (Collaborative Filtering example) and use it in another script to evaluate/apply it. This is the exception I get when I try to use a deserialized model instance: Exception in thread main java.lang.NullPointerException at

Re: MatrixFactorizationModel serialization

2014-11-07 Thread Sean Owen
Serializable like a Java object? no, it's an RDD. A factored matrix model is huge, unlike most models, and is not a local object. You can of course persist the RDDs to storage manually and read them back. On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz darek.kobyl...@gmail.com wrote: I am
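A sketch of the manual round trip Sean describes (the paths are placeholders, matching the save sketch earlier in this digest; if the model's constructor is not public in your release, predictions can be computed from the loaded factors directly):

    import org.apache.spark.SparkContext._

    val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")

    // Predict one (user, product) pair from the loaded factors (assumes both ids exist).
    val userVec = userFeatures.lookup(42).head
    val productVec = productFeatures.lookup(7).head
    val rating = userVec.zip(productVec).map { case (u, p) => u * p }.sum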

SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output after submitting the SparkPi example in yarn cluster mode (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html) using: spark-submit --class

Re: SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread jayunit100
Sounds like no free yarn workers. i.e. try running: hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1 We have some smoke tests which you might find particularly useful for yarn clusters as well in https://github.com/apache/bigtop, underneath bigtop-tests/smoke-tests which are generally good to

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-07 Thread Davies Liu
Could you tell how large is the data set? It will help us to debug this issue. On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote: I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it

Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain Spark. 14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z at

How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell: val table = sc.textFile(args(1)) val histMap = collection.mutable.Map[Int,Int]() for (x <- table) { val tuple = x.split('|') histMap.put(tuple(0).toInt, 1) } Why is histMap still null? Is there something wrong with my code?

Fwd: How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell: val table = sc.textFile(args(1)) val histMap = collection.mutable.Map[Int,Int]() for (x <- table) { val tuple = x.split('|') histMap.put(tuple(0).toInt, 1) } Why is histMap still null? Is there something wrong with my code? Thanks,

Re: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread ll
hi. i did use local[8] as below, but it still ran on only 1 core. val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("abc")) any advice is much appreciated. -- View this message in context:

RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of spark cores used you must set two parameters in the actual spark-submit script. You must set num-executors (the number of executors to launch) and executor-cores (the number of cores per executor). Please see the Spark configuration and tuning pages for more details.

Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way. Following is the correct way: val table = sc.textFile(args(1)) val histMap = table.map(x => (x.split('|')(0).toInt, 1)) - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context:
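For context, the original loop comes back empty because the foreach body runs on the executors against serialized copies of histMap, so the driver's map never changes. A sketch that keeps the counting in RDD operations and only brings the final histogram back to the driver (the input path is a placeholder):

    val table = sc.textFile("hdfs:///path/to/table")
    // countByValue returns the histogram to the driver as a local Map[Int, Long]
    val histMap: scala.collection.Map[Int, Long] =
      table.map(x => x.split('|')(0).toInt).countByValue()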

Re: Viewing web UI after fact

2014-11-07 Thread Arun Ahuja
We are running our applications through YARN and only sometimes see them in the History Server. Most do not seem to have the APPLICATION_COMPLETE file. Specifically, any job that ends because of yarn application -kill does not show up. For other ones, what would be a reason for them not

Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
(- dev list, + user list) Shark is not officially supported anymore, so you are better off moving to Spark SQL. Shark doesn't support Hive partitioning logic anyway; it has its own version of partitioning into in-memory blocks, but that is independent of whether you partition your data in Hive or not. Mayur