Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Andy Davidson
Hi Felix and Ted, this is how I am starting Spark. Should I file a bug? Andy export PYSPARK_PYTHON=python3.4 export PYSPARK_DRIVER_PYTHON=python3.4 export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN" $SPARK_ROOT/bin/pyspark \ --master $MASTER_URL \

Multiplication on decimals in a dataframe query

2015-11-29 Thread Philip Dodds
I hit a weird issue when I tried to multiply two decimals in a select (either in Scala or as SQL), and I'm assuming I must be missing the point. The issue is fairly easy to recreate with something like the following: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import
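
Since the snippet above is cut off by the archive, here is a hedged PySpark sketch of the same kind of repro; the column names, precisions, and values are assumed rather than taken from the original post:

from decimal import Decimal
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DecimalType

sc = SparkContext("local[*]", "decimal-repro")
sqlContext = SQLContext(sc)
schema = StructType([StructField("a", DecimalType(38, 18)),
                     StructField("b", DecimalType(38, 18))])
df = sqlContext.createDataFrame([(Decimal("1.5"), Decimal("2.5"))], schema)
df.selectExpr("a * b").show()  # check the type and value of the product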

Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Spark Job Server allows you to submit your apps to any kind of deployment (Standalone, Cluster). I think it could be suitable for your use case. Check the following GitHub repo: https://github.com/spark-jobserver/spark-jobserver Ardo On Sun, Nov 29, 2015 at 6:42 PM, Նարեկ Գալստեան

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Hi. Your code is like this, right? "joined_dataset = show_channel.join(show_views); joined_dataset.take(4)" Well, joined_dataset is now an array (because you used .take(4)), so it does not support any RDD operations. Could that be the problem? Otherwise more code is needed to

Re: Spark and simulated annealing

2015-11-29 Thread Gylfi
1) Start by looking at MLlib or KeystoneML. 2) If you can't find an implementation, start by analyzing the access patterns and data manipulations you will need to implement. 3) Then figure out if it fits Spark structures... and when you realize it doesn't, you start speculating on how you can twist or

Re: Debug Spark

2015-11-29 Thread Նարեկ Գալստեան
A question regarding the topic: I am using IntelliJ to write Spark applications and then have to ship the source code to my cluster on the cloud to compile and test. Is there a way to automate the process using IntelliJ? Narek Galstyan Նարեկ Գալստյան On 29 November 2015 at 20:51, Ndjido Ardo

Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
Masf, the following link covers the basics for getting started debugging your Spark apps in local mode: https://medium.com/large-scale-data-processing/how-to-kick-start-spark-development-on-intellij-idea-in-4-steps-c7c8f5c2fe63#.675s86940 Ardo On Sun, Nov 29, 2015 at 5:34 PM, Masf

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread arnalone
Thanks for replying so fast! It was not clear; my code is: joined_dataset = show_channel.join(show_views) For your information, the first lines are: joined_dataset.take(4) Out[93]: [(u'PostModern_Cooking', (u'DEF', 1038)), (u'PostModern_Cooking', (u'DEF', 415)), (u'PostModern_Cooking', (u'DEF',

Re: Debug Spark

2015-11-29 Thread Danny Stephan
Hi, You can use JDWP to debug everything that runs on top of the JVM, including Spark. Specifically with IntelliJ, maybe this link can help you: http://danosipov.com/?p=779 regards, Danny > On 29 Nov 2015, at 17:34, Masf wrote the following >

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Ted Yu
I think you should file a bug. > On Nov 29, 2015, at 9:48 AM, Andy Davidson > wrote: > > Hi Felix and Ted > > This is how I am starting spark > > Should I file a bug? > > Andy > > > export PYSPARK_PYTHON=python3.4 > export PYSPARK_DRIVER_PYTHON=python3.4 >

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thanks for the reply, Yanbo. I understand that the model will be trained using the indexer map created during the training stage. But since I am getting a new set of data during prediction, I have to do StringIndexing on the new data also. Right now I am using a new StringIndexer for this

RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Felix Cheung
Actually, upon closer look, PYTHONHASHSEED should be set (in the worker) when you create a SparkContext: https://github.com/apache/spark/blob/master/python/pyspark/context.py#L166 And it should also be set from spark-submit or pyspark. Can you check sys.version and
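
For reference, a hedged sketch of the workaround usually suggested for this kind of mismatch: pin PYTHONHASHSEED for the executors when building the SparkContext (the seed value 0 below is only illustrative):

import sys
from pyspark import SparkConf, SparkContext

print(sys.version)  # the sys.version check asked about above
conf = (SparkConf()
        .setAppName("portable-hash-check")
        .set("spark.executorEnv.PYTHONHASHSEED", "0"))  # propagate the seed to the workers
sc = SparkContext(conf=conf)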

Re: Retrieve best parameters from CrossValidator

2015-11-29 Thread Yanbo Liang
Hi Ben, We can get the best model from CrossValidatorModel.BestModel; furthermore, we can use the write function of CrossValidatorModel to implement model persistence and use the
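
A PySpark-flavoured sketch of that idea (the thread itself concerns the Scala API, and training_df is a placeholder DataFrame of labeled data):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=3)

cv_model = cv.fit(training_df)    # CrossValidatorModel
best_lr = cv_model.bestModel      # the model refit with the winning parameter set
print(best_lr.extractParamMap())  # inspect the parameters it was trained with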

Re: Retrieve best parameters from CrossValidator

2015-11-29 Thread Benjamin Fradet
Hi Yanbo, Thanks for your answer, I'm looking forward to 1.6 then. On Sun, Nov 29, 2015 at 3:44 PM, Yanbo Liang wrote: > Hi Ben, > > We can get the best model from CrossValidatorModel.BestModel, further > more we can use the write function of CrossValidatorModel >

Re: Spark Streaming on mesos

2015-11-29 Thread Timothy Chen
Hi Renjie, You can set the number of cores per executor with spark executor cores in fine-grained mode. If you want coarse-grained mode to support that, it will be supported in the near term, as the coarse-grained scheduler is getting revamped now. Tim > On Nov 28, 2015, at 7:31 PM, Renjie Liu

Re: Help with Couchbase connector error

2015-11-29 Thread Eyal Sharon
Thanks guys , that was very helpful On Thu, Nov 26, 2015 at 10:29 PM, Shixiong Zhu wrote: > Het Eyal, I just checked the couchbase spark connector jar. The target > version of some of classes are Java 8 (52.0). You can create a ticket in >

Re: storing result of aggregation of spark streaming

2015-11-29 Thread Ted Yu
There is an HBase connector: https://github.com/nerdammer/spark-hbase-connector In the upcoming HBase 2.0 release, the hbase-spark module provides support for Spark directly. Cheers On Sat, Nov 28, 2015 at 10:25 PM, Michael Spector wrote: > Hi Amir, > > You can store results

Debug Spark

2015-11-29 Thread Masf
Hi, Is it possible to debug Spark locally with IntelliJ or another IDE? Thanks -- Regards. Miguel Ángel

Re: Debug Spark

2015-11-29 Thread Ndjido Ardo BAR
hi, IntelliJ is just great for that! cheers, Ardo. On Sun, Nov 29, 2015 at 5:18 PM, Masf wrote: > Hi > > Is it possible to debug spark locally with IntelliJ or another IDE? > > Thanks > > -- > Regards. > Miguel Ángel >
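
For what it's worth, a minimal sketch of an app that runs entirely in local mode, so it can be launched (and break-pointed) straight from an IDE run configuration:

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(master="local[*]", appName="debug-example")
    squares = sc.parallelize(range(10)).map(lambda x: x * x)
    print(squares.collect())  # set a breakpoint here and step through the job
    sc.stop()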

Re: Debug Spark

2015-11-29 Thread Masf
Hi Ardo, Is there a tutorial on debugging with IntelliJ? Thanks Regards. Miguel. On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR wrote: > hi, > > IntelliJ is just great for that! > > cheers, > Ardo. > > On Sun, Nov 29, 2015 at 5:18 PM, Masf wrote: > >> Hi >>

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Jeff Zhang
StringIndexer is an estimator which trains a model to be used in both training & prediction, so it is consistent between training & prediction. You may want to read this section of the Spark ML docs: http://spark.apache.org/docs/latest/ml-guide.html#how-it-works On Mon, Nov 30, 2015 at 12:52
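
A minimal sketch of that point: fit the StringIndexer once (typically inside a Pipeline) and reuse the fitted model, so training and prediction share the same label-to-index mapping. The column names and the DataFrames train_df / new_df are assumed here:

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexer_model = indexer.fit(train_df)              # learns the mapping from the training data
train_indexed = indexer_model.transform(train_df)  # used to train the downstream model
new_indexed = indexer_model.transform(new_df)      # prediction data reuses the same mapping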

Re: Spark Streaming on mesos

2015-11-29 Thread Renjie Liu
Hi Tim, Fine-grained mode is not suitable for streaming applications since it needs to start up an executor each time. When will the revamp get released? In the coming 1.6.0? On Sun, Nov 29, 2015 at 6:16 PM Timothy Chen wrote: > Hi Renjie, > > You can set number of cores per

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Hi. Can't you do a filter to get only the ABC shows, map that into a keyed instance of the show, and then do a reduceByKey to sum up the views? Something like this in Scala code: // filter for the channel, then make a new pair (show, view count) val myAnswer = joined_dataset.filter( _._2._1 == "ABC"

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread arnalone
Yes, that's what I am trying to do, but I do not manage to "point" at the channel field to filter on "ABC" and then, in the map step, to get only shows and views. In Scala you do it with (_._2._1 == "ABC") and (_._1, _._2._2), but I don't find the right syntax in Python to do the same :(
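
A hedged PySpark sketch of the Scala suggestion above, assuming each record of joined_dataset has the shape (show, (channel, views)) as in the take(4) output earlier in the thread:

abc_views = (joined_dataset
             .filter(lambda kv: kv[1][0] == "ABC")  # keep only the ABC channel
             .map(lambda kv: (kv[0], kv[1][1]))     # (show, views)
             .reduceByKey(lambda a, b: a + b))      # total views per show
print(abc_views.take(4))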

Re: General question on using StringIndexer in SparkML

2015-11-29 Thread Vishnu Viswanath
Thank you Jeff. On Sun, Nov 29, 2015 at 7:36 PM, Jeff Zhang wrote: > StringIndexer is an estimator which would train a model to be used both in > training & prediction. So it is consistent between training & prediction. > > You may want to read this section of spark ml doc >

Re: DateTime Support - Hive Parquet

2015-11-29 Thread Cheng Lian
Oh sorry, you're right. Implicit conversion doesn't affect the schema inference process. Just checked that Joda is already a direct dependency of Spark. So I think it's probably fine to add support for recognizing Joda DateTime as SQL TimestampType. Would you mind filing a JIRA ticket for

Re: Parquet files not getting coalesced to smaller number of files

2015-11-29 Thread Cheng Lian
RDD.coalesce(n) returns a new RDD rather than modifying the original RDD. So what you need is: metricsToBeSaved.coalesce(1500).saveAsNewAPIHadoopFile(...) Cheng On 11/29/15 12:21 PM, SRK wrote: Hi, I have the following code that saves the parquet files in my hourly batch to hdfs. My
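
A small sketch of that point (coalesce returns a new RDD and leaves the original untouched), assuming an existing SparkContext sc:

rdd = sc.parallelize(range(1000), 100)
small = rdd.coalesce(10)          # returns a NEW RDD with 10 partitions
print(rdd.getNumPartitions())     # still 100; the original is unchanged
print(small.getNumPartitions())   # 10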

Re: partition RDD of images

2015-11-29 Thread Gylfi
Look at KeystoneML, there is an image processing pipeline there.

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Can't you just access it by element, like with [0] and [1]? http://www.tutorialspoint.com/python/python_tuples.htm

spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.*

2015-11-29 Thread our...@cnsuning.com
Hi all, a java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following Spark SQL on Spark standalone or YARN. The SQL: select ta.* from bi_td.dm_price_seg_td tb join bi_sor.sor_ord_detail_tf ta on 1 = 1 where ta.sale_dt = '20140514' and ta.sale_price >= tb.pri_from