Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-17 Thread Yanbo Liang
If you want to tie them to other data, I think the best way is to use a DataFrame join operation, on the condition that they share an identity column. Thanks Yanbo 2016-08-16 20:39 GMT-07:00 ayan guha : > Hi > > Thank you for your reply. Yes, I can get prediction and original features > together. My
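
For illustration, a minimal sketch of the join approach (the column names "id", "features" and "prediction" are hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// Keep an identity column alongside the features so the model output can be
// joined back to the full original rows.
def tieBack(original: DataFrame, predictions: DataFrame): DataFrame =
  predictions.select("id", "prediction").join(original, Seq("id"))
```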

Re: Spark SQL 1.6.1 issue

2016-08-17 Thread Teik Hooi Beh
Hi, does the version have to be the same down to the minor version, e.g. will 1.6.1 and 1.6.2 give this issue? On Thu, Aug 18, 2016 at 6:27 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > your executors/driver must not have the multiple versions of spark in > classpath, it may com

Re: error when running spark from oozie launcher

2016-08-17 Thread tkg_cangkul
you're right jean, it's a library mismatch between the default oozie sharelib and my spark version. i've replaced it and it works normally now. thx for your help Jean. On 18/08/16 13:37, Jean-Baptiste Onofré wrote: It sounds like a mismatch between the spark version shipped in oozie and the runtime one. Re

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Yanbo Liang
@Michal Yes, we have a public VectorUDT in the spark.mllib package at 1.6, and this class still exists in 2.0. And from 2.0, we provide a new VectorUDT in the spark.ml package and make it private temporarily (it will be public in the near future). Since from 2.0, spark.mllib package will be in maintenance m
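
Concretely, a sketch of a schema built with the public spark.mllib class, which works on both 1.6 and 2.0 (the column name is illustrative):

```scala
import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.StructType

// The spark.mllib VectorUDT is public in 1.6 and 2.0; the spark.ml one is
// private in 2.0, per the note above.
val inputSchema: StructType = new StructType().add("input", new VectorUDT())
```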

Re: error when running spark from oozie launcher

2016-08-17 Thread Jean-Baptiste Onofré
It sounds like a mismatch between the spark version shipped in oozie and the runtime one. Regards JB On Aug 18, 2016, 07:36, at 07:36, tkg_cangkul wrote: >hi olivier, thx for your reply. > >this is the full stacktrace : > >Failing Oozie Launcher, Main class >[org.apache.oozie.action.hadoop.SparkMain

Re: error when running spark from oozie launcher

2016-08-17 Thread tkg_cangkul
hi olivier, thx for your reply. this is the full stacktrace : Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V java.lang.NoSuchMethodError: org.apache.spark

Re: Aggregations with scala pairs

2016-08-17 Thread Jean-Baptiste Onofré
Agreed. Regards JB On Aug 18, 2016, 07:32, at 07:32, Olivier Girardot wrote: >CC'ing dev list, you should open a Jira and a PR related to it to >discuss it c.f. >https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges > > > > > >On W

Re: Aggregations with scala pairs

2016-08-17 Thread Olivier Girardot
CC'ing dev list, you should open a Jira and a PR related to it to discuss it c.f. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges On Wed, Aug 17, 2016 4:01 PM, Andrés Ivaldi iaiva...@gmail.com wrote: Hello, I'd like to report

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-17 Thread Olivier Girardot
that's another "pipeline" step to add, whereas persist is only relevant during the lifetime of your job, and it lives not in HDFS but on the local disk of your executors. On Wed, Aug 17, 2016 5:56 PM, neil90 neilp1...@icloud.com wrote: From the spark documentation (http://spark.apache.org/d

Re: error when running spark from oozie launcher

2016-08-17 Thread Olivier Girardot
this is not the full stacktrace; please post the full stacktrace if you want some help. On Wed, Aug 17, 2016 7:24 PM, tkg_cangkul yuza.ras...@gmail.com wrote: hi, i try to submit a spark job with oozie, but i've got one problem here. when i submit the same job, sometimes my job succeeds but sometim

Re: Spark SQL 1.6.1 issue

2016-08-17 Thread Olivier Girardot
your executors/driver must not have multiple versions of spark in the classpath; it may come from the cassandra connector. check the pom dependencies of the version you fetched and whether it's compatible with your spark version. On Thu, Aug 18, 2016 6:05 AM, thbeh th...@thbeh.com wrote: Running the
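
As a sketch of that check, the connector build should target the same Spark line as the cluster; the versions below are purely illustrative:

```scala
// build.sbt -- mark Spark "provided" so the cluster's jars are the only Spark
// on the classpath, and pick a connector release built against that Spark.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"                 % "1.6.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)
```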

Spark SQL 1.6.1 issue

2016-08-17 Thread thbeh
Running the query below I have been hitting a local class incompatible exception; does anyone know the cause? val rdd = csc.cassandraSql("""select *, concat('Q', d_qoy) as qoy from store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on ss_item_sk = i_item_sk""").groupBy("i_category").piv

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread Divya Gehlot
Can you please check the column order of all the data sets in the union all operations. Are they in the same order? On 9 August 2016 at 02:47, max square wrote: > Hey guys, > > I'm trying to save Dataframe in CSV format after performing unionAll > operations on it. > But I get this exception - > > Exception in t
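
The reason order matters: unionAll resolves columns by position, not by name. A minimal sketch of aligning one frame to the other before the union (frame names are hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Reorder df2's columns to match df1, then union positionally.
def alignedUnion(df1: DataFrame, df2: DataFrame): DataFrame =
  df1.unionAll(df2.select(df1.columns.map(col): _*))
```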

RE: pyspark.sql.functions.last not working as expected

2016-08-17 Thread Alexander Peletz
So here is the test case from the commit adding the first/last methods here: https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162 + test("last/first with ignoreNulls") { + val nullStr: String = null + val df = Seq( + ("a", 0, nullStr),

Re: How to combine two DStreams(pyspark)?

2016-08-17 Thread ayan guha
Wondering why you are creating separate dstreams? You should apply the logic directly on the input dstream. On 18 Aug 2016 06:40, "vidhan" wrote: > I have a *kafka* stream coming in with some input topic. > This is the code i wrote for accepting *kafka* stream. > > *>>> conf = SparkConf().setAppName(a
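
If two streams of the same element type really are needed, they can also be merged with union (pyspark has a method of the same name); a minimal Scala sketch:

```scala
import org.apache.spark.streaming.dstream.DStream

// Merge two DStreams of the same element type into one.
def combine(a: DStream[String], b: DStream[String]): DStream[String] =
  a.union(b)
```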

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread max square
Thanks Harsh for the reply. When I change the code to something like this - def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir: String) = { fileSystem.rename(new Path(bakDir + latest), new Path(bakDir + "/" + ScalaUtil.currentDateTimeString)) fileSystem.create(new Path(bak

How to combine two DStreams(pyspark)?

2016-08-17 Thread vidhan
I have a *kafka* stream coming in with some input topic. This is the code i wrote for accepting *kafka* stream. *>>> conf = SparkConf().setAppName(appname) >>> sc = SparkContext(conf=conf) >>> ssc = StreamingContext(sc) >>> kvs = KafkaUtils.createDirectStream(ssc, topics,\ {"metada

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-17 Thread Michał Zieliński
I'm using Spark 1.6.2 for a Vector-based UDAF and this works: def inputSchema: StructType = new StructType().add("input", new VectorUDT()) Maybe it was made private in 2.0 On 17 August 2016 at 05:31, Alexey Svyatkovskiy wrote: > Hi Yanbo, > > Thanks for your reply. I will keep an eye on that pul

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Nisha Muktewar
The OneHotEncoder does *not* accept multiple columns. You can use Michal's suggestion, where he uses a Pipeline to set the stages and then executes them. The other option is to write a function that performs one hot encoding on a column and returns a dataframe with the encoded column and then call i
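
A sketch of that per-column helper, assuming the input columns have already been string-indexed to numeric values (all names are illustrative):

```scala
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.DataFrame

// Encode one (already indexed, numeric) column and return the new frame.
def encode(df: DataFrame, inputCol: String): DataFrame =
  new OneHotEncoder()
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_encoded")
    .transform(df)

// Fold the helper over however many columns need encoding.
def encodeAll(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df)(encode)
```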

Re: Undefined function json_array_to_map

2016-08-17 Thread vr spark
Hi Ted/All, i did the below to get the full stack trace (see below), but am not able to understand the root cause.. except Exception as error: traceback.print_exc() and this is what i get... File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql return DataFrame(self._ssql_ctx.

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
I had already tried it this way: scala> val featureCols = Array("category","newone") featureCols: Array[String] = Array(category, newone) scala> val indexer = new StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1) :29: error: type mismatch; found : Array[String] re
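
The mismatch is because setInputCol takes a single String, not an Array; a sketch of the usual fix, one indexer per column (output names are illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

val featureCols = Array("category", "newone")
// One StringIndexer per input column; the stages can then go into a Pipeline.
val indexers = featureCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "Index")
}
```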

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Nisha Muktewar
I don't think it does. From the documentation: https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder, I see that it still accepts one column at a time. On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty wrote: > 2.0: > > One hot encoding currently accepts single input column

Re: Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread Michał Zieliński
You can just map over your columns and create a pipeline: val columns = Array("colA", "colB", "colC") val transformers: Array[PipelineStage] = columns.map { x => new OneHotEncoder().setInputCol(x).setOutputCol(x + "Encoded") } val pipeline = new Pipeline() .setStages(transformers) On 17 Au

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread HARSH TAKKAR
Hi, I can see that the exception is caused by the following; can you check where in your code you are using this path: Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://testcluster:8020/experiments/vol/spark_chomp_data/bak/restaurants-bak/latest On Wed, 17 Aug 20

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread max square
/bump It'd be great if someone could point me in the right direction. On Mon, Aug 8, 2016 at 5:07 PM, max square wrote: > Here's the complete stacktrace - https://gist.github.com/rohann/ > 649b0fcc9d5062ef792eddebf5a315c1 > > For reference, here's the complete function - > > def saveAsLatest(

error when running spark from oozie launcher

2016-08-17 Thread tkg_cangkul
hi, i try to submit a spark job with oozie, but i've got one problem here. when i submit the same job, sometimes my job succeeds but sometimes it fails. i've got this error message when the job failed: org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V

Extract year from string format of date

2016-08-17 Thread Selvam Raman
Spark Version : 1.5.0 Record: 01-Jan-16 Expected Result: 2016 I used the below code, which was shared in the user group: from_unixtime(unix_timestamp($"Creation Date","dd-MMM-yy"),"yyyy")) is this the right approach or do we have any other approach? NOTE: i tried the *year()* function but it gives only nu
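
A sketch of both routes on a Spark 1.5-era DataFrame, assuming the string column is named "Creation Date" as above (df is a placeholder):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp, year}

// Parse "01-Jan-16"-style strings with the dd-MMM-yy pattern.
def withYear(df: DataFrame): DataFrame = {
  val ts = unix_timestamp(col("Creation Date"), "dd-MMM-yy")
  df.withColumn("year_str", from_unixtime(ts, "yyyy"))             // "2016" as a string
    .withColumn("year_num", year(from_unixtime(ts).cast("date")))  // 2016 once it is a real date
}
```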

Spark ML : One hot Encoding for multiple columns

2016-08-17 Thread janardhan shetty
2.0: One hot encoding currently accepts a single input column; is there a way to include multiple columns?

[Community] Python support added to Spark Job Server

2016-08-17 Thread Evan Chan
Hi folks, Just a friendly message that we have added Python support to the REST Spark Job Server project. If you are a Python user looking for a RESTful way to manage your Spark jobs, please come have a look at our project! https://github.com/spark-jobserver/spark-jobserver -Evan ---

Re: UDF in SparkR

2016-08-17 Thread Yann-Aël Le Borgne
I experienced very slow execution times http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow and am wondering why... On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung wrote: > This is supported in Spark 2.0.0 as dapply and gapply. Please see the API > doc: > https://spark.apache.or

Re: Attempting to accept an unknown offer

2016-08-17 Thread vr spark
My code is very simple; if i use other hive tables, my code works fine. This particular table (virtual view) is huge and might have more metadata. It has only two columns. virtual view name is : cluster_table # col_name data_type ln string parti in

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Please include user@ in your reply. Can you reveal the snippet of hive sql ? On Wed, Aug 17, 2016 at 9:04 AM, vr spark wrote: > spark 1.6.1 > mesos > job is running for like 10-15 minutes and giving this message and i killed > it. > > In this job, i am creating data frame from a hive sql. There

pyspark.sql.functions.last not working as expected

2016-08-17 Thread Alexander Peletz
Hi, I am using Spark 2.0 and I am getting unexpected results using the last() method. Has anyone else experienced this? I get the sense that last() is working correctly within a given data partition but not across the entire RDD. First() seems to work as expected so I can work around this by ha
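
One common way to make last() deterministic is an explicitly ordered window spanning the whole partition; a sketch with hypothetical column names ("key", "ts", "value"):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

// Order within each key and span the entire frame, so last() always sees the
// same final row regardless of the physical data layout.
def lastPerKey(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("key").orderBy("ts")
                .rowsBetween(Long.MinValue, Long.MaxValue)
  df.select(col("key"), last(col("value"), ignoreNulls = true).over(w).as("last_value"))
    .distinct()
}
```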

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-17 Thread neil90
From the spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence), yes you can use persist on a dataframe instead of cache. All cache is, is a shorthand for the default persist storage level "MEMORY_ONLY". If you want to persist the dataframe to disk you shoul
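
A minimal sketch of the disk-backed option (the storage level choice is illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Keep what fits in memory and spill the rest to executor-local disk.
def cacheToDisk(df: DataFrame): DataFrame =
  df.persist(StorageLevel.MEMORY_AND_DISK)
```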

Attempting to accept an unknown offer

2016-08-17 Thread vr spark
W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492 W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493 W0816 23:17:01.985124 16360 sched.cpp

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Can you provide more information ? Were you running on YARN ? Which version of Spark are you using ? Was your job failing ? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > > W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an > unknown offer b859f2f3-7484-482d-8c0d-3

Re: Undefined function json_array_to_map

2016-08-17 Thread Ted Yu
Can you show the complete stack trace ? Which version of Spark are you using ? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > Hi, > I am getting error on below scenario. Please suggest. > > i have a virtual view in hive > > view name log_data > it has 2 columns > > query_map

Undefined function json_array_to_map

2016-08-17 Thread vr spark
Hi, I am getting an error in the below scenario. Please suggest. i have a virtual view in hive, view name log_data. it has 2 columns: query_map map, parti_date int. Here is my snippet for the spark data frame: res=sqlcont.sql("select parti_date FROM log_data WHERE par

Aggregations with scala pairs

2016-08-17 Thread Andrés Ivaldi
Hello, I'd like to report wrong behavior in the DataSet API, but I don't know how to do that; my Jira account doesn't allow me to add an Issue. I'm using Apache 2.0.0 but the problem has existed since at least version 1.4 (per the doc, since 1.3). The problem is simple to reproduce, as is the workaround,

Re: UDF in SparkR

2016-08-17 Thread Felix Cheung
This is supported in Spark 2.0.0 as dapply and gapply. Please see the API doc: https://spark.apache.org/docs/2.0.0/api/R/ Feedback welcome and appreciated! From: Yogesh Vyas <informy...@gmail.com> Sent: Tuesday, August 16, 2016 11:39 PM Subject: UDF in SparkR

How to implement a InputDStream like the twitter stream in Spark?

2016-08-17 Thread Xi Shen
Hi, First, I am not sure if I should inherit from InputDStream or ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a receiver on each worker node? If I want to inherit InputDStream, what should I do in the compute() method? -- Thanks, David S.
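
For the InputDStream route, compute() runs on the driver once per batch interval and returns that batch's RDD, if any. A minimal hypothetical sketch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Driver-side stream that re-emits a fixed sequence every batch. No receiver
// is needed because no long-running connection has to live on the workers;
// ReceiverInputDStream is for sources that do need one per executor.
class FixedSeqInputDStream(ssc: StreamingContext, data: Seq[String])
    extends InputDStream[String](ssc) {
  override def start(): Unit = {}   // open any connections here
  override def stop(): Unit = {}    // and release them here
  override def compute(validTime: Time): Option[RDD[String]] =
    Some(ssc.sparkContext.parallelize(data))
}
```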

Spark standalone or Yarn for resourcing

2016-08-17 Thread Ashok Kumar
Hi, for small to medium sized clusters I think Spark Standalone mode is a good choice. We are contemplating moving to Yarn as our cluster grows. What are the pros and cons of each, please? Which one offers the best Thanking you

Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-17 Thread luohui20001
Hello guys: I have a problem loading a recommendation model. I have 2 models; one is good (able to get a recommendation result) and the other is not working. I checked these 2 models; both are MatrixFactorizationModel objects. But in the metadata, one is a PipelineModel and another is a MatrixFactorizatio
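
For reference, a sketch of the save/load round trip that load's metadata check expects (path and user id are placeholders):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// load reads the saved metadata's class field, so a model directory written
// by a different save path (e.g. a PipelineModel) may fail to deserialize
// as a MatrixFactorizationModel.
def reload(sc: SparkContext, path: String): MatrixFactorizationModel =
  MatrixFactorizationModel.load(sc, path)

// Example: reload(sc, "/tmp/als-model").recommendProducts(1, 10)
```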