Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Pralabh Kumar
ng Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski > > On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar >

Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Jacek Laskowski
tured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Thu, Feb 8, 2018 at 7:25 AM, Pralabh Kumar wrote: > Hi > > Spark 2.0 doesn't support stored by . Is there any alternative to achieve > the same. > > >

Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-07 Thread Pralabh Kumar
Hi, Spark 2.0 doesn't support "stored by". Is there any alternative to achieve the same?
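No concrete alternative is given in this thread. One route that is sometimes used (an assumption on my part, not something the posters confirmed) is to register the external store through Spark's data source API with CREATE TABLE ... USING instead of Hive's STORED BY storage handler. A sketch, with a hypothetical provider class and options:

```scala
// Sketch only: the provider class and OPTIONS below are placeholders; substitute
// the Spark connector for whatever system your Hive storage handler pointed at.
spark.sql(
  """CREATE TABLE events
    |USING com.example.spark.SomeConnector
    |OPTIONS (table 'EVENTS', zkUrl 'host:2181')""".stripMargin)
```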

Re: spark 2.0 and spark 2.2

2018-01-22 Thread Xiao Li
. Thanks, Xiao 2018-01-22 7:07 GMT-08:00 Mihai Iacob : > Does spark 2.2 have good backwards compatibility? Is there something that > won't work that works in spark 2.0? > > > Regards, > > *Mihai Iacob* > DSX Local <https://datascience.ibm.com/

spark 2.0 and spark 2.2

2018-01-22 Thread Mihai Iacob
Does spark 2.2 have good backwards compatibility? Is there something that won't work that works in spark 2.0? Regards, Mihai Iacob, DSX Local - Sec

Re: Spark 2.0 and Oracle 12.1 error

2017-07-24 Thread Cassa L
ote: > >> Could you share the schema of your Oracle table and open a JIRA? >> >> Thanks! >> >> Xiao >> >> >> 2017-07-21 9:40 GMT-07:00 Cassa L : >> >>> I am using 2.2.0. I resolved the problem by removing SELECT * and adding >&

Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Cassa L
at 10:12 AM, Xiao Li wrote: > Could you share the schema of your Oracle table and open a JIRA? > > Thanks! > > Xiao > > > 2017-07-21 9:40 GMT-07:00 Cassa L : > >> I am using 2.2.0. I resolved the problem by removing SELECT * and adding >> column names

Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Xiao Li
t 11:10 PM Cassa L wrote: >> >>> Hi, >>> I am trying to use Spark to read from Oracle (12.1) table using Spark >>> 2.0. My table has JSON data. I am getting below exception in my code. Any >>> clue? >>> >>> >>>>> >>&g

Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Cassa L
ssues in the latest > release. > > Thanks > > Xiao > > > On Wed, 19 Jul 2017 at 11:10 PM Cassa L wrote: > >> Hi, >> I am trying to use Spark to read from Oracle (12.1) table using Spark >> 2.0. My table has JSON data. I am getting below exception in my co

Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Xiao Li
Could you try 2.2? We fixed multiple Oracle related issues in the latest release. Thanks Xiao On Wed, 19 Jul 2017 at 11:10 PM Cassa L wrote: > Hi, > I am trying to use Spark to read from Oracle (12.1) table using Spark 2.0. > My table has JSON data. I am getting below exception i

Re: Spark 2.0 and Oracle 12.1 error

2017-07-19 Thread ayan guha
: > Hi, > I am trying to use Spark to read from Oracle (12.1) table using Spark 2.0. > My table has JSON data. I am getting below exception in my code. Any clue? > > >>>>> > java.sql.SQLException: Unsupported type -101 > > at org.apache.spark.sql.execution.da

Spark-2.0 and Oracle 12.1 error: Unsupported type -101

2017-07-19 Thread Cassa L
Hi, I am trying to read data into Spark from Oracle using the ojdbc7 driver. The data is in JSON format. I am getting the error below. Any idea on how to resolve it? java.sql.SQLException: Unsupported type -101 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$dat

Spark 2.0 and Oracle 12.1 error

2017-07-19 Thread Cassa L
Hi, I am trying to use Spark to read from Oracle (12.1) table using Spark 2.0. My table has JSON data. I am getting below exception in my code. Any clue? >>>>> java.sql.SQLException: Unsupported type -101 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$a

command to get list of all persisted RDDs in Spark 2.0 Scala shell

2017-06-01 Thread nancy henry
Hi Team, Please let me know how to get the list of all persisted RDDs in the Spark 2.0 shell. Regards, Nancy
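For reference, a minimal sketch of one way to do this in the 2.0 shell, using SparkContext.getPersistentRDDs (a developer API that returns the currently persisted RDDs keyed by id):

```scala
// List RDDs that are currently marked as persisted, with their storage levels.
val persisted = spark.sparkContext.getPersistentRDDs
persisted.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel}")
}
```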

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-21 Thread Dongjin Lee
Hi Chetan, Sadly, you can not; Spark is configured to ignore the null values when writing JSON. (check JacksonMessageWriter and find JsonInclude.Include.NON_NULL from the code.) If you want that functionality, it would be much better to file the problem to JIRA. Best, Dongjin On Mon, Mar 20, 201
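A hedged workaround sketch (not proposed in the thread): render each record yourself with a JSON library configured to keep nulls, instead of relying on Spark's JSON writer. The case class, dataset name, and output path below are assumptions.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import spark.implicits._

case class Person(first_name: String, last_name: String)  // hypothetical record type

// ds: Dataset[Person] is assumed. The mapper is built inside mapPartitions so it
// is not captured in the task closure; Jackson writes null fields by default.
val jsonLines = ds.mapPartitions { rows =>
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  rows.map(p => mapper.writeValueAsString(p))
}
jsonLines.write.text("/tmp/people_with_nulls")  // placeholder output path
```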

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
Exactly. On Sat, Mar 11, 2017 at 1:35 PM, Dongjin Lee wrote: > Hello Chetan, > > Could you post some code? If I understood correctly, you are trying to > save JSON like: > > { > "first_name": "Dongjin", > "last_name: null > } > > not in omitted form, like: > > { > "first_name": "Dongjin" >

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-11 Thread Dongjin Lee
Hello Chetan, Could you post some code? If I understood correctly, you are trying to save JSON like: { "first_name": "Dongjin", "last_name: null } not in omitted form, like: { "first_name": "Dongjin" } right? - Dongjin On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri wrote: > Hello Dev

Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users, I am migrating PySpark code to Scala. In Python, iterating over a dictionary and generating JSON with null values is possible with json.dumps(), which can then be converted to SparkSQL[Row]; but in Scala, how can we generate JSON with null values as a DataFrame? Thanks.

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
How about running this - select * from (select *, count(*) over (partition by id order by id) cnt from filteredDS) f where f.cnt < 7500 On Sun, Mar 5, 2017 at 12:05 PM, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Yes every time I run this code with production scale data it fails. Test

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread Ankur Srivastava
Yes every time I run this code with production scale data it fails. Test case with small dataset of 50 records on local box runs fine. Thanks Ankur Sent from my iPhone > On Mar 4, 2017, at 12:09 PM, ayan guha wrote: > > Just to be sure, can you reproduce the error using sql api? > >> On Sat,

Re: Spark 2.0 issue with left_outer join

2017-03-04 Thread ayan guha
Just to be sure, can you reproduce the error using sql api? On Sat, 4 Mar 2017 at 2:32 pm, Ankur Srivastava wrote: > Adding DEV. > > Or is there any other way to do subtractByKey using Dataset APIs? > > Thanks > Ankur > > On Wed, Mar 1, 2017 at 1:28 PM, Ankur Srivastava < > ankur.srivast...@gmai

Re: Spark 2.0 issue with left_outer join

2017-03-03 Thread Ankur Srivastava
Adding DEV. Or is there any other way to do subtractByKey using Dataset APIs? Thanks Ankur On Wed, Mar 1, 2017 at 1:28 PM, Ankur Srivastava wrote: > Hi Users, > > We are facing an issue with left_outer join using Spark Dataset api in 2.0 > Java API. Below is the code we have > > Dataset badIds

Spark 2.0 issue with left_outer join

2017-03-01 Thread Ankur Srivastava
Hi Users, We are facing an issue with left_outer join using Spark Dataset api in 2.0 Java API. Below is the code we have Dataset<Row> badIds = filteredDS.groupBy(col("id").alias("bid")).count() .filter((FilterFunction<Row>) row -> (Long) row.getAs("count") > 75000); _logger.info("Id count with over
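For the subtractByKey question raised in the replies above, one Dataset-level alternative (a sketch, not something confirmed by the posters) is an anti join against the over-represented ids; names mirror the snippet above:

```scala
import org.apache.spark.sql.functions.col

// ids that occur more than 75000 times
val badIds = filteredDS.groupBy(col("id").alias("bid")).count()
  .filter(col("count") > 75000L)

// keep only rows whose id is NOT in badIds -- roughly the Dataset-era subtractByKey
val goodRows = filteredDS.join(badIds, col("id") === col("bid"), "left_anti")
```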

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Bill Schwanitz
Subhash, Yea that did the trick thanks! On Wed, Mar 1, 2017 at 12:20 PM, Subhash Sriram wrote: > If I am understanding your problem correctly, I think you can just create > a new DataFrame that is a transformation of sample_data by first > registering sample_data as a temp table. > > //Register

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Subhash Sriram
If I am understanding your problem correctly, I think you can just create a new DataFrame that is a transformation of sample_data by first registering sample_data as a temp table. //Register temp table sample_data.createOrReplaceTempView("sql_sample_data") //Create new DataSet with transformed va
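A minimal sketch of that approach (the view name follows the reply above; the column transformation is an assumption):

```scala
// Register the existing DataFrame/Dataset as a temp view, then express the
// transformation in SQL.
sample_data.createOrReplaceTempView("sql_sample_data")
val transformed = spark.sql(
  "SELECT id, upper(name) AS name_upper FROM sql_sample_data")
```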

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Marco Mistroni
Hi I think u need an UDF if u want to transform a column Hth On 1 Mar 2017 4:22 pm, "Bill Schwanitz" wrote: > Hi all, > > I'm fairly new to spark and scala so bear with me. > > I'm working with a dataset containing a set of column / fields. The data > is stored in hdfs as parquet and is sour
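The UDF route, as a hedged sketch (the column name and transformation are assumptions):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Wrap an ordinary Scala function as a UDF and apply it column-wise.
val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)
val result = sample_data.withColumn("name_clean", normalize(col("name")))
```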

question on transforms for spark 2.0 dataset

2017-03-01 Thread Bill Schwanitz
Hi all, I'm fairly new to spark and scala so bear with me. I'm working with a dataset containing a set of column / fields. The data is stored in hdfs as parquet and is sourced from a postgres box so fields and values are reasonably well formed. We are in the process of trying out a switch from pe

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread arun kumar Natva
perform some joins, aggregations and finally >> generate a dense vector to perform analytics. >> >> The code runs in 45 minutes on spark 1.6 on a 4 node cluster. When the >> same >> code is migrated to run on spark 2.0 on the same cluster, it takes around >> 4-

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Jörn Franke
and finally >> generate a dense vector to perform analytics. >> >> The code runs in 45 minutes on spark 1.6 on a 4 node cluster. When the same >> code is migrated to run on spark 2.0 on the same cluster, it takes around >> 4-5 hours. It is surprising and frustrating.

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Timur Shenkao
and perform some joins, aggregations and finally > generate a dense vector to perform analytics. > > The code runs in 45 minutes on spark 1.6 on a 4 node cluster. When the same > code is migrated to run on spark 2.0 on the same cluster, it takes around > 4-5 hours. It is surprising

My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread anatva
Hi, I am reading an ORC file, and perform some joins, aggregations and finally generate a dense vector to perform analytics. The code runs in 45 minutes on spark 1.6 on a 4 node cluster. When the same code is migrated to run on spark 2.0 on the same cluster, it takes around 4-5 hours. It is

Re: Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread Cody Koeninger
Pretty sure there was no 0.10.0.2 release of apache kafka. If that's a hortonworks modified version you may get better results asking in a hortonworks specific forum. Scala version of kafka shouldn't be relevant either way though. On Wed, Feb 8, 2017 at 5:30 PM, u...@moosheimer.com wrote: > Dea

Spark 2.0 Scala 2.11 and Kafka 0.10 Scala 2.10

2017-02-08 Thread u...@moosheimer.com
Dear devs, is it possible to use Spark 2.0.2 (Scala 2.11) and consume messages from a Kafka server 0.10.0.2 running on Scala 2.10? I tried this for the last two days using createDirectStream and can't get any messages out of Kafka?! I'm using HDP 2.5.3 running kafka_2.10-0.10.0.2.5.3.0-37 and Spark 2.0.
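For comparison, a baseline sketch of the 0-10 direct-stream setup (broker address, group id, topic, and the surrounding StreamingContext `ssc` are assumptions); it does not address the HDP/Scala-version question itself:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:6667",            // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",           // placeholder
  "auto.offset.reset"  -> "earliest")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))
stream.map(_.value).print()
```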

RE: Jars directory in Spark 2.0

2017-02-01 Thread Sidney Feiner
Feiner Cc: Koert Kuipers ; user@spark.apache.org Subject: Re: Jars directory in Spark 2.0 Spark has never shaded dependencies (in the sense of renaming the classes), with a couple of exceptions (Guava and Jetty). So that behavior is nothing new. Spark's dependencies themselves have a l

Re: Jars directory in Spark 2.0

2017-02-01 Thread Marcelo Vanzin
7720 <+972%2052-819-7720> */* Skype: sidney.feiner.startapp > > > > [image: StartApp] <http://www.startapp.com/> > > > > *From:* Koert Kuipers [mailto:ko...@tresata.com] > *Sent:* Tuesday, January 31, 2017 7:26 PM > *To:* Sidney Feiner > *Cc:* user@spark.apache.o

RE: Jars directory in Spark 2.0

2017-01-31 Thread Sidney Feiner
ert Kuipers [mailto:ko...@tresata.com] Sent: Tuesday, January 31, 2017 7:26 PM To: Sidney Feiner Cc: user@spark.apache.org Subject: Re: Jars directory in Spark 2.0 you basically have to keep your versions of dependencies in line with sparks or shade your own dependencies. you cannot just replace

Re: Jars directory in Spark 2.0

2017-01-31 Thread Koert Kuipers
you basically have to keep your versions of dependencies in line with sparks or shade your own dependencies. you cannot just replace the jars in sparks jars folder. if you wan to update them you have to build spark yourself with updated dependencies and confirm it compiles, passes tests etc. On T

Jars directory in Spark 2.0

2017-01-31 Thread Sidney Feiner
Hey, While migrating to Spark 2.X from 1.6, I've had many issues with jars that come preloaded with Spark in the "jars/" directory and I had to shade most of my packages. Can I replace the jars in this folder with more up-to-date versions? Are those jars used for anything internal in Spark which me

Re: Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
sorry, should have done more research before jumping to the list. The version of the connector is 2.0.0, available from Maven repos. sorry

Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
HI all in searching on how to use Spark 2.0 with mongo i came across this link https://jira.mongodb.org/browse/SPARK-20 i amended my build.sbt (content below), however the mongodb dependency was not found Could anyone assist? kr marco name := "SparkExamples" version := "1.

Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
.r.t. stability? > > > > Regards, > > Ankur > > > > *From:* Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] > *Sent:* Monday, January 09, 2017 5:02 PM > *To:* Ankur Jain > *Cc:* user@spark.apache.org > *Subject:* Re: Machine Learning in Spark 1.

RE: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Thanks Rezaul… Is Spark 2.1.0 still have any issues w.r.t. stability? Regards, Ankur From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] Sent: Monday, January 09, 2017 5:02 PM To: Ankur Jain Cc: user@spark.apache.org Subject: Re: Machine Learning in Spark 1.6 vs Spark 2.0 Hello

Re: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Md. Rezaul Karim
Hello Jain, I would recommend using Spark MLlib (and ML) of *Spark 2.1.0* with the following features: - ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering - Featurizatio

Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Hi Team, I want to start a new project with ML, but wanted to know which version of Spark is more stable and has more features w.r.t. ML. Please suggest your opinion... Thanks in Advance... Thanks & Regards Ankur Jain Technical Architect - Big Data | IoT | I

Re: Kafka 0.8 + Spark 2.0 Partition Issue

2017-01-06 Thread Cody Koeninger
Kafka is designed to only allow reads from leaders. You need to fix this at the kafka level not the spark level. On Fri, Jan 6, 2017 at 7:33 AM, Raghu Vadapalli wrote: > > My spark 2.0 + kafka 0.8 streaming job fails with error partition leaderset > exception. When I check the kafka

Kafka 0.8 + Spark 2.0 Partition Issue

2017-01-06 Thread Raghu Vadapalli
My spark 2.0 + kafka 0.8 streaming job fails with a partition leaderset exception. When I check the kafka topic's partition, it is indeed in error, with Leader = -1 and an empty ISR. I did a lot of googling and all results point to either restarting or deleting the topic. To do any of those

Re: Why does Spark 2.0 change number of partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
Spark 2.0.0 introduced "Automatic file coalescing for native data sources" ( http://spark.apache.org/releases/spark-release-2-0-0.html#performance-and-runtime). Perhaps that is the cause? I'm not sure if this feature is mentioned anywhere in the documentation or if there's any way to disable it.
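If that coalescing is indeed the cause, the settings that usually govern how files are packed into read partitions (my assumption; not confirmed in the thread) are the file-source options added in 2.0:

```scala
// Smaller values => more input partitions when reading file-based sources.
// The numbers below are illustrative, not recommendations.
spark.conf.set("spark.sql.files.maxPartitionBytes", 16L * 1024 * 1024) // default 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)    // default 4 MB
val df = spark.read.parquet("/path/to/table")  // placeholder path
println(df.rdd.getNumPartitions)
```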

Why does Spark 2.0 change number of partitions when reading a parquet file?

2016-12-22 Thread Kristina Rogale Plazonic
Hi, I write a randomly generated 30,000-row dataframe to parquet. I verify that it has 200 partitions (both in Spark and inspecting the parquet file in hdfs). When I read it back in, it has 23 partitions?! Is there some optimization going on? (This doesn't happen in Spark 1.5) *How can I force i

Re: About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Calvin Jia
Hi, Alluxio will allow you to share or cache data in-memory between different Spark contexts by storing RDDs or Dataframes as a file in the Alluxio system. The files can then be accessed by any Spark job like a file in any other distributed storage system. These two blogs do a good job of summari
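A sketch of the pattern described above (the Alluxio master address and path are placeholders): one application writes the DataFrame to an alluxio:// path, and any other Spark context reads it back like any other file.

```scala
// Application A: persist the DataFrame into Alluxio-managed storage.
df.write.parquet("alluxio://alluxio-master:19998/shared/events")

// Application B (a different SparkContext): read the shared data back.
val shared = spark.read.parquet("alluxio://alluxio-master:19998/shared/events")
```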

About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Chetan Khatri
Hello Guys, What would be the approach to accomplish a Spark multiple shared context with and without Alluxio, and what would be the best practice to achieve parallelism and concurrency for Spark jobs? Thanks. -- Yours Aye, Chetan Khatri. M. +91 7 80574 Data Science Researcher INDIA S

Fwd: [Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0

2016-12-01 Thread w.zhaokang
Hi all, In the old Spark RDD API, key-value PairRDDs can be co-partitioned to avoid shuffle thus bringing us high join performance. In the new Dataset API in Spark 2.0, is the high performance shuffle-free join by co-partition mechanism still feasible? I have looked through the API doc but

[Spark Dataset]: How to conduct co-partition join in the new Dataset API in Spark 2.0

2016-12-01 Thread Dale Wang
Hi all, In the old Spark RDD API, key-value PairRDDs can be co-partitioned to avoid shuffle thus bringing us high join performance. In the new Dataset API in Spark 2.0, is the high performance shuffle-free join by co-partition mechanism still feasible? I have looked through the API doc but
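One mechanism the 2.0 Dataset API does offer for shuffle-avoiding joins is bucketed tables; whether it fully replaces RDD co-partitioning is exactly the question asked, so treat the following as a hedged sketch (table and column names are assumptions) rather than an answer:

```scala
// Write both sides bucketed and sorted on the join key into the catalog.
left.write.bucketBy(64, "key").sortBy("key").saveAsTable("left_bucketed")
right.write.bucketBy(64, "key").sortBy("key").saveAsTable("right_bucketed")

// With matching bucket specs, the join can avoid shuffling both sides.
val joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "key")
```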

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Reynold Xin
gupta >> >> On Wed, Nov 30, 2016 at 5:35 PM, Yin Huai wrote: >> >>> Hello Michael, >>> >>> Thank you for reporting this issue. It will be fixed by >>> https://github.com/apache/spark/pull/16080. >>> >>> Thanks, >>>

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Timur Shenkao
t; >> Yin >> >> On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao >> wrote: >> >>> Hi! >>> >>> Do you have real HIVE installation? >>> Have you built Spark 2.1 & Spark 2.0 with HIVE support ( -Phive >>> -Phive-thriftserver )

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Gourav Sengupta
t will be fixed by > https://github.com/apache/spark/pull/16080. > > Thanks, > > Yin > > On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao wrote: > >> Hi! >> >> Do you have real HIVE installation? >> Have you built Spark 2.1 & Spark 2.0 with HIVE suppor

SPARK 2.0 CSV exports (https://issues.apache.org/jira/browse/SPARK-16893)

2016-11-30 Thread Gourav Sengupta
Hi Sean, I think that the main issue was users importing the package while starting SPARK just like the way we used to do in SPARK 1.6. After removing that option from --package while starting SPARK 2.0 the issue of conflicting libraries disappeared. I have written about this in https

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-30 Thread Yin Huai
Hello Michael, Thank you for reporting this issue. It will be fixed by https://github.com/apache/spark/pull/16080. Thanks, Yin On Tue, Nov 29, 2016 at 11:34 PM, Timur Shenkao wrote: > Hi! > > Do you have real HIVE installation? > Have you built Spark 2.1 & Spark 2.0 w

Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Timur Shenkao
Hi! Do you have real HIVE installation? Have you built Spark 2.1 & Spark 2.0 with HIVE support ( -Phive -Phive-thriftserver ) ? It seems that you use "default" Spark's HIVE 1.2.1. Your metadata is stored in local Derby DB which is visible to concrete Spark installation but n

Pasting oddity with Spark 2.0 (scala)

2016-11-14 Thread jggg777
This one has stumped the group here, hoping to get some insight into why this error is happening. I'm going through the Databricks DataFrames scala docs

Hive Queries are running very slowly in Spark 2.0

2016-11-09 Thread Jaya Shankar Vadisela
Hi All, I have the simple HIVE query below. We have a use-case where we run multiple HIVE queries in parallel, in our case 16 (the number of cores in our machine), using a Scala PAR array. In Spark 1.6 it executes in 10 secs but in Spark 2.0 the same queries take 5 mins. "select * fro

Re: Do you use spark 2.0 in work?

2016-10-31 Thread Michael Armbrust
Yes, all of our production pipelines have been ported to Spark 2.0. On Mon, Oct 31, 2016 at 1:16 AM, Yang Cao wrote: > Hi guys, > > Just for personal interest. I wonder whether spark 2.0 a productive > version? Is there any company use this version as its main version in daily

Re: Do you use spark 2.0 in work?

2016-10-31 Thread Andy Dang
This is my personal email so I can't exactly discuss work-related topics. But yes, many teams in my company use Spark 2.0 in production environment. What are the challenges that prevent you from adopting it (besides migration from Spark 1.x)? --- Regards, Andy On Mon, Oct 31, 2016 at

Do you use spark 2.0 in work?

2016-10-31 Thread Yang Cao
Hi guys, Just for personal interest: I wonder whether spark 2.0 is a production-ready version? Is there any company using this version as its main version in daily work? THX

Re: Spark 2.0 with Hadoop 3.0?

2016-10-30 Thread adam kramer
Hadoop 3.0 is a non-starter for use with Spark > 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which > would resolve our driver dependency issues. > > > what version problems are you having there? > > > There's a patch to move to AWS SDK 10.10, but that ha

Re: Spark 2.0 with Hadoop 3.0?

2016-10-29 Thread Steve Loughran
On 27 Oct 2016, at 23:04, adam kramer wrote: Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases? Is there any reason why Hadoop 3.0 is a non-starter for use with Spark 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which

Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Zoltán Zvara
API-wise. > > On Thu, Oct 27, 2016 at 11:04 PM adam kramer wrote: > > Is the version of Spark built for Hadoop 2.7 and later only for 2.x > releases? > > Is there any reason why Hadoop 3.0 is a non-starter for use with Spark > 2.0? The version of aws-sdk in 3.0 actually work

Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Sean Owen
> > Is there any reason why Hadoop 3.0 is a non-starter for use with Spark > 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which > would resolve our driver dependency issues. > > Thanks, > Adam > >

Spark 2.0 with Hadoop 3.0?

2016-10-27 Thread adam kramer
Is the version of Spark built for Hadoop 2.7 and later only for 2.x releases? Is there any reason why Hadoop 3.0 is a non-starter for use with Spark 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which would resolve our driver dependency issues. Thanks, Adam

Spark 2.0 on HDP

2016-10-27 Thread Deenar Toraskar
Hi, Has anyone tried running Spark 2.0 on HDP? I have managed to get around the issues with the timeline service (by turning it off), but now am stuck because YARN cannot find org.apache.spark.deploy.yarn.ExecutorLauncher. Error: Could not find or load main class

RE: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Mendelson, Assaf
PM To: Antoaneta Marinova Cc: user Subject: Re: Spark 2.0 - DataFrames vs Dataset performance Hi Antoaneta, I believe the difference is not due to Datasets being slower (DataFrames are just an alias to Datasets now), but rather using a user defined function for filtering vs using Spark builtins
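A small illustration of the distinction drawn above (dataset and field names are assumptions): the lambda form runs an opaque Scala closure per deserialized object, while the Column form stays inside Spark's optimized execution.

```scala
import org.apache.spark.sql.functions.col

// typed API: a closure Catalyst cannot inspect or push down
val viaLambda = events.filter(e => e.eventType == "click")

// builtin expression: eligible for codegen, pruning, and pushdown
val viaColumn = events.filter(col("eventType") === "click")
```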

Re: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Daniel Darabos
I'm wrong. On Mon, Oct 24, 2016 at 2:50 PM, Antoaneta Marinova < antoaneta.vmarin...@gmail.com> wrote: > Hello, > > I am using Spark 2.0 for performing filtering, grouping and counting > operations on events data saved in parquet files. As the events schema has > very neste

Accessing Phoenix table from Spark 2.0., any cure!

2016-10-24 Thread Mich Talebzadeh
My stack is this Spark: Spark 2.0.0 Zookeeper: ZooKeeper 3.4.6 Hbase: hbase-1.2.3 Phoenix: apache-phoenix-4.8.1-HBase-1.2-bin I am running this simple code scala> val df = sqlContext.load("org.apache.phoenix.spark", | Map("table" -> "MARKETDATAHBASE", "zkUrl" -> "rhes564:2181") | ) ja

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
What matters in this case is how many vcores YARN thinks it can allocate per machine. I think the relevant setting is yarn.nodemanager.resource.cpu-vcores. I bet you'll find this is actually more than the machine's number of cores, possibly on purpose, to enable some over-committing. On Mon, Oct 2

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
If you're really sure that 4 executors are on 1 machine, then it means your resource manager allowed it. What are you using, YARN? check that you really are limited to 40 cores per machine in the YARN config. On Mon, Oct 24, 2016 at 3:33 PM TheGeorge1918 . wrote: > Hi all, > > I'm deeply confuse

Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Antoaneta Marinova
Hello, I am using Spark 2.0 for performing filtering, grouping and counting operations on events data saved in parquet files. As the events schema has very nested structure I wanted to read them as scala beans to simplify the code but I noticed a severe performance degradation. Below you can find

Using a Custom Data Store with Spark 2.0

2016-10-24 Thread Sachith Withana
Hi all, I have a requirement to integrate a custom data store to be used with Spark ( v2.0.1). It consists of structured data in tables along with the schemas. Then I want to run SparkSQL queries on the data and provide the data back to the data service. I'm wondering what would be the best way

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-21 Thread Cody Koeninger
; > Again, kafkacat is running fine on the same node. >> >> >> >> > >> >> >> >> > 16/09/07 16:00:00 INFO Executor: Running task 1.0 in stage >> >> >> >> > 138.0 >> >> >> >> > (TID >>

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-21 Thread Srikanth
> 2 > >> >> >> > offsets 57079162 -> 57090330 > >> >> >> > 16/09/07 16:00:00 INFO KafkaRDD: Computing topic mt_event, > >> >> >> > partition > >> >> >> > 0 > >> >> >> > offs

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-20 Thread Cody Koeninger
Finished task 3.0 in stage 138.0 >> >> >> > (TID >> >> >> > 7851). 1030 bytes result sent to driver >> >> >> > 16/09/07 16:00:02 ERROR Executor: Exception in task 1.0 in stage >> >> >> > 138.0 >> &

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-20 Thread Srikanth
afkaConsumer.scala:74) > >> >> > at > >> >> > > >> >> > > >> >> > org.apache.spark.streaming.kafka010.KafkaRDD$ > KafkaRDDIterator.next(KafkaRDD.scala:227) > >> >> > at > >> >> > &g

Re: Mlib RandomForest (Spark 2.0) predict a single vector

2016-10-20 Thread jglov
I would also like to know if there is a way to predict a single vector with the new spark.ml API, although in my case it's because I want to do this within a map() to avoid calling groupByKey() after a flatMap(): *Current code (pyspark):* % Given 'model', 'rdd', and a function 'split_element' tha

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-19 Thread Cody Koeninger
/07 16:00:02 INFO CoarseGrainedExecutorBackend: Got assigned >> >> > task >> >> > 7854 >> >> > 16/09/07 16:00:02 INFO Executor: Running task 1.1 in stage 138.0 (TID >> >> > 7854) >> >> > 16/09/07 16:00:02 INFO KafkaRDD: Compu

Re: Spark 2.0 with Kafka 0.10 exception

2016-10-19 Thread Srikanth
gt; > > >> > 16/09/07 16:00:03 INFO Executor: Finished task 1.1 in stage 138.0 (TID > >> > 7854). 1103 bytes result sent to driver > >> > > >> > > >> > > >> > On Wed, Aug 24, 2016 at 2:13 PM, Srikanth > wrote: > >> >

DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread shankinson
Hi, We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (The default HDFS location). I've

DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread Stephen Hankinson
Hi, We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (The default HDFS location). I've

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-12 Thread Paul Stewart
Hi all, I am using Spark 2.0 to read a CSV file into a Dataset in Java. This works fine if i define the StructType with the StructField array ordered by hand. What I would like to do is use a bean class for both the schema and Dataset row type. For example, Dataset beanDS = spark.read

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-11 Thread Cody Koeninger
to work on a PR to update the java examples in the >> docs for the 0-10 version, I'm happy to help. >> >> On Mon, Oct 10, 2016 at 10:34 AM, static-max >> wrote: >> > Hi, >> > >> > by following this article I managed to consume messages fro

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-11 Thread static-max
nyone wants to work on a PR to update the java examples in the > docs for the 0-10 version, I'm happy to help. > > On Mon, Oct 10, 2016 at 10:34 AM, static-max > wrote: > > Hi, > > > > by following this article I managed to consume messages from Kafka 0.10 > in

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread Cody Koeninger
the java examples in the docs for the 0-10 version, I'm happy to help. On Mon, Oct 10, 2016 at 10:34 AM, static-max wrote: > Hi, > > by following this article I managed to consume messages from Kafka 0.10 in > Spark 2.0: > http://spark.apache.org/docs/latest/streaming-kafka-0

Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-10 Thread static-max
Hi, by following this article I managed to consume messages from Kafka 0.10 in Spark 2.0: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html However, the Java examples are missing and I would like to commit the offset myself after processing the RDD. Does anybody have a
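The Scala shape of the commit-after-processing pattern (the Java version mirrors it via casts to HasOffsetRanges and CanCommitOffsets); `stream` is the direct stream created elsewhere:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the RDD first ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```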

Apply UDF to SparseVector column in spark 2.0

2016-10-08 Thread abby
Hi, I am trying to apply a UDF to a column in a PySpark df containing SparseVectors (created using pyspark.ml.feature.IDF). Originally, I was trying to apply a more involved function, but am getting the same error with any application of a function. So for the sake of an example: udfSum = udf(lam
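The thread concerns PySpark; for what it's worth, a Scala sketch of the same operation, applying a UDF over an ml Vector column (DataFrame and column names are assumptions):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// The IDF output column holds (possibly sparse) ml Vectors; sum their entries.
val sumUdf = udf((v: Vector) => v.toArray.sum)
val withSums = df.withColumn("tfidf_sum", sumUdf(col("features")))
```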

Fw: Issue with Spark Streaming with checkpointing in Spark 2.0

2016-10-07 Thread Arijit
Resending; not sure if I had sent to user@spark.apache.org earlier. Thanks, Arijit From: Arijit Sent: Friday, October 7, 2016 6:06 PM To: user@spark.apache.org Subject: Issue with Spark Streaming with checkpointing in Spark 2.0 In a Spark Streaming sample

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-07 Thread Paul Stewart
When using the Encoder(Bean.class).schema() method to generate the StructType array of StructFields the class attributes are returned as a sorted list and not in the defined order within the Bean.class. This makes the schema unusable for reading from a CSV file where the ordering of the attribute

Re: Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Jakob Odersky
e, Oct 4, 2016 at 10:33 AM, Liren Ding wrote: > I just upgrade from Spark 1.6.1 to 2.0, and got an java compile error: > error: cannot access DeveloperApi > class file for org.apache.spark.annotation.DeveloperApi not found > > From the Spark 2.0 document > (https://spark.a

Re: Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Sean Owen
rApi not found* > > From the Spark 2.0 document ( > https://spark.apache.org/docs/2.0.0/api/java/overview-summary.html), the > package org.apache.spark.annotation is removed. Does anyone know if it's > moved to another package? Or how to call developerAPI with absence of the > annotation? Thanks. > > Cheers, > Liren > > >

Package org.apache.spark.annotation no longer exist in Spark 2.0?

2016-10-04 Thread Liren Ding
I just upgraded from Spark 1.6.1 to 2.0, and got a Java compile error: *error: cannot access DeveloperApi* * class file for org.apache.spark.annotation.DeveloperApi not found* From the Spark 2.0 document ( https://spark.apache.org/docs/2.0.0/api/java/overview-summary.html), the pack

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-02 Thread Marco Mistroni
thing's funny about your compile server. > It's not required anyway. > > On Sat, Oct 1, 2016 at 3:24 PM, Marco Mistroni > wrote: > > Hi guys > > sorry to annoy you on this but i am getting nowhere. So far i have > tried to > > build spark 2.0 on my l

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Sean Owen
"Compile failed via zinc server" Try shutting down zinc. Something's funny about your compile server. It's not required anyway. On Sat, Oct 1, 2016 at 3:24 PM, Marco Mistroni wrote: > Hi guys > sorry to annoy you on this but i am getting nowhere. So far i have tried

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-10-01 Thread Marco Mistroni
Hi guys, sorry to annoy you on this but I am getting nowhere. So far I have tried to build Spark 2.0 on my local laptop with no success, so I blamed my laptop's poor performance. So today I fired off an EC2 Ubuntu 16.06 instance and installed the following (I copy-paste the commands here) ubuntu@ip
