[TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-28 Thread Palash Gupta
Hi Apache Spark User team, Greetings! I started developing an application using Apache Hadoop and Spark in Python. My PySpark application randomly terminated saying "Failed to get broadcast_1*", and I have been searching for suggestions and support on Stack Overflow at Failed to get

Re: Spark streaming with Yarn: executors not fully utilized

2016-12-28 Thread Nishant Kumar
Any update on this, guys? On Wed, Dec 28, 2016 at 10:19 AM, Nishant Kumar wrote: > I have updated my question: > > http://stackoverflow.com/questions/41345552/spark-streaming-with-yarn-executors-not-fully-utilized > > On Wed, Dec 28, 2016 at 9:49 AM, Nishant Kumar

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Miguel Morales
Hi, Not sure about Spring Boot, but when trying to use DI libraries you'll run into serialization issues. I've had luck using an old version of Scaldi. Recently, though, I've been passing the class types as arguments with default values; then, in the Spark code, it gets instantiated. So you're
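One reading of this pattern, as a hedged Scala sketch with hypothetical names (Analyzer, ProdAnalyzer, MyJob) rather than code from the thread: pass a factory with a default value and instantiate it inside the Spark code, so no pre-built object is captured by the closure.

    // Hypothetical sketch: behaviour is injected by passing a factory with a default value
    // and instantiating it inside the job, instead of serializing a pre-built object.
    trait Analyzer {
      def analyze(record: String): String
    }
    class ProdAnalyzer extends Analyzer {
      def analyze(record: String): String = record.toUpperCase
    }
    class TestAnalyzer extends Analyzer {
      def analyze(record: String): String = record   // no-op stand-in for tests
    }

    class MyJob(makeAnalyzer: () => Analyzer = () => new ProdAnalyzer) {
      def run(spark: org.apache.spark.sql.SparkSession, input: Seq[String]): Array[String] = {
        val factory = makeAnalyzer            // local copy, so the closure does not capture MyJob
        spark.sparkContext.parallelize(input)
          .mapPartitions { it =>
            val analyzer = factory()           // instantiated on the executor, once per partition;
            it.map(analyzer.analyze)           // the instance never crosses the wire
          }
          .collect()
      }
    }

Tests would call new MyJob(() => new TestAnalyzer).run(spark, data), while production code keeps the default.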

Invert large matrix

2016-12-28 Thread Yanwei Wayne Zhang
Hi all, I have a matrix X stored as RDD[SparseVector] that is high dimensional, say 800 million rows and 2 million columns, and more than 95% of the entries are zero. Is there a way to invert (X'X + eye) efficiently, where X' is the transpose of X and eye is the identity matrix? I am thinking of
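For context, X'X here would be 2 million x 2 million, far too large to materialize or invert directly; at that scale an iterative solver built on distributed X'Xv products is the usual route. For much smaller column counts, a common sketch (assuming an RDD[Vector]; MLlib caps computeGramianMatrix at roughly 65,535 columns) is to compute the Gramian on the cluster and solve locally with Breeze:

    // Hedged sketch: only feasible when the number of columns d is small enough for a
    // local d x d matrix (RowMatrix.computeGramianMatrix requires d <= ~65,535), so it
    // does NOT apply directly to the 2-million-column case in the question.
    import breeze.linalg.{inv, DenseMatrix => BDM}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // An RDD[SparseVector] can be upcast with x.map(v => v: Vector).
    def ridgeInverse(x: RDD[Vector]): BDM[Double] = {
      val mat = new RowMatrix(x)
      val d = mat.numCols().toInt
      val gram = mat.computeGramianMatrix()           // X'X, computed distributedly, returned locally
      val xtx = new BDM[Double](d, d, gram.toArray)   // both representations are column-major
      inv(xtx + BDM.eye[Double](d))                   // (X'X + I)^-1, inverted on the driver
    }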

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
If you're using Kubernetes you can group Spark and HDFS to run in the same stack, meaning they'll basically run in the same network space and share IPs. Just gotta make sure there are no port conflicts. On Wed, Dec 28, 2016 at 5:07 AM, Karamba wrote: > > Good idea, thanks! > >

org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@20b7e9d2)

2016-12-28 Thread prayag
Hi Guys, I have a simple Spark job:

    df = spark.read.csv(fpath, header=True, inferSchema=False)

    def map_func(line):
        map_keys = tuple([line['key1']] + [line[k] for k in KEYS])
        return map_keys, line

    d = df.rdd.map(map_func).groupByKey()

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Lars Albertsson
Do you really need dependency injection? DI is often used for testing purposes. Data processing jobs, however, are easy to test without DI due to their functional and synchronous nature. Hence, DI is often unnecessary for testing data processing jobs, whether they are batch or streaming. Or

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Justin Miller
Interesting, because a bug that seemed to be fixed in 2.1.0-SNAPSHOT doesn't appear to be fixed in 2.1.0 stable (it centered around a null-pointer exception during code gen). It seems to be fixed in 2.1.1-SNAPSHOT, but I can try stable again. > On Dec 28, 2016, at 1:38 PM, Mark Hamstra

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
A SNAPSHOT build is not a stable artifact; rather, it floats to the top of the commits intended for the next release. So 2.1.1-SNAPSHOT comes after the 2.1.0 release and contains whatever code had been committed to the branch-2.1 maintenance branch at the time the artifact was built, and is,

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Koert Kuipers
ah yes you are right. i must not have fetched correctly earlier On Wed, Dec 28, 2016 at 2:53 PM, Mark Hamstra wrote: > The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0 > > On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers wrote: > >>

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Justin Miller
It looks like the jars for 2.1.0-SNAPSHOT are gone? https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.1.0-SNAPSHOT/ Also: 2.1.0-SNAPSHOT/

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Mark Hamstra
The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0 On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers wrote: > seems like the artifacts are on maven central but the website is not yet > updated. > > strangely the tag v2.1.0 is not yet available on github. i

Spark/Mesos with GPU support

2016-12-28 Thread Ji Yan
Dear Spark Users, Has anyone had successful experience running Spark on Mesos with GPU support? We have a Mesos cluster that can see and offer NVIDIA GPU resources. With Spark, it seems that the GPU support with Mesos (https://github.com/apache/spark/pull/14644
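For what it's worth, the linked PR (SPARK-16944) introduces a spark.mesos.gpus.max setting; a hedged sketch of how a job might request GPUs once that support is in your build (master URL and GPU count are placeholders, and the property name should be verified against the merged documentation):

    // Hedged sketch: request GPU resources from Mesos. The setting name comes from the
    // linked PR; verify it against the Spark build you are running.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("gpu-job")
      .setMaster("mesos://zk://zk-host:2181/mesos")   // placeholder master URL
      .set("spark.mesos.gpus.max", "4")               // placeholder GPU count
    val sc = new org.apache.spark.SparkContext(conf)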

Re: [PySpark - 1.6] - Avoid object serialization

2016-12-28 Thread Chawla,Sumit
Would this work for you?

    def processRDD(rdd):
        analyzer = ShortTextAnalyzer(root_dir)
        rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

    ssc.union(*streams).filter(lambda x: x[1] != None) \
        .foreachRDD(lambda rdd: processRDD(rdd))

Regards Sumit Chawla On Wed, Dec

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Koert Kuipers
seems like the artifacts are on maven central but the website is not yet updated. strangely the tag v2.1.0 is not yet available on github. i assume it's equal to v2.1.0-rc5 On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller < justin.mil...@protectwise.com> wrote: > I'm curious about this as well.

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
Resolved the above error by creating a SparkSession: val spark = SparkSession.builder().appName("Hbase - Spark POC").getOrCreate() But now calling the show() action on the DataFrame via spark.sql("SELECT * FROM student").show() throws the error below: scala> sqlContext.sql("select * from
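If the follow-up failure is a "table or view not found" style error, one common cause in Spark 2.x is querying a DataFrame that was never registered as a view; a minimal sketch, assuming a hypothetical studentDF built from the HBase data in the earlier mail:

    // Sketch: register the DataFrame as a temporary view before querying it through SQL.
    studentDF.createOrReplaceTempView("student")
    spark.sql("SELECT * FROM student").show()

In the stock 2.x spark-shell, sqlContext is spark.sqlContext, so a view registered once is visible to both.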

[PySpark - 1.6] - Avoid object serialization

2016-12-28 Thread Sidney Feiner
Hey, I just posted this question on Stack Overflow (link here) and decided to try my luck here as well :) I'm writing a PySpark job but I ran into some performance issues. Basically, all it does is

Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
Hello Spark Community, I am reading an HBase table from Spark and getting an RDD, but now I want to convert that RDD of Spark Rows to a DataFrame. *Source Code:* bin/spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf spark.hbase.host=127.0.0.1 import
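A minimal sketch of the RDD-to-DataFrame conversion being attempted, assuming a hypothetical two-column layout and a pair-shaped hbaseRdd (the real HBase column names are not in the excerpt):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical schema; replace with the actual HBase columns.
    val schema = StructType(Seq(
      StructField("rowkey", StringType, nullable = false),
      StructField("name",   StringType, nullable = true)
    ))

    // hbaseRdd: RDD[(String, String)] read via the spark-hbase-connector (assumed shape).
    val rowRdd = hbaseRdd.map { case (key, value) => Row(key, value) }
    val df = spark.createDataFrame(rowRdd, schema)
    df.createOrReplaceTempView("student")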

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Good idea, thanks! But unfortunately that's not possible. All containers are connected to an overlay network. Is there any other possibility to tell Spark that it is on the same *NODE* as an HDFS data node? On 28.12.2016 12:00, Miguel Morales wrote: > It might have to do with your container

Getting values list per partition

2016-12-28 Thread Rohit Verma
Hi, I am trying something like:

    final Dataset<String> df = spark.read().csv("src/main/resources/star2000.csv")
        .select("_c1").as(Encoders.STRING());
    final Dataset arrayListDataset = df.mapPartitions(new MapPartitionsFunction() {
        @Override
        public Iterator
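The same idea as a hedged Scala sketch: one array of "_c1" values per partition, collected via mapPartitions (file path and column name are taken from the message; everything else is illustrative):

    import spark.implicits._   // assumes `spark` is the active SparkSession

    val values = spark.read.csv("src/main/resources/star2000.csv")
      .select("_c1").as[String]

    // One Array[String] per partition, holding that partition's values.
    val perPartition: Array[Array[String]] =
      values.mapPartitions(it => Iterator(it.toArray)).collect()

    // Equivalent, bypassing Dataset encoders: values.rdd.glom().collect()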

Re: Apache Hive with Spark Configuration

2016-12-28 Thread Gourav Sengupta
Hi, I think that you can configure the Hive metastore version in Spark. Regards, Gourav On Wed, Dec 28, 2016 at 12:22 PM, Chetan Khatri wrote: > Hello Users / Developers, > > I am using Hive 2.0.1 with MySql as a Metastore, can you tell me which > version is
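The settings this refers to are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars; a hedged sketch (1.2.1 is an example value only, check the Spark 2.0.2 documentation for the range of metastore versions it actually supports):

    // Sketch: point Spark SQL at a specific Hive metastore version.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("hive-metastore-example")
      .config("spark.sql.hive.metastore.version", "1.2.1")   // example value only
      .config("spark.sql.hive.metastore.jars", "maven")      // or a classpath containing the Hive jars
      .enableHiveSupport()
      .getOrCreate()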

Apache Hive with Spark Configuration

2016-12-28 Thread Chetan Khatri
Hello Users / Developers, I am using Hive 2.0.1 with MySQL as the metastore; can you tell me which version is most compatible with Spark 2.0.2? Thanks

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
It might have to do with your container IPs; it depends on your networking setup. You might want to try host networking so that the containers share the IP with the host. On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote: > > Hi Sun Rui, > > thanks for answering! > > >> Although

Re: how to integrate Apache Kafka with spark ?

2016-12-28 Thread Tushar Adeshara
Please see the links below, depending on your version of Spark 2.x: http://spark.apache.org/docs/latest/streaming-kafka-integration.html (Spark Streaming + Kafka Integration Guide - Spark 2.0.2, spark.apache.org)
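For reference, a minimal direct-stream sketch against the kafka-0-10 integration that the guide above covers (broker address, group id, and topic name are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val conf = new SparkConf().setAppName("kafka-direct-example")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",             // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",                       // placeholder group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))   // "events" is a placeholder topic

    stream.map(_.value).count().print()
    ssc.start()
    ssc.awaitTermination()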

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Hi Sun Rui, thanks for answering! > Although the Spark task scheduler is aware of rack-level data locality, it > seems that only YARN implements the support for it. This explains why the script that I configured in core-site.xml (topology.script.file.name) is not called by the Spark

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-28 Thread Raju Bairishetti
Try setting the number of partitions to (number of executors * number of cores per executor) when writing to the destination location. Be very careful when setting the number of partitions, as an incorrect value can lead to an expensive shuffle. On Fri, Dec 16, 2016 at 12:56 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: >
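A minimal sketch of that suggestion (executor and core counts are placeholders; note that repartition() itself shuffles, whereas coalesce() can reduce the partition count without a full shuffle):

    // Placeholders: 10 executors x 4 cores each.
    val numPartitions = 10 * 4

    df.repartition(numPartitions)
      .write
      .mode("overwrite")
      .parquet("hdfs:///tmp/output")   // placeholder destination path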