Re: Spark Graphx with Database

2016-12-30 Thread Felix Cheung
You might want to check out GraphFrames - you can load database data (as a Spark DataFrame) and build graphs with it: https://github.com/graphframes/graphframes From: balaji9058 > Sent: Monday, December 26, 2016 9:27 PM
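
For reference, a minimal sketch (not from the thread) of the approach Felix describes: load vertex and edge tables over JDBC as DataFrames and wrap them in a GraphFrame. The JDBC URL, table names, and columns are illustrative assumptions; GraphFrames expects an "id" column on the vertices and "src"/"dst" columns on the edges.

    import org.apache.spark.sql.SparkSession
    import org.graphframes.GraphFrame

    val spark = SparkSession.builder.appName("graph-from-db").getOrCreate()

    // Hypothetical JDBC source; the matching JDBC driver must be on the classpath.
    val vertices = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost/graphdb")
      .option("dbtable", "people")          // must contain an "id" column
      .load()

    val edges = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost/graphdb")
      .option("dbtable", "relationships")   // must contain "src" and "dst" columns
      .load()

    val g = GraphFrame(vertices, edges)
    g.inDegrees.show()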

Re: RDD Location

2016-12-30 Thread Fei Hu
It would be much appreciated if you could give more details about why the runJob function cannot be called in getPreferredLocations(). In the NewHadoopRDD and HadoopRDD classes, they get the location information from the inputSplit. But there may be an issue in NewHadoopRDD, because it generates

Re: Difference in R and Spark Output

2016-12-30 Thread Felix Cheung
Could you elaborate more on the huge difference you are seeing? From: Saroj C Sent: Friday, December 30, 2016 5:12:04 AM To: User Subject: Difference in R and Spark Output Dear All, For the attached input file, there is a huge difference

Re: launch spark on mesos within a docker container

2016-12-30 Thread Timothy Chen
It seems like it's getting offer decline calls, which suggests it's getting the offer calls and was able to reply. Can you turn on TRACE logging in Spark with the Mesos coarse-grained scheduler and see if it says it is processing the offers? Tim On Fri, Dec 30, 2016 at 2:35 PM, Ji Yan

[ML] Converting ml.DenseVector to mllib.Vector

2016-12-30 Thread Jason Wolosonovich
Hello All, I'm working through the Data Science with Scala course on Big Data University, and it has not been updated to work with Spark 2.0, so I'm adapting the code as I work through it; however, I've finally run into something that is over my head. I'm new to Scala as well. When I run this code
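
For context, a minimal Spark 2.0 sketch (not from the thread) of converting the new ml vector type back to an mllib vector; the sample values are illustrative.

    import org.apache.spark.ml.linalg.{Vectors => MLVectors}
    import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}

    val mlVec = MLVectors.dense(1.0, 2.0, 3.0)          // ml.linalg.DenseVector

    // Spark 2.0 provides a direct converter on the old mllib API:
    val mllibVec: OldVector = OldVectors.fromML(mlVec)

    // Equivalent fallback via the raw values:
    val mllibVec2: OldVector = OldVectors.dense(mlVec.toArray)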

Re: RDD Location

2016-12-30 Thread Sun Rui
You can’t call runJob inside getPreferredLocations(). You can take a look at the source code of HadoopRDD to help you implement getPreferredLocations() appropriately. > On Dec 31, 2016, at 09:48, Fei Hu wrote: > > That is a good idea. > > I tried add the following code to
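​
For reference, a minimal sketch (not from the thread) of the pattern HadoopRDD follows: compute the location hints on the driver when the partitions are built, store them inside the partition objects, and simply read them back in getPreferredLocations() instead of calling runJob(). Class and field names here are illustrative.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type that carries its preferred hosts, computed once
    // on the driver when the partitions are built (e.g. from parent RDD metadata).
    class DataChunkPartition(val index: Int, val hosts: Seq[String]) extends Partition

    class DataChunkRDD(sc: SparkContext, partitionMeta: Array[(Int, Seq[String])])
      extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        partitionMeta.map { case (i, hosts) => new DataChunkPartition(i, hosts): Partition }

      // Same pattern as HadoopRDD: just read the stored hints; no job is run here.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        split.asInstanceOf[DataChunkPartition].hosts.filter(_ != "localhost")

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator.empty // placeholder for the real per-partition computation
    }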

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Thanks Felix, I will try it tomorrow ~~~sent from my cell phone, sorry if there is any typo On December 30, 2016 at 10:08 PM, "Felix Cheung" wrote: > Have you tried the spark-csv package? > > https://spark-packages.org/package/databricks/spark-csv > > > --

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Yes, I believe there should be a better way to handle my case. ~~~sent from my cell phone, sorry if there is any typo On December 30, 2016 at 10:09 PM, "write2sivakumar@gmail" wrote: Hi Raymond, Your problem is to pass those 100 fields to .toDF() method?? Sent from my Samsung

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread theodondre
You can use the StructType and StructField approach, or use the inferSchema approach. Sent from my T-Mobile 4G LTE Device Original message From: "write2sivakumar@gmail" Date: 12/30/16 10:08 PM (GMT-05:00) To: Raymond Xie
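
For reference, a minimal Spark 1.6 sketch (not from the thread) of the explicit-schema approach; the field names, types, and path are illustrative, and sqlContext is assumed to be an existing SQLContext (e.g. from spark-shell).

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Build the schema programmatically instead of listing every name in .toDF().
    val schema = StructType(Seq(
      StructField("Employee_ID", IntegerType, nullable = true),
      StructField("Employee_name", StringType, nullable = true)
      // ... remaining fields, or map over a Seq holding all the column names
    ))

    val employeeDF = sqlContext.read
      .format("com.databricks.spark.csv")   // spark-csv package for Spark 1.6
      .option("header", "false")
      .schema(schema)                        // declared types, no inference pass
      .load("hdfs:///data/Employee.csv")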

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Felix Cheung
Have you tried the spark-csv package? https://spark-packages.org/package/databricks/spark-csv From: Raymond Xie Sent: Friday, December 30, 2016 6:46:11 PM To: user@spark.apache.org Subject: How to load a big csv to dataframe in Spark 1.6
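
For reference, a minimal sketch (not from the thread) of reading with the spark-csv package on Spark 1.6; the path is illustrative and sc is an existing SparkContext.

    // Launch with: spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    val employeeDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // first line contains the column names
      .option("inferSchema", "true")   // extra pass over the file to guess types
      .load("hdfs:///data/Employee.csv")

    employeeDF.printSchema()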

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread write2sivakumar@gmail
Hi Raymond, Is your problem how to pass those 100 fields to the .toDF() method? Sent from my Samsung device Original message From: Raymond Xie Date: 31/12/2016 10:46 (GMT+08:00) To: user@spark.apache.org Subject: How to load a big csv to dataframe in

How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Raymond Xie
Hello, I see there is usually this way to load a CSV into a DataFrame: sqlContext = SQLContext(sc) Employee_rdd = sc.textFile("\..\Employee.csv") .map(lambda line: line.split(",")) Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name']) Employee_df.show() However in my

Re: Dependency Injection and Microservice development with Spark

2016-12-30 Thread Muthu Jayakumar
Adding to Lars Albertsson & Miguel Morales, I am hoping to see how well scalameta would branch down into support for macros that can do away with sizable DI problems, and, for the remainder, having a class type as args as Miguel Morales mentioned. Thanks, On Wed, Dec 28, 2016 at 6:41 PM, Miguel Morales

context.runJob() was suspended in getPreferredLocations() function

2016-12-30 Thread Fei Hu
Dear all, I tried to customize my own RDD. In the getPreferredLocations() function, I used the following code to query another RDD, which was used as an input to initialize this customized RDD: val results: Array[Array[DataChunkPartition]] = context.runJob(partitionsRDD,

Re: RDD Location

2016-12-30 Thread Fei Hu
That is a good idea. I tried adding the following code to the getPreferredLocations() function: val results: Array[Array[DataChunkPartition]] = context.runJob( partitionsRDD, (context: TaskContext, partIter: Iterator[DataChunkPartition]) => partIter.toArray, dd, allowLocal = true) But it

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Nicholas Hakobian
Yep, sequential joins is what I have done in the past with similar requirements. Splitting and merging DataFrames is most likely killing performance if you do not cache the DataFrame pre-split. If you do, it will compute the lineage prior to the cache statement once (at first invocation), then
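
For reference, a minimal sketch (not from the thread) of the caching point above; the input path and filter columns are assumptions, and sqlContext is an existing SQLContext. Caching before the split means the shared lineage is evaluated once rather than once per branch.

    val prepared = sqlContext.read.parquet("hdfs:///data/input").cache()

    val branchA = prepared.filter(prepared("kind") === "A")
    val branchB = prepared.filter(prepared("kind") === "B")

    // unionAll is the Spark 1.5/1.6 name (renamed to union in 2.0)
    val merged = branchA.unionAll(branchB)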

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
Thanks Nicholas. It looks like for some of my use cases, I might be able to do sequential joins and then use coalesce() (or in combination with withColumn(when()...)) to sort out the results. Splitting and merging dataframes seems to really kill my app performance. I'm not sure if

Re: launch spark on mesos within a docker container

2016-12-30 Thread Ji Yan
Thanks Timothy, Setting these four environment variables as you suggested has got Spark running: LIBPROCESS_ADVERTISE_IP= LIBPROCESS_ADVERTISE_PORT=40286 LIBPROCESS_IP=0.0.0.0 LIBPROCESS_PORT=40286 After that, it seems that Spark cannot accept any offer from Mesos. If I run the same script

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Marco Mistroni
Hi Palash, so you have a pyspark application running on Spark 2.0. You have Python scripts dropping files on HDFS, then you have two Spark jobs: - 1 loads expected-hour data (pls explain: how many files on average?) - 1 loads delayed data (pls explain: how many files on average?) Do these scripts run

Re: launch spark on mesos within a docker container

2016-12-30 Thread Timothy Chen
Hi Ji, One way to make it fixed is to set LIBPROCESS_PORT environment variable on the executor when it is launched. Tim > On Dec 30, 2016, at 1:23 PM, Ji Yan wrote: > > Dear Spark Users, > > We are trying to launch Spark on Mesos from within a docker container. We > have
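
For reference, a minimal sketch (not from the thread) of one way to pass that environment variable to the executors from the driver-side configuration; the master URL and port value are placeholders.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("spark-on-mesos-in-docker")
      .setMaster("mesos://zk://zk1:2181/mesos")            // placeholder Mesos master
      .set("spark.executorEnv.LIBPROCESS_PORT", "40286")   // fixed libprocess port on executors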

launch spark on mesos within a docker container

2016-12-30 Thread Ji Yan
Dear Spark Users, We are trying to launch Spark on Mesos from within a docker container. We have found that since the Spark executors need to talk back to the Spark driver, there is a need to do a lot of port mapping to make that happen. We seem to have mapped the ports based on what we could find from

Re: What's the best practice to load data from RDMS to Spark

2016-12-30 Thread Palash Gupta
Hi, If you want to load from CSV, you can use the procedure below. Of course you need to define the Spark context first. (The example given loads all CSVs under a folder; you can use a specific file name for a single file.) // these lines are equivalent in Spark 2.0 spark.read.format("csv").option("header",
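
The two equivalent Spark 2.0 forms Palash refers to, written out in full for reference; the path is illustrative and spark is an existing SparkSession (e.g. from spark-shell).

    val df1 = spark.read.format("csv").option("header", "true").load("hdfs:///data/csv/*.csv")
    val df2 = spark.read.option("header", "true").csv("hdfs:///data/csv/*.csv")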

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Nicholas Hakobian
It looks like Spark 1.5 has the coalesce function, which is like NVL, but a bit more flexible. From Ayan's example you should be able to use: coalesce(b.col, c.col, 'some default') If that doesn't have the flexibility you want, you can always use nested case or if statements, but it's just harder
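
For reference, a minimal sketch (not from the thread) of the coalesce pattern Nicholas describes on top of sequential left joins; the data, column names, and default are illustrative, and sc is an existing SparkContext.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{coalesce, col, lit}

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val base    = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "code")
    val lookup1 = Seq((1, "first")).toDF("id", "val1")
    val lookup2 = Seq((1, "one"), (2, "two")).toDF("id", "val2")

    // Sequential left joins, then coalesce picks the first non-null lookup value,
    // falling back to a literal default (NVL-style behaviour).
    val resolved = base
      .join(lookup1, base("id") === lookup1("id"), "left_outer")
      .join(lookup2, base("id") === lookup2("id"), "left_outer")
      .select(base("id"), col("code"),
        coalesce(col("val1"), col("val2"), lit("some default")).as("resolved"))

    resolved.show()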

What's the best practice to load data from RDMS to Spark

2016-12-30 Thread Raymond Xie
Hello, I am new to Spark. As a SQL developer, I have only taken some courses online and spent some time on my own, and never had a chance to work on a real project. I wonder what would be the best practice (tool, procedure...) to load data (csv, excel) into the Spark platform? Thank you. *Raymond*
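
For reference, a minimal Spark 1.6 sketch (not part of the replies) of reading directly from an RDBMS over JDBC; the URL, table, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    val ordersDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/sales")
      .option("dbtable", "orders")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    ordersDF.show(5)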

Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication? I am interested in this thread. 2016-12-30 11:02 GMT-08:00 Ji Yan : > Thanks Michael, Tim and I have touched base and thankfully the issue has > already been resolved > > On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt

Re: Spark/Mesos with GPU support

2016-12-30 Thread Ji Yan
Thanks Michael, Tim and I have touched base and thankfully the issue has already been resolved On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt wrote: > I've cc'd Tim and Kevin, who worked on GPU support. > > On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan

Broadcast destroy

2016-12-30 Thread bryan.jeffrey
All, If we are updating broadcast variables, do we need to manually destroy the replaced broadcast, or will they be automatically pruned? Thank you, Bryan Jeffrey Sent from my Windows 10 phone
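
For reference, a minimal sketch (not from the thread) of releasing a replaced broadcast explicitly; the lookup contents are illustrative and sc is an existing SparkContext. Old broadcasts are eventually cleaned up once they are garbage-collected on the driver, but releasing them explicitly frees the cached copies sooner.

    import org.apache.spark.broadcast.Broadcast

    var lookupBc: Broadcast[Map[String, Int]] = sc.broadcast(Map("a" -> 1))

    def refreshLookup(newLookup: Map[String, Int]): Unit = {
      val old = lookupBc
      lookupBc = sc.broadcast(newLookup)
      old.unpersist(blocking = false)   // drop cached copies on the executors
      // old.destroy()                  // stronger: also removes driver state, irreversible
    }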

Re: TallSkinnyQR

2016-12-30 Thread Sean Owen
There are no changes to Spark at all here. See my workaround below. On Fri, Dec 30, 2016, 17:18 Iman Mohtashemi wrote: > Hi guys, > Are your changes/bug fixes reflected in the Spark 2.1 release? > Iman > > On Dec 2, 2016 3:03 PM, "Iman Mohtashemi"

Re: Spark/Mesos with GPU support

2016-12-30 Thread Michael Gummelt
I've cc'd Tim and Kevin, who worked on GPU support. On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan wrote: > Dear Spark Users, > > Has anyone had successful experience running Spark on Mesos with GPU > support? We have a Mesos cluster that can see and offer nvidia GPU > resources. With

Re: TallSkinnyQR

2016-12-30 Thread Iman Mohtashemi
Hi guys, Are your changes/bug fixes reflected in the Spark 2.1 release? Iman On Dec 2, 2016 3:03 PM, "Iman Mohtashemi" wrote: > Thanks again! This is very helpful! > Best regards, > Iman > > On Dec 2, 2016 2:49 PM, "Huamin Li" <3eri...@gmail.com> wrote: > >> Hi Iman,

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
Thanks, but is nvl() in Spark 1.5? I can't find it in spark.sql.functions (http://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.functions$) Reading about the Oracle nvl function, it seems it is similar to the na functions. Not sure it will help though, because what I

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Palash Gupta
Hi Marco & Ayan, I now have a clearer idea about what Marco means by Reduce. I will do that to dig deeper. Let me answer your queries: When you see the broadcast errors, does your job terminate? Palash>> Yes, it terminated the app. Or are you assuming that something is wrong just because you see the

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Marco Mistroni
Correct. I mean reduce the functionality. Uhm, I realised I didn't ask you a fundamental question. When you see the broadcast errors, does your job terminate? Or are you assuming that something is wrong just because you see the message in the logs? Plus... wrt the logic: who writes the CSV? With what

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread ayan guha
@Palash: I think what Marco meant by "reduce functionality" is to reduce the scope of your application's functionality so that you can isolate the issue to certain part(s) of the app... I do not think he meant the "reduce" operation :) On Fri, Dec 30, 2016 at 9:26 PM, Palash Gupta <

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Palash Gupta
Hi Marco, All of your suggestions are highly appreciated, whatever you said so far. I will apply them in my code and let you know. Let me answer your query: What does your program do? Palash>> In each hour I am loading many CSV files and then I'm making some KPI(s) out of them.

Re: Spark Partitioning Strategy with Parquet

2016-12-30 Thread titli batali
Yeah, it works for me. Thanks On Fri, Nov 18, 2016 at 3:08 AM, ayan guha wrote: > Hi > > I think you can use the map-reduce paradigm here. Create a key using user ID > and date, and the record as a value. Then you can express your operation (do > something) part as a function. If
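
For reference, a minimal sketch (not from the thread) of the keying pattern ayan describes; the record layout and the per-group operation are illustrative, and sc is an existing SparkContext.

    case class Record(userId: String, date: String, payload: String)

    // Stand-in for the per-(user ID, date) "do something" operation.
    def doSomething(records: Iterable[Record]): Int = records.size

    val records = sc.parallelize(Seq(
      Record("u1", "2016-12-30", "a"),
      Record("u1", "2016-12-30", "b"),
      Record("u2", "2016-12-29", "c")
    ))

    val perUserDay = records
      .map(r => ((r.userId, r.date), r))   // composite key: (user ID, date)
      .groupByKey()
      .mapValues(doSomething)

    perUserDay.collect().foreach(println)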

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Palash Gupta
Hi Nicholas, Appreciated your response. I understand your articulated point and will implement it and let you know the status of the problem. Sample: // these lines are equivalent in Spark 2.0 spark.read.format("csv").option("header", "true").load("../Downloads/*.csv") spark.read.option("header",