Swift question regarding in-memory snapshots of compact table data

2016-11-09 Thread Daniel Schulz
Daniel Schulz has shared a OneDrive file with you. To view it, click the link below.

RE: how to merge dataframe write output files

2016-11-09 Thread Shreya Agarwal
Is there a reason you want to merge the files? The reason you are getting errors (afaik) is that when you coalesce and then write, you force all the content to reside on one executor, and the size of the data exceeds the memory you have for storage in your executor, hence
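
A commonly suggested alternative, shown here as a minimal sketch (the paths and the partition count are hypothetical), is to repartition to a small number of partitions greater than one before writing, so the write is spread over several executors instead of one:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-output-sketch").getOrCreate()
    val df = spark.read.parquet("/data/input")  // hypothetical input path

    // Redistributing to a handful of partitions keeps the files large without
    // forcing the whole dataset through a single executor as coalesce(1) does.
    df.repartition(8).write.parquet("/data/output")  // hypothetical output path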

Akka Stream as the source for Spark Streaming. Please advise...

2016-11-09 Thread shyla deshpande
I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra with Structured Streaming, but the Kafka source support for Structured Streaming is not yet available. So now I am trying to use Akka Stream as the source for Spark Streaming. Want to make sure I
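
For reference, Spark Streaming's custom Receiver API is one way to push records from an external system, such as the materialized end of an Akka Stream, into a DStream. A minimal skeleton, assuming records arrive as strings (the Akka wiring itself is omitted):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class AkkaBridgeReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      // onStart must return quickly: start a thread (or attach an Akka Stream
      // sink) that calls store() for every record it receives.
      def onStart(): Unit = {
        new Thread("akka-bridge") {
          override def run(): Unit = receiveRecords()
        }.start()
      }

      def onStop(): Unit = {}  // shut down the consuming thread / stream here

      private def receiveRecords(): Unit = {
        while (!isStopped()) {
          store("record")  // placeholder: push each incoming element here
        }
      }
    }

The receiver is then plugged in with ssc.receiverStream(new AkkaBridgeReceiver), where ssc is the StreamingContext.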

how to merge dataframe write output files

2016-11-09 Thread lk_spark
hi all: when I call the API df.write.parquet, there are a lot of small files: how can I merge them into one file? I tried df.coalesce(1).write.parquet, but it sometimes gets the error Container exited with a non-zero exit code 143, more and more... -rw-r--r-- 2 hadoop supergroup 14.5 K

Unable to launch Python Web Application on Spark Cluster

2016-11-09 Thread anjali gautam
Hello Everyone, I have developed a web application (say abc) in Python using web.py. I want to deploy it on the Spark cluster. Since this application is a project with dependencies, I have made a zip file of the project (abc.zip) to be deployed on the cluster. In the project abc I have

Hive Queries are running very slowly in Spark 2.0

2016-11-09 Thread Jaya Shankar Vadisela
Hi all, I have the simple Hive query below. We have a use case where we run multiple Hive queries in parallel, in our case 16 (the number of cores on our machine), using a Scala parallel (PAR) array. In Spark 1.6 it executes in 10 seconds, but in Spark 2.0 the same queries take 5 minutes. "select * from emp
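
For context, the parallel pattern described looks roughly like this (a sketch; the query list is hypothetical, and spark is assumed to be a SparkSession, or a HiveContext on 1.6):

    // A Scala parallel collection runs the closures concurrently on a
    // fork-join pool, so 16 queries are submitted from 16 threads at once.
    val queries = (1 to 16).map(i => s"select * from emp where dept_id = $i")
    queries.par.foreach { q =>
      spark.sql(q).collect()
    }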

Re: Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-09 Thread Ajay Chander
Hi Everyone, I am still trying to figure this one out. I am stuck with this error "java.io.IOException: Can't get Master Kerberos principal for use as renewer ". Below is my code. Can any of you please provide any insights on this? Thanks for your time. import java.io.{BufferedInputStream,
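
For this class of error, a step that is often missing is an explicit keytab login before any Hadoop or Spark context is created, so the client has Kerberos credentials to act as the renewer. A sketch, with a hypothetical principal and keytab path, assuming the remote cluster's core-site.xml, hdfs-site.xml and yarn-site.xml are on the classpath:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)

    // Log in from a keytab so delegation tokens can be obtained and renewed.
    UserGroupInformation.loginUserFromKeytab(
      "user@EXAMPLE.COM",                   // hypothetical principal
      "/etc/security/keytabs/user.keytab")  // hypothetical keytab path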

Re: Issue Running sparkR on YARN

2016-11-09 Thread Felix Cheung
It may be that the Spark executor is running as a different user and can't see where Rscript is? You might want to try adding the Rscript path to PATH. Also please see this page for the config property to set for the R command to use: https://spark.apache.org/docs/latest/configuration.html#sparkr

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Michael Segel
Oozie, a product only a mad Russian would love. ;-) Just say no to Hive. Go from flat files to Parquet. (This sounds easy, but there’s some work that has to occur…) Sorry for being cryptic; Mich’s question is pretty much generic for anyone building a data lake, so it ends up overlapping with some work

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Thanks guys. Sounds like we let Informatica get the data out of the RDBMS and create mappings to flat files that will be delivered to a directory visible to the HDFS host, then push the CSV files into HDFS. From there, there are a number of options to work on: 1. run cron or Oozie to get data out of HDFS (or
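
Once the CSV files are in HDFS, the flat-to-Parquet step can be a small Spark job. A sketch using the Spark 2.x CSV reader; the paths and options are hypothetical:

    // Assumes a SparkSession named spark is in scope.
    val df = spark.read
      .option("header", "true")       // assuming Informatica writes a header row
      .option("inferSchema", "true")
      .csv("hdfs:///landing/informatica/")

    df.write.mode("overwrite").parquet("hdfs:///lake/staging/")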

How to interpret the Time Line on "Details for Stage" Spark UI page

2016-11-09 Thread Xiaoye Sun
Hi, I am using Spark 1.6.1, and I am looking at the Event Timeline on the "Details for Stage" Spark UI web page in detail. I found that the "scheduler delay" on the event timeline is somehow misrepresented. I want to confirm whether my understanding is correct. Here is the detailed description: In Spark's

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
It would be great if you could try with the 2.0.2 RC. Thanks for creating an issue. On Wed, Nov 9, 2016 at 1:22 PM, Raviteja Lokineni < raviteja.lokin...@gmail.com> wrote: > Well I've tried with 1.5.2, 1.6.2 and 2.0.1 > > FYI, I have created https://issues.apache.org/jira/browse/SPARK-18388 > >

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Jörn Franke
Basically you mention the options. However, there are several ways Informatica can extract data from (or write it to) an RDBMS. If the native option is not available, then you need to go via JDBC as you have described. Alternatively (but only if it is worth it) you can schedule fetching of the files

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Raviteja Lokineni
Well, I've tried with 1.5.2, 1.6.2 and 2.0.1. FYI, I have created https://issues.apache.org/jira/browse/SPARK-18388 On Wed, Nov 9, 2016 at 3:08 PM, Michael Armbrust wrote: > Which version of Spark? Does seem like a bug. > > On Wed, Nov 9, 2016 at 10:06 AM, Raviteja

Issue Running sparkR on YARN

2016-11-09 Thread Ian.Maloney
Hi, I’m trying to run sparkR (1.5.2) on YARN and I get: java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory This strikes me as odd, because I can go to each node as various users, type Rscript, and it works. I’ve done this on each node, and spark-env.sh as

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Michael Armbrust
Which version of Spark? Does seem like a bug. On Wed, Nov 9, 2016 at 10:06 AM, Raviteja Lokineni < raviteja.lokin...@gmail.com> wrote: > Does this stacktrace look like a bug guys? Definitely seems like one to me. > > Caused by: java.lang.StackOverflowError > at

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread ayan guha
Yes, it can be done and is a standard practice. I would suggest a mixed approach: use Informatica to create files in HDFS and have Hive staging tables as external tables on those directories. From that point onwards use Spark. Hth Ayan On 10 Nov 2016 04:00, "Mich Talebzadeh"
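
The external-staging-table part of that approach can look like this; a sketch assuming a Hive-enabled SparkSession, with a hypothetical table name, schema, and location:

    // An external table just points at the directory Informatica populates;
    // dropping the table later leaves the files in place.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS staging_emp (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'hdfs:///landing/informatica/emp'
    """)

    val emp = spark.table("staging_emp")  // read the staged data with Spark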

Re: Aggregations on every column on dataframe causing StackOverflowError

2016-11-09 Thread Raviteja Lokineni
Does this stacktrace look like a bug guys? Definitely seems like one to me. Caused by: java.lang.StackOverflowError at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:195) at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:195)
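
For context, a typical way to build an aggregation over every column looks like this (a sketch; the actual code from the thread is not shown, and df is a hypothetical wide DataFrame):

    import org.apache.spark.sql.functions._

    // One aggregate expression per column; with very wide tables the plan
    // gets large, which is where the thread reports the stack overflowing.
    val exprs = df.columns.map(c => sum(col(c)).as(s"sum_$c"))
    val summary = df.agg(exprs.head, exprs.tail: _*)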

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-09 Thread coolgar
Solution provided by Cody K: I may be misunderstanding, but you need to take each Kafka message and turn it into multiple items in the transformed RDD? So something like (pseudocode): stream.flatMap { message => val items = new ArrayBuffer var parser = null message.split("\n").foreach {
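
A runnable version of that idea might look like the following; parseLine is a hypothetical stand-in for whatever per-format parsing logic is needed, and stream is assumed to be the DStream[String] from the Kafka source:

    // Hypothetical parser: keep non-empty lines, dropping anything unparseable.
    def parseLine(line: String): Option[String] =
      if (line.trim.nonEmpty) Some(line.trim) else None

    // One Kafka message fans out into one element per successfully parsed line.
    val items = stream.flatMap { message =>
      message.split("\n").toSeq.flatMap(parseLine)
    }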

Re: Save a spark RDD to disk

2016-11-09 Thread Michael Segel
Can you increase the number of partitions and also increase the number of executors? (This should improve the parallelization, but you may become disk I/O bound.) On Nov 8, 2016, at 4:08 PM, Elf Of Lothlorein wrote: Hi I am trying to save an RDD

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Thanks Mike for the insight. This is a request that landed on us, which is rather unusual. As I understand it, Informatica is an ETL tool. Most of these are glorified Sqoop with a GUI where you define your source and target. On a normal day Informatica takes data out of an RDBMS table, like Oracle, and lands it

Re: Physical plan for windows and joins - how to know which is faster?

2016-11-09 Thread Silvio Fiorito
Hi Jacek, I haven't played with 2.1.0 yet, so I'm not sure how much more optimized Window functions are compared to 1.6 and 2.0. However, one thing I do see in the self-join is a broadcast. So there's going to be a need to broadcast the results of the groupBy out to the executors before it can do

How to impersonate a user from a Spark program

2016-11-09 Thread Samy Dindane
Hi, In order to impersonate a user when submitting a job with `spark-submit`, the `proxy-user` option is used. Is there a similar feature when running a job inside a Scala program? Maybe by specifying some configuration value? Thanks. Samy
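
One workaround that is often suggested for doing this programmatically is Hadoop's UserGroupInformation API, assuming the cluster is configured to allow impersonation (the hadoop.proxyuser.* settings). A sketch with a hypothetical user name; it is not equivalent to --proxy-user in every respect:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.sql.SparkSession

    val proxyUgi = UserGroupInformation.createProxyUser(
      "etl_user",                          // hypothetical user to impersonate
      UserGroupInformation.getLoginUser)   // the real, authenticated user

    proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // Work done inside doAs hits the cluster as etl_user.
        val spark = SparkSession.builder().appName("impersonation-sketch").getOrCreate()
        spark.read.textFile("/user/etl_user/input").count()  // hypothetical path
        spark.stop()
      }
    })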

Re: javac - No such file or directory

2016-11-09 Thread Sonal Goyal
It looks to be an issue with the Java compiler. Is the JDK set up correctly? Please check your Java installation. Thanks, Sonal Nube Technologies On Wed, Nov 9, 2016 at 7:13 PM, Andrew Holway < andrew.hol...@otternetworks.de>

importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Mich Talebzadeh
Hi, I am exploring the idea of flexibly importing multiple RDBMS tables into HDFS using the Informatica installation the customer already has. I don't want to use Informatica's connectivity tools to Hive etc. So this is what I have in mind: 1. If possible, get the table data out using Informatica and

javac - No such file or directory

2016-11-09 Thread Andrew Holway
I'm getting this error trying to build Spark on CentOS 7. It is not googling very well: [error] (tags/compile:compileIncremental) java.io.IOException: Cannot run program "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/bin/javac" (in directory "/home/spark/spark"): error=2, No such

Physical plan for windows and joins - how to know which is faster?

2016-11-09 Thread Jacek Laskowski
Hi, While playing around with Spark 2.1.0-SNAPSHOT (built today) and explain'ing two queries with WindowSpec and inner join I found the following plans and am wondering if you could help me to judge which query could be faster. What else would you ask for to be able to answer the question of one
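
For readers without the plans in front of them, the two formulations usually compared here look like this; a generic sketch with a hypothetical schema and DataFrame df, not the actual queries from the thread:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Formulation 1: a window function computes the aggregate in place.
    val w = Window.partitionBy("id")
    val viaWindow = df.withColumn("max_v", max("v").over(w))

    // Formulation 2: groupBy plus a self-join; its plan can include a
    // broadcast of the aggregated side, as noted in Silvio's reply.
    val maxes = df.groupBy("id").agg(max("v").as("max_v"))
    val viaJoin = df.join(maxes, "id")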

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-09 Thread Carlo . Allocca
Hi Masood, Thanks for the answer. Sure, I will do as suggested. Many Thanks, Best Regards, Carlo On 8 Nov 2016, at 17:19, Masood Krohy wrote:

Application config management

2016-11-09 Thread Erwan ALLAIN
Hi everyone, I'd like to know what kind of configuration mechanism is used in general? Below is what I'm going to implement, but I'd like to know if there is any "standard way": 1) put configuration in HDFS 2) specify extraJavaOptions (driver and worker) with the HDFS url (
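
A sketch of the read-it-back half of that scheme, assuming a Java-properties file in HDFS and a SparkSession named spark (the path and keys are hypothetical):

    import java.util.Properties
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Read the properties file through the Hadoop FileSystem API.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val in = fs.open(new Path("hdfs:///config/app.properties"))
    val props = new Properties()
    try props.load(in) finally in.close()

    val batchSize = props.getProperty("app.batch.size", "100").toInt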

Re: installing spark-jobserver on cdh 5.7 and yarn

2016-11-09 Thread Noorul Islam K M
Reza zade writes: > Hi > > I have set up a Cloudera cluster and work with Spark. I want to install > spark-jobserver on it. What should I do? Maybe you should send this to the spark-jobserver mailing list. https://github.com/spark-jobserver/spark-jobserver#contact Thanks and

installing spark-jobserver on cdh 5.7 and yarn

2016-11-09 Thread Reza zade
Hi, I have set up a Cloudera cluster and work with Spark. I want to install spark-jobserver on it. What should I do?

Spark streaming delays spikes

2016-11-09 Thread Shlomi.b
Hi, We are using Spark Streaming version 1.6.2 and came across some weird behavior. Our system pulls log event data from Flume servers, enriches the events, and saves them to ES. We are using a window interval of 15 seconds, and the rate at peak hours is around 70K events. The average time to process the