Re: [BlockMatrix] multiply is an action or a transformation?

2017-08-20 Thread Yanbo Liang
BlockMatrix.multiply returns another BlockMatrix. Inside this function there are many RDD operations, but most of them are transformations. If you don't trigger an action to obtain the blocks (which is an RDD of ((Int, Int), Matrix)) of the result BlockMatrix, the job will not run. Thanks
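A minimal sketch of that laziness, assuming two existing BlockMatrix values a and b:

    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val c: BlockMatrix = a.multiply(b)  // transformations only; no job is submitted yet
    c.blocks.count()                    // action on the blocks RDD: now the multiply actually runs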

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Ah, I see. Then I would also check directly in Hive whether you have issues inserting data into the Hive table. Alternatively, you can try to register the df as a temp table and do an insert into the Hive table from the temp table using Spark SQL ("insert into table hivetable select * from temptable"). You
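A minimal sketch of that suggestion, assuming a SparkSession named spark and the hypothetical table names from the mail:

    df.createOrReplaceTempView("temptable")
    spark.sql("insert into table hivetable select * from temptable")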

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
Hi, I have created the Hive table in Impala first, with parquet as the storage format. With a dataframe from Spark I am trying to insert into the same table with the syntax below. The table is partitioned by year, month, day: ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
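For reference, writing into a partitioned Hive table with insertInto typically also needs dynamic partitioning enabled (a sketch, assuming Spark 2.x with Hive support; the Hive settings are not confirmed from the thread):

    import org.apache.spark.sql.SaveMode

    spark.sql("set hive.exec.dynamic.partition=true")
    spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
    ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")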

Re: Spark hive overwrite is very very slow

2017-08-20 Thread ayan guha
Just curious - is your dataset partitioned on your partition columns? On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > We are in cloudera CDH5.10 and we are using spark 2 that comes with > cloudera. > > Coming to second solution, creating a temporary view
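If it is not, repartitioning on the partition columns before the write is one common way to cut the number of small output files per partition (a sketch; column and table names are taken from the thread):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    ds.repartition(col("year"), col("month"), col("day"))
      .write.mode(SaveMode.Overwrite)
      .insertInto("db.parqut_table")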

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
We are using parquet tables; is that causing any performance issue? On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke wrote: > Improving the performance of Hive can be also done by switching to > Tez+llap as an engine. > Aside from this : you need to check what is the default

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
Yes, we tried Hive and want to migrate to Spark for better performance. I am using parquet tables. Still no better performance while loading. Sent from my iPhone > On Aug 20, 2017, at 2:24 AM, Jörn Franke wrote: > > Have you tried directly in Hive how the performance

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Improving the performance of Hive can also be done by switching to Tez+LLAP as the engine. Aside from this: you need to check what default format it writes to Hive. One issue for the slow write into a Hive table could be that it writes by default to csv/gzip or csv/bzip2. > On 20.
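One quick way to inspect the actual storage format of the table from Spark (a sketch; the table name is hypothetical):

    spark.sql("describe formatted db.hivetable").show(100, false)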

Question on how to get appended data from structured streaming

2017-08-20 Thread Yanpeng Lin
Hello, I am new to Spark. It would be appreciated if anyone could help me understand how to get appended data from structured streaming. According to the documentation, a data stream can be treated as new

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
We are on Cloudera CDH 5.10 and we are using the Spark 2 that comes with Cloudera. Coming to the second solution, creating a temporary view on the dataframe didn't improve my performance either. I do remember performance was very fast when overwriting the whole table without partitions, but the problem

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
I tried all the approaches. 1. Partitioned by year, month, day on the Hive table with parquet format when the table is created in Impala. 2. The dataset from Hive is not partitioned; used insert overwrite hivePartitonedTable partition(year,month,day) select * from tempViewOFDataset. Also tried
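A sketch of approach 2 with the HiveQL spelled out (identifiers as given in the thread; note the table keyword in the standard syntax):

    ds.createOrReplaceTempView("tempViewOFDataset")
    spark.sql("""
      insert overwrite table hivePartitonedTable partition(year, month, day)
      select * from tempViewOFDataset
    """)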

Re: Huber regression in PySpark?

2017-08-20 Thread Yanbo Liang
Hi Jeff, Actually I have had an implementation of robust regression with huber loss for a long time (https://github.com/apache/spark/pull/14326). It is a fairly straightforward port of scikit-learn's HuberRegressor. The PR makes huber regression a separate Estimator, and we found it can be
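For reference, the standard huber loss that both implementations target, for residual r and threshold \delta:

    L_\delta(r) =
      \begin{cases}
        \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\
        \delta \left( |r| - \tfrac{1}{2}\delta \right) & \text{otherwise}
      \end{cases}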

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you made sure that saveAsTable stores them as parquet? > On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed > wrote: > > we are using parquet tables, is it causing any performance issue? > >> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke
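One way to rule that out is to pin the format explicitly instead of relying on the default (a sketch; the table name is taken from the thread, and whether saveAsTable fits the Impala-created table is not confirmed there):

    import org.apache.spark.sql.SaveMode

    ds.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable("db.parqut_table")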

a set of practice and LAB

2017-08-20 Thread Mohsen Pahlevanzadeh
Dear All, I need a set of practice exercises and labs for Spark and Hadoop. You would make me happy with your help. Yours, Mohsen

Re: Question on how to get appended data from structured streaming

2017-08-20 Thread Yanpeng Lin
I am currently trying to implement some online algorithms on top of structured streaming. My requirement is to fetch only the delta data at each trigger from memory, while calculating and updating global variables at the same time. Here are 2 points I found difficult: 1. with the foreach writer, it

Re: Question on how to get appended data from structured streaming

2017-08-20 Thread Michael Armbrust
What is your end goal? Right now the foreach writer is the way to do arbitrary processing on the data produced by various output modes. On Sun, Aug 20, 2017 at 12:23 PM, Yanpeng Lin wrote: > Hello, > > I am new to Spark. > It would be appreciated if anyone could help me
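A minimal ForeachWriter sketch for Spark 2.x structured streaming (the per-row logic is a placeholder, and streamingDf is an assumed streaming DataFrame):

    import org.apache.spark.sql.{ForeachWriter, Row}

    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true  // open connections/state here
      def process(row: Row): Unit = println(row)                  // arbitrary per-row processing
      def close(errorOrNull: Throwable): Unit = ()                // release resources
    }

    streamingDf.writeStream.foreach(writer).outputMode("append").start()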

Re: Does Spark SQL uses Calcite?

2017-08-20 Thread kant kodali
Hi Jules, I am looking to connect to Spark via JDBC so I can run Spark SQL queries via JDBC, but not use Spark SQL to connect to other JDBC sources. Thanks! On Sat, Aug 19, 2017 at 5:54 PM, Jules Damji wrote: > Try this link to see how you may connect
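For that use case the Spark Thrift Server is the usual route: it speaks the HiveServer2 protocol, so the Hive JDBC driver can run Spark SQL against it (a sketch; host and port are the usual defaults, not confirmed from the thread):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000")
    val rs   = conn.createStatement().executeQuery("select 1")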

Huber regression in PySpark?

2017-08-20 Thread Jeff Gates
Hi guys, Is there huber regression in PySpark? We are using sklearn HuberRegressor (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html) to train our model, but hit a bottleneck on a single node. If not, is there any obstacle to implementing it in PySpark?

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you tried directly in Hive how the performance is? In which format do you expect Hive to write? Have you made sure it is in this format? It could be that you use an inefficient format (e.g. CSV + bzip2). > On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed > wrote:

Spark billing on shared Clusters

2017-08-20 Thread Jorge Machado
Hi everyone, I was wondering how it is possible to do Spark / YARN accounting on a shared cluster based on resource usage. I found out that there is no built-in way to do that, so I developed hbilling to deal with this. Is anyone interested in a quick demo? More info at: www.hbilling.io