Re: Cleaning spark memory

2016-06-10 Thread Ricky
wml

Sent from my iPhone

-- Original message --
From: Takeshi Yamamuro
Date: 2016-06-11 12:15
To: Cesar Flores
Cc: user
Subject: Re: Cleaning spark memory

Re: Cleaning spark memory

2016-06-10 Thread Takeshi Yamamuro
Hi,

If you want to remove all cached data, please use `SQLContext#clearCache`.

// maropu
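For reference, a minimal Scala sketch combining this with unpersisting any RDDs the SparkContext still tracks (assuming a SQLContext named sqlContext and a SparkContext named sc):

// Drop everything cached through the SQLContext (tables / DataFrames).
sqlContext.clearCache()

// Unpersist any RDDs still registered as persistent, even if the
// original variable references have been lost.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))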

On Sat, Jun 11, 2016 at 3:18 AM, Cesar Flores  wrote:

>
> Hello:
>
> Sometimes I cache data frames to memory that I forget to unpersist, losing
> the variable reference in the process.
>
> Is there a way of (i) recovering references to data frames that are still
> persisted in memory, OR (ii) just unpersisting all Spark cached variables?
>
>
> Thanks
> --
> Cesar Flores
>



-- 
---
Takeshi Yamamuro


Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Thanks for the suggestion. Can you suggest where and how I should start
from scratch to work on Spark?

On Fri, Jun 10, 2016 at 8:18 PM, Holden Karau  wrote:

> So that's a bit complicated - you might want to start with reading the
> code for the existing algorithms and go from there. If your goal is to
> contribute the algorithm to Spark you should probably take a look at the
> JIRA as well as the contributing to Spark guide on the wiki. Also we have a
> separate list (dev@) for people looking to work on Spark itself.
> I'd also recommend a smaller first project building new functionality in
> Spark as a good starting point rather than adding a new algorithm right
> away, since you learn a lot in the process of making your first
> contribution.
>
> On Friday, June 10, 2016, Ram Krishna  wrote:
>
>> Hi All,
>>
>> How to add new ML algo in Spark MLlib.
>>
>> On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna 
>> wrote:
>>
>>> Hi All,
>>>
>>> I am new to this field and I want to implement a new ML algo using Spark
>>> MLlib. What is the procedure?
>>>
>>> --
>>> Regards,
>>> Ram Krishna KT
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Ram Krishna KT
>>
>>
>>
>>
>>
>>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


-- 
Regards,
Ram Krishna KT


Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Mich, I completely agree with you. I built another Spark SQL application
which reads data from MySQL and SQL Server and writes the data into
Hive (Parquet + Snappy format). I have this problem only when I read directly
from the remote SAS system. The interesting part is that I am using the same
driver to read data through the pure Java app and the Spark app. It works fine
in the Java app, so I cannot blame the SAS driver here. I am trying to
understand where the problem could be. Thanks for sharing this with me.
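In recent Spark 1.x releases the JDBC data source quotes every column name through a JdbcDialect (its quoteIdentifier method) when it builds the SELECT, which would explain why the Spark query differs from the plain JDBC one. If the SAS IOM driver cannot handle quoted identifiers, one possible workaround is registering a custom dialect that leaves names unquoted. A hedged Scala sketch, where the URL prefix and the no-op quoting are assumptions to adapt to your setup:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect for the SAS IOM driver: match its JDBC URL prefix
// and return column names as-is instead of the default double-quoted form.
object SasDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sasiom")
  override def quoteIdentifier(colName: String): String = colName
}

// Register it before creating the DataFrame with sqlContext.read.jdbc(...).
JdbcDialects.registerDialect(SasDialect)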

On Friday, June 10, 2016, Mich Talebzadeh  wrote:

> I personally use Scala to do something similar. For example here I extract
> data from an Oracle table and store in ORC table in Hive. This is compiled
> via sbt as run with SparkSubmit.
>
> It is similar to your code but in Scala. Note that I do not enclose my
> column names in double quotes.
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
>
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().
>setAppName("ETL_scratchpad_dummy").
>set("spark.driver.allowMultipleContexts", "true")
>   val sc = new SparkContext(conf)
>   // Create sqlContext based on HiveContext
>   val sqlContext = new HiveContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   HiveContext.sql("use oraclehadoop")
>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>   var _username : String = "scratchpad"
>   var _password : String = ""
>
>   // Get data from Oracle table scratchpad.dummy
>   val d = HiveContext.load("jdbc",
>   Map("url" -> _ORACLEserver,
>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>   "user" -> _username,
>   "password" -> _password))
>
>d.registerTempTable("tmp")
>   //
>   // Need to create and populate target ORC table oraclehadoop.dummy
>   //
>   HiveContext.sql("use oraclehadoop")
>   //
>   // Drop and create table dummy
>   //
>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>   var sqltext : String = ""
>   sqltext = """
>   CREATE TABLE oraclehadoop.dummy (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
>   )
>   CLUSTERED BY (ID) INTO 256 BUCKETS
>   STORED AS ORC
>   TBLPROPERTIES (
>   "orc.create.index"="true",
>   "orc.bloom.filter.columns"="ID",
>   "orc.bloom.filter.fpp"="0.05",
>   "orc.compress"="SNAPPY",
>   "orc.stripe.size"="16777216",
>   "orc.row.index.stride"="1" )
>   """
>HiveContext.sql(sqltext)
>   //
>   // Put data in Hive table. Clean up is already done
>   //
>   sqltext = """
>   INSERT INTO TABLE oraclehadoop.dummy
>   SELECT
>   ID
> , CLUSTERED
> , SCATTERED
> , RANDOMISED
> , RANDOM_STRING
> , SMALL_VC
> , PADDING
>   FROM tmp
>   """
>HiveContext.sql(sqltext)
>   println ("\nFinished at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   sys.exit()
>  }
> }
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 June 2016 at 23:38, Ajay Chander  > wrote:
>
>> Hi Mich,
>>
>> Thanks for the response. If you look at my programs, I am not writing my
>> queries to include column names in a pair of "". My driver in spark
>> program is generating such query with column names in "" which I do not
>> want. On the other hand, I am using the same driver in my pure Java program
>> which is attached, in that program the same driver is generating a proper
>> sql query without "".
>>
>> Pure Java log:
>>
>> 2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>> Spark SQL log:
>>
>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
>> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>
>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>> 

Neither previous window has value for key, nor new values found

2016-06-10 Thread Marco Platania
Hi all, 

I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The 
window interval is 2 hours, while the slide interval is 1 hour. I have a 
JavaPairRDD in which both keys and values are strings. Each time the 
reduceByKeyAndWindow() function is called, it uses appendString() and 
removeString() functions below to incrementally build a windowed stream of 
data: 

Function2<String, String, String> appendString = new Function2<String, String, String>() {
      @Override 
      public String call(String s1, String s2) { 
        return s1 + s2; 
      } 
    }; 

    Function2<String, String, String> removeString = new Function2<String, String, String>() {
      @Override 
      public String call(String s1, String s2) { 
        return s1.replace(s2, ""); 
      } 
    }; 

filterEmptyRecords() removes keys that eventually won't contain any value: 

    Function<scala.Tuple2<String, String>, Boolean> filterEmptyRecords = new
Function<scala.Tuple2<String, String>, Boolean>() {
      @Override 
      public Boolean call(scala.Tuple2<String, String> t) {
        return (!t._2().isEmpty()); 
      } 
    }; 

The windowed operation is then: 

JavaPairDStream<String, String> cdr_kv =
cdr_filtered.reduceByKeyAndWindow(appendString, removeString, 
Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION), 
PARTITIONS, filterEmptyRecords); 

After a few hours of operation, this function raises the following exception: 
"Neither previous window has value for key, nor new values found. Are you sure 
your key class hashes consistently?" 

I've found this post from 2013: 
https://groups.google.com/forum/#!msg/spark-users/9OM1YvWzwgE/PhFgdSTP2OQJ
which however doesn't solve my problem. I'm using String to represent keys, 
which I'm pretty sure hash consistently. 

Any clue why this happens and possible suggestions to fix it? 

Thanks!


Slow collecting of large Spark Data Frames into R

2016-06-10 Thread Jonathan Mortensen
Hey Everyone!

I've been converting between Parquet <-> Spark Data Frames <-> R Data
Frames for larger data sets. I have found the conversion speed quite
slow in the Spark <-> R side and am looking for some insight on how to
speed it up (or determine what I have failed to do properly)!

In R, "sparkR::collect" and "sparkR::write.df" take much longer than
Spark reading and writing Parquet. While these aren’t the same
operations, the difference suggests that there is a bottleneck within
the translation between R data frames and Spark Data Frames. A profile
of the SparkR code shows that R is spending a large portion of its
time within "sparkR:::readTypedObject", "sparkR:::readBin", and
"sparkR:::readObject". To me, this suggests that the serialization
step accounts for the slow speed, but I don't want to guess too much.
Any thoughts on how to speed the conversion?

Details:
Tried with Spark 2.0 and 1.6.1 (and the associated SparkR package) and R 3.3.0.
On a MacBook Pro, 16GB RAM, quad core.

+--------+-----------+-----------------+------------------+
| # Rows | # Columns | sparkR::collect | sparkR::write.df |
+--------+-----------+-----------------+------------------+
| 600K   | 20        | 3 min           | 6 min            |
+--------+-----------+-----------------+------------------+
| 1.8M   | 20        | 9 min           | 20 min           |
+--------+-----------+-----------------+------------------+
| 600K   | 1         | 40 sec          | 4 min            |
+--------+-----------+-----------------+------------------+

Thanks!
Jonathan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
I personally use Scala to do something similar. For example here I extract
data from an Oracle table and store in ORC table in Hive. This is compiled
via sbt as run with SparkSubmit.

It is similar to your code but in Scala. Note that I do not enclose my
column names in double quotes.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object ETL_scratchpad_dummy {
  def main(args: Array[String]) {
  val conf = new SparkConf().
   setAppName("ETL_scratchpad_dummy").
   set("spark.driver.allowMultipleContexts", "true")
  val sc = new SparkContext(conf)
  // Create sqlContext based on HiveContext
  val sqlContext = new HiveContext(sc)
  import sqlContext.implicits._
  val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  println ("\nStarted at"); sqlContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
").collect.foreach(println)
  HiveContext.sql("use oraclehadoop")
  var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
  var _username : String = "scratchpad"
  var _password : String = ""

  // Get data from Oracle table scratchpad.dummy
  val d = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
  "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED,
RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
  "user" -> _username,
  "password" -> _password))

   d.registerTempTable("tmp")
  //
  // Need to create and populate target ORC table oraclehadoop.dummy
  //
  HiveContext.sql("use oraclehadoop")
  //
  // Drop and create table dummy
  //
  HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
  var sqltext : String = ""
  sqltext = """
  CREATE TABLE oraclehadoop.dummy (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
  )
  CLUSTERED BY (ID) INTO 256 BUCKETS
  STORED AS ORC
  TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.compress"="SNAPPY",
  "orc.stripe.size"="16777216",
  "orc.row.index.stride"="1" )
  """
   HiveContext.sql(sqltext)
  //
  // Put data in Hive table. Clean up is already done
  //
  sqltext = """
  INSERT INTO TABLE oraclehadoop.dummy
  SELECT
  ID
, CLUSTERED
, SCATTERED
, RANDOMISED
, RANDOM_STRING
, SMALL_VC
, PADDING
  FROM tmp
  """
   HiveContext.sql(sqltext)
  println ("\nFinished at"); sqlContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
").collect.foreach(println)
  sys.exit()
 }
}

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 10 June 2016 at 23:38, Ajay Chander  wrote:

> Hi Mich,
>
> Thanks for the response. If you look at my programs, I am not writing my
> queries to include column names in a pair of "". My driver in spark
> program is generating such query with column names in "" which I do not
> want. On the other hand, I am using the same driver in my pure Java program
> which is attached, in that program the same driver is generating a proper
> sql query without "".
>
> Pure Java log:
>
> 2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
> 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
> Spark SQL log:
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
> Please find complete program and full logs attached in the below thread.
> Thank you.
>
> Regards,
> Ajay
>
>
> On Friday, June 10, 2016, Mich Talebzadeh 
> wrote:
>
>> Assuming I understood your query, in Spark SQL (that is, you log in to
>> Spark SQL with spark-sql --master spark://:7077) you do not
>> need double quotes around column names for SQL to work.
>>
>> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
>> hello from Mich
>>
>> Anything between a pair of "" will be interpreted as text NOT column name.
>>
>> In Spark SQL you do not need double quotes. So simply
>>
>> spark-sql> select prod_id, cust_id from sales limit 2;
>> 17  28017
>> 18  10419
>>
>> HTH

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Mich,

Thanks for the response. If you look at my programs, I am not writing my
queries to include column names in a pair of "". My driver in spark
program is generating such query with column names in "" which I do not
want. On the other hand, I am using the same driver in my pure Java program
which is attached, in that program the same driver is generating a proper
sql query without "".

Pure Java log:

2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
Spark SQL log:

[2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)

[2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
Please find complete program and full logs attached in the below thread.
Thank you.

Regards,
Ajay

On Friday, June 10, 2016, Mich Talebzadeh  wrote:

> Assuming I understood your query, in Spark SQL (that is, you log in to
> Spark SQL with spark-sql --master spark://:7077) you do not
> need double quotes around column names for SQL to work.
>
> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
> hello from Mich
>
> Anything between a pair of "" will be interpreted as text NOT column name.
>
> In Spark SQL you do not need double quotes. So simply
>
> spark-sql> select prod_id, cust_id from sales limit 2;
> 17  28017
> 18  10419
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 June 2016 at 21:54, Ajay Chander  > wrote:
>
>> Hi again, anyone in this group tried to access SAS dataset through Spark
>> SQL ? Thank you
>>
>> Regards,
>> Ajay
>>
>>
>> On Friday, June 10, 2016, Ajay Chander > > wrote:
>>
>>> Hi Spark Users,
>>>
>>> I hope everyone here are doing great.
>>>
>>> I am trying to read data from SAS through Spark SQL and write into HDFS.
>>> Initially, I started with pure java program please find the program and
>>> logs in the attached file sas_pure_java.txt . My program ran
>>> successfully and it returned the data from Sas to Spark_SQL. Please
>>> note the highlighted part in the log.
>>>
>>> My SAS dataset has 4 rows,
>>>
>>> Program ran successfully. So my output is,
>>>
>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>
>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next);
>>> time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>
>>> 1,'2016-01-01','2016-01-31'
>>>
>>> 2,'2016-02-01','2016-02-29'
>>>
>>> 3,'2016-03-01','2016-03-31'
>>>
>>> 4,'2016-04-01','2016-04-30'
>>>
>>>
>>> Please find the full logs attached to this email in file
>>> sas_pure_java.txt.
>>>
>>> ___
>>>
>>>
>>> Now I am trying to do the same via Spark SQL. Please find my program
>>> and logs attached to this email in file sas_spark_sql.txt .
>>>
>>> Connection to SAS dataset is established successfully. But please note
>>> the highlighted log below.
>>>
>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>
>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result
>>> set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>> Please find the full logs attached to this email in file
>>>  sas_spark_sql.txt
>>>
>>> I am using same driver in both pure java and spark sql programs. But the
>>> query generated in spark sql has quotes around the column names(Highlighted
>>> above).
>>> So my resulting output for that query is like this,
>>>
>>> +-++--+
>>> |  _c0| _c1|   _c2|
>>> +-++--+
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> +-++--+
>>>
>>> Since both programs are using the same driver com.sas.rio.MVADriver .
>>> Expected output should be same as my pure java programs output. But
>>> something else is happening behind the scenes.
>>>
>>> Any insights on this issue. Thanks for your time.
>>>
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>
>


Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
Assuming I understood your query, in Spark SQL (that is, you log in to Spark
SQL with spark-sql --master spark://:7077) you do not need
double quotes around column names for SQL to work.

spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
hello from Mich

Anything between a pair of "" will be interpreted as text NOT column name.

In Spark SQL you do not need double quotes. So simply

spark-sql> select prod_id, cust_id from sales limit 2;
17  28017
18  10419

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 10 June 2016 at 21:54, Ajay Chander  wrote:

> Hi again, anyone in this group tried to access SAS dataset through Spark
> SQL ? Thank you
>
> Regards,
> Ajay
>
>
> On Friday, June 10, 2016, Ajay Chander  wrote:
>
>> Hi Spark Users,
>>
>> I hope everyone here are doing great.
>>
>> I am trying to read data from SAS through Spark SQL and write into HDFS.
>> Initially, I started with pure java program please find the program and
>> logs in the attached file sas_pure_java.txt . My program ran
>> successfully and it returned the data from Sas to Spark_SQL. Please note
>> the highlighted part in the log.
>>
>> My SAS dataset has 4 rows,
>>
>> Program ran successfully. So my output is,
>>
>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>
>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time=
>> 0.045 secs (com.sas.rio.MVAResultSet:773)
>>
>> 1,'2016-01-01','2016-01-31'
>>
>> 2,'2016-02-01','2016-02-29'
>>
>> 3,'2016-03-01','2016-03-31'
>>
>> 4,'2016-04-01','2016-04-30'
>>
>>
>> Please find the full logs attached to this email in file
>> sas_pure_java.txt.
>>
>> ___
>>
>>
>> Now I am trying to do the same via Spark SQL. Please find my program and
>> logs attached to this email in file sas_spark_sql.txt .
>>
>> Connection to SAS dataset is established successfully. But please note
>> the highlighted log below.
>>
>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
>> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>
>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
>> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>> Please find the full logs attached to this email in file
>>  sas_spark_sql.txt
>>
>> I am using same driver in both pure java and spark sql programs. But the
>> query generated in spark sql has quotes around the column names(Highlighted
>> above).
>> So my resulting output for that query is like this,
>>
>> +-++--+
>> |  _c0| _c1|   _c2|
>> +-++--+
>> |SR_NO|start_dt|end_dt|
>> |SR_NO|start_dt|end_dt|
>> |SR_NO|start_dt|end_dt|
>> |SR_NO|start_dt|end_dt|
>> +-++--+
>>
>> Since both programs are using the same driver com.sas.rio.MVADriver .
>> Expected output should be same as my pure java programs output. But
>> something else is happening behind the scenes.
>>
>> Any insights on this issue. Thanks for your time.
>>
>>
>> Regards,
>>
>> Ajay
>>
>


Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi again, has anyone in this group tried to access a SAS dataset through Spark
SQL? Thank you.

Regards,
Ajay

On Friday, June 10, 2016, Ajay Chander  wrote:

> Hi Spark Users,
>
> I hope everyone here are doing great.
>
> I am trying to read data from SAS through Spark SQL and write into HDFS.
> Initially, I started with pure java program please find the program and
> logs in the attached file sas_pure_java.txt . My program ran successfully
> and it returned the data from Sas to Spark_SQL. Please note the
> highlighted part in the log.
>
> My SAS dataset has 4 rows,
>
> Program ran successfully. So my output is,
>
> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
> 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>
> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time=
> 0.045 secs (com.sas.rio.MVAResultSet:773)
>
> 1,'2016-01-01','2016-01-31'
>
> 2,'2016-02-01','2016-02-29'
>
> 3,'2016-03-01','2016-03-31'
>
> 4,'2016-04-01','2016-04-30'
>
>
> Please find the full logs attached to this email in file sas_pure_java.txt.
>
> ___
>
>
> Now I am trying to do the same via Spark SQL. Please find my program and
> logs attached to this email in file sas_spark_sql.txt .
>
> Connection to SAS dataset is established successfully. But please note
> the highlighted log below.
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
> Please find the full logs attached to this email in file
>  sas_spark_sql.txt
>
> I am using same driver in both pure java and spark sql programs. But the
> query generated in spark sql has quotes around the column names(Highlighted
> above).
> So my resulting output for that query is like this,
>
> +-++--+
> |  _c0| _c1|   _c2|
> +-++--+
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> +-++--+
>
> Since both programs are using the same driver com.sas.rio.MVADriver .
> Expected output should be same as my pure java programs output. But
> something else is happening behind the scenes.
>
> Any insights on this issue. Thanks for your time.
>
>
> Regards,
>
> Ajay
>


Re: Saving Parquet files to S3

2016-06-10 Thread Bijay Kumar Pathak
Hi Ankur,

I also tried setting a property to write Parquet files of 256MB. I am
using PySpark; below is how I set the property, but it's not working for me.
How did you set the property?


spark_context._jsc.hadoopConfiguration().setInt( "dfs.blocksize", 268435456)
spark_context._jsc.hadoopConfiguration().setInt( "parquet.block.size",
268435)

Thanks,
Bijay
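For comparison, a hedged Scala sketch of the same idea (the values are plain byte counts, so 256 MB is 268435456; the property names are the standard Hadoop/Parquet ones, and the DataFrame and S3 path are placeholders):

// Set the HDFS and Parquet block sizes (in bytes) before writing.
val blockSize = 256 * 1024 * 1024  // 268435456 bytes = 256 MB
sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)

// Subsequent Parquet writes should pick up the new row-group size.
df.write.parquet("s3://my-bucket/output/")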

On Fri, Jun 10, 2016 at 5:24 AM, Ankur Jain  wrote:

> Thanks maropu.. It worked…
>
>
>
> *From:* Takeshi Yamamuro [mailto:linguin@gmail.com]
> *Sent:* 10 June 2016 11:47 AM
> *To:* Ankur Jain
> *Cc:* user@spark.apache.org
> *Subject:* Re: Saving Parquet files to S3
>
>
>
> Hi,
>
>
>
> You'd better off `setting parquet.block.size`.
>
>
>
> // maropu
>
>
>
> On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
> I don't believe there's any way to output files of a specific size. What
> you can do is partition your data into a number of partitions such that the
> amount of data they each contain is around 1 GB.
>
>
>
> On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain  wrote:
>
> Hello Team,
>
>
>
> I want to write parquet files to AWS S3, but I want to size each file size
> to 1 GB.
>
> Can someone please guide me on how I can achieve the same?
>
>
>
> I am using AWS EMR with spark 1.6.1.
>
>
>
> Thanks,
>
> Ankur
>
> Information transmitted by this e-mail is proprietary to YASH Technologies
> and/ or its Customers and is intended for use only by the individual or
> entity to which it is addressed, and may contain information that is
> privileged, confidential or exempt from disclosure under applicable law. If
> you are not the intended recipient or it appears that this mail has been
> forwarded to you without proper authority, you are notified that any use or
> dissemination of this information in any manner is strictly prohibited. In
> such cases, please notify us immediately at i...@yash.com and delete this
> mail from your records.
>
>
>
>
>
>
>
> --
>
> ---
> Takeshi Yamamuro
> Information transmitted by this e-mail is proprietary to YASH Technologies
> and/ or its Customers and is intended for use only by the individual or
> entity to which it is addressed, and may contain information that is
> privileged, confidential or exempt from disclosure under applicable law. If
> you are not the intended recipient or it appears that this mail has been
> forwarded to you without proper authority, you are notified that any use or
> dissemination of this information in any manner is strictly prohibited. In
> such cases, please notify us immediately at i...@yash.com and delete this
> mail from your records.
>


Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi Gavin,

for the first time someone is responding to this thread with a meaningful
conversation - thanks for that.

Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter,
and since the cached field was around 75 MB I do not think that a
broadcast join was used.

But I will surely be excited to see if I am going wrong here and post the
results of sql.describe(). Thanks a ton once again.
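For anyone following along, a hedged Scala sketch of how the chosen plan and the broadcast threshold can be inspected (DataFrame.explain(true) prints the logical and physical plans; the query and threshold value below are placeholders taken from this thread):

// Print the plans Catalyst chose, including whether a broadcast join is used.
val df = sqlContext.sql(
  "SELECT A.PK, B.FK FROM A LEFT OUTER JOIN B ON (A.PK = B.FK) WHERE B.FK IS NOT NULL")
df.explain(true)

// Tables with an estimated size below this threshold (in bytes) are broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)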


Hi Ted,

Is there anyway you can throw some light on this before I post this in a
blog?


Regards,
Gourav Sengupta


On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue  wrote:

> Yes, because in the second query you did a (select PK from A) A. I
> guess the subquery makes the results much smaller and enables the
> broadcast join, so it is much faster.
>
> you could use sql.describe() to check the execution plan.
>
>
> On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I think if we try to see why is Query 2 faster than Query 1 then all the
>> answers will be given without beating around the bush. That is the right
>> way to find out what is happening and why.
>>
>>
>> Regards,
>> Gourav
>>
>> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue 
>> wrote:
>>
>>> Could you print out the sql execution plan? My guess is about broadcast
>>> join.
>>>
>>>
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta 
>>> wrote:
>>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> ---
>>> ENVIRONMENT:
>>> ---
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --
>>> QUERY1:
>>> --
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --
>>> QUERY 2:
>>> --
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>


Neither previous window has value for key, nor new values found.

2016-06-10 Thread Marco1982
Hi all,

I'm running a Spark Streaming application that uses reduceByKeyAndWindow().
The window interval is 2 hours, while the slide interval is 1 hour. I have a
JavaPairRDD in which both keys and values are strings. Each time the
reduceByKeyAndWindow() function is called, it uses appendString() and
removeString() functions below to incrementally build a windowed stream of
data:

Function2<String, String, String> appendString = new Function2<String, String, String>() {
  @Override
  public String call(String s1, String s2) {
return s1 + s2;
  }
};

Function2<String, String, String> removeString = new Function2<String, String, String>() {
  @Override
  public String call(String s1, String s2) {
return s1.replace(s2, "");
  }
};

filterEmptyRecords() removes keys that eventually won't contain any value:

Function<scala.Tuple2<String, String>, Boolean> filterEmptyRecords =
new Function<scala.Tuple2<String, String>, Boolean>() {
  @Override
  public Boolean call(scala.Tuple2<String, String> t) {
return (!t._2().isEmpty());
  }
};

The windowed operation is then:

JavaPairDStream<String, String> cdr_kv =
cdr_filtered.reduceByKeyAndWindow(appendString, removeString,
Durations.seconds(WINDOW_DURATION), Durations.seconds(SLIDE_DURATION),
PARTITIONS, filterEmptyRecords);

After a few hours of operation, this function raises the following
exception:
"Neither previous window has value for key, nor new values found. Are you
sure your key class hashes consistently?"

I've found this post from 2013:
https://groups.google.com/forum/#!msg/spark-users/9OM1YvWzwgE/PhFgdSTP2OQJ
which however doesn't solve my problem. I'm using String to represent keys,
which I'm pretty sure hash consistently.

Any clue why this happens and possible suggestions to fix it?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Neither-previous-window-has-value-for-key-nor-new-values-found-tp27140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ 
to say "Apache Spark" instead of just "Spark". Many ASF projects have been 
doing this recently to make it clearer that they are associated with the ASF, 
and indeed the ASF's branding guidelines generally require that projects be 
referred to as "Apache X" in various settings, especially in related commercial 
or open source products (https://www.apache.org/foundation/marks/). If you have 
any kind of site or product that uses Spark logo, it would be great to update 
to this full one.

There are EPS versions of the logo available at 
https://spark.apache.org/images/spark-logo.eps and 
https://spark.apache.org/images/spark-logo-reverse.eps; before using these also 
check https://www.apache.org/foundation/marks/.

Matei
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



DataFrame.foreach(scala.Function1) example

2016-06-10 Thread Mohammad Tariq
Dear fellow spark users,

Could someone please point me to any example showcasing the usage of
*DataFrame.foreach(scala.Function1)* in *Java*?

*Problem statement :* I am reading data from a Kafka topic, and for each
RDD in the DStream I am creating a DataFrame in order to perform some
operations. After this I have to call *DataFrame.javaRDD* to convert the
resulting DF back into a *JavaRDD* so that I can perform further
computation on each record in this RDD through *JavaRDD.foreach*.

However, I wish to remove this additional hop of creating JavaRDD from the
resultant DF. I would like to use *DataFrame.foreach(scala.Function1)*
instead, and perform the computation directly on each *Row* of this DF
rather than having to first convert the DF into JavaRDD and then performing
any computations.

For interfaces like *Function* and *VoidFunction* (provided by the
org.apache.spark.api.java.function API) I could implement named classes and
use them in my code. But I am having a hard time figuring out how to use
*scala.Function1*.

Thank you so much for your valuable time. Really appreciate it!


Tariq, Mohammad
about.me/mti



Pls assist: Spark DecisionTree question

2016-06-10 Thread Marco Mistroni
Hi all,
I am trying to run an ML program against some data, using DecisionTrees.
To fine-tune the parameters, I am running this loop to find the optimal
values for impurity, depth and bins:

for (impurity <- Array("gini", "entropy");
     depth <- Array(1, 2, 3, 4, 5);
     bins <- Array(10, 20, 25, 28)) yield {
  val model = DecisionTree.trainClassifier(
    trainingData, numClasses, categoricalFeaturesInfo,
    impurity, depth, bins)

  val accuracy = getMetrics(model, testData).precision
  ((impurity, depth, bins), accuracy)
}

Could anyone explain why, if I run my program multiple times against the SAME
data, I get different optimal results for the parameters above?
I assume that if I run the loop above against the same data I will always get
the same results.
To give you an example, run 1 returned the following top results:

((gini,4,28),0.8)
((gini,4,25),0.8)
((gini,3,28),0.8)
((gini,3,25),0.8)
((entropy,3,28),0.7333)

while run 2 gives me these top results:


((entropy,2,28),0.6842105263157895)
((entropy,2,25),0.6842105263157895)
((entropy,2,20),0.6842105263157895)
((entropy,2,10),0.6842105263157895)
((entropy,1,28),0.684210526315789

could anyone explain why?

kind regards
 marco
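One common source of run-to-run variation that is not visible in the snippet above is an unseeded train/test split. If the data is split with randomSplit, pinning the seed makes the tuning loop repeatable. A hedged sketch, assuming the usual RDD[LabeledPoint] named data:

// With an explicit seed the same split (and hence the same accuracies)
// is produced on every run against the same input data.
val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2), seed = 12345L)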


Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gavin Yue
Yes, because in the second query you did a (select PK from A) A. I
guess the subquery makes the results much smaller and enables the
broadcast join, so it is much faster.

you could use sql.describe() to check the execution plan.


On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta 
wrote:

> Hi,
>
> I think if we try to see why is Query 2 faster than Query 1 then all the
> answers will be given without beating around the bush. That is the right
> way to find out what is happening and why.
>
>
> Regards,
> Gourav
>
> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue  wrote:
>
>> Could you print out the sql execution plan? My guess is about broadcast
>> join.
>>
>>
>>
>> On Jun 9, 2016, at 07:14, Gourav Sengupta 
>> wrote:
>>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
>> and is there a way we can optimize the queries in SPARK without the obvious
>> hack in Query2.
>>
>>
>> ---
>> ENVIRONMENT:
>> ---
>>
>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>> million rows. Both the files are single gzipped csv file.
>> > Both table A and B are external tables in AWS S3 and created in HIVE
>> accessed through SPARK using HiveContext
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>> allowMaximumResource allocation and node types are c3.4xlarge).
>>
>> --
>> QUERY1:
>> --
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>>
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>
>>
>> --
>> QUERY 2:
>> --
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK
>>
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>


Cleaning spark memory

2016-06-10 Thread Cesar Flores
Hello:

Sometimes I cache data frames to memory that I forget to unpersist, losing
the variable reference in the process.

Is there a way of (i) recovering references to data frames that are still
persisted in memory, OR (ii) just unpersisting all Spark cached variables?


Thanks
-- 
Cesar Flores


Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
Right, without knowing exactly what the code is, it is difficult to say.

Do you analyze the stuff from your Spark GUI? For example looking at the
amount of spillage and spill size as the DAG diagram shows below?

[image: Spark UI DAG diagram]
After three days is a short period of time, so it is concerning!


HTH

P.S. What is the nature of this spark streaming if you can divulge on it?

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 10 June 2016 at 18:48, John Simon  wrote:

> Hi Mich,
>
> batch interval is 10 seconds, and I don't use sliding window.
> Typical message count per batch is ~100k.
>
>
> --
> John Simon
>
> On Fri, Jun 10, 2016 at 10:31 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi John,
>>
>> I did not notice anything unusual in your env variables.
>>
>> However, what are the batch interval, the windowsLength and
>> SlindingWindow interval.
>>
>> Also how many messages are sent by Kafka in a typical batch interval?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 10 June 2016 at 18:21, john.simon  wrote:
>>
>>> Hi all,
>>>
>>> I'm running Spark Streaming with Kafka Direct Stream, but after
>>> running a couple of days, the batch processing time almost doubles.
>>> I didn't find any slowdown on JVM GC logs, but I did find that Spark
>>> broadcast variable reading time increasing.
>>> Initially it takes less than 10ms, but after 3 days it takes more than
>>> 60ms. It's really puzzling since I don't use broadcast variables at
>>> all.
>>>
>>> My application needs to run 24/7, so I hope there's something I'm
>>> missing to correct this behavior.
>>>
>>> FYI, we're running on AWS EMR with Spark version 1.6.1, in YARN client
>>> mode.
>>> Attached spark application environment settings file.
>>>
>>> --
>>> John Simon
>>>
>>> *environment.txt* (7K) Download Attachment
>>> 
>>>
>>> --
>>> View this message in context: Long Running Spark Streaming getting
>>> slower
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>>  at Nabble.com.
>>>
>>
>>
>


Re: Long Running Spark Streaming getting slower

2016-06-10 Thread John Simon
Hi Mich,

batch interval is 10 seconds, and I don't use sliding window.
Typical message count per batch is ~100k.


--
John Simon

On Fri, Jun 10, 2016 at 10:31 AM, Mich Talebzadeh  wrote:

> Hi John,
>
> I did not notice anything unusual in your env variables.
>
> However, what are the batch interval, the windowsLength and SlindingWindow
> interval.
>
> Also how many messages are sent by Kafka in a typical batch interval?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 June 2016 at 18:21, john.simon  wrote:
>
>> Hi all,
>>
>> I'm running Spark Streaming with Kafka Direct Stream, but after
>> running a couple of days, the batch processing time almost doubles.
>> I didn't find any slowdown on JVM GC logs, but I did find that Spark
>> broadcast variable reading time increasing.
>> Initially it takes less than 10ms, but after 3 days it takes more than
>> 60ms. It's really puzzling since I don't use broadcast variables at
>> all.
>>
>> My application needs to run 24/7, so I hope there's something I'm
>> missing to correct this behavior.
>>
>> FYI, we're running on AWS EMR with Spark version 1.6.1, in YARN client
>> mode.
>> Attached spark application environment settings file.
>>
>> --
>> John Simon
>>
>> *environment.txt* (7K) Download Attachment
>> 
>>
>> --
>> View this message in context: Long Running Spark Streaming getting slower
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>


Re: Error writing parquet to S3

2016-06-10 Thread Peter Halliday
Has anyone else seen this before? Previously when I saw this there was an OOM,
but that doesn't seem to be the case here. Of course, I'm not sure how large the
file that created this was, either.

Peter 


> On Jun 9, 2016, at 9:00 PM, Peter Halliday  wrote:
> 
> I’m not 100% sure why I’m getting this.  I don’t see any errors before this 
> at all.  I’m not sure how to diagnose this.
> 
> 
> Peter Halliday
> 
> 
> 
> 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager 
> [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, 
> ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed 
> while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid 
> state. Probably caused by an error thrown previously. Current state: COLUMN
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
>   ... 8 more
> 



Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
Hi John,

I did not notice anything unusual in your env variables.

However, what are the batch interval, the windowLength and the slidingWindow
interval?

Also how many messages are sent by Kafka in a typical batch interval?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 10 June 2016 at 18:21, john.simon  wrote:

> Hi all,
>
> I'm running Spark Streaming with Kafka Direct Stream, but after
> running a couple of days, the batch processing time almost doubles.
> I didn't find any slowdown on JVM GC logs, but I did find that Spark
> broadcast variable reading time increasing.
> Initially it takes less than 10ms, but after 3 days it takes more than
> 60ms. It's really puzzling since I don't use broadcast variables at
> all.
>
> My application needs to run 24/7, so I hope there's something I'm
> missing to correct this behavior.
>
> FYI, we're running on AWS EMR with Spark version 1.6.1, in YARN client
> mode.
> Attached spark application environment settings file.
>
> --
> John Simon
>
> *environment.txt* (7K) Download Attachment
> 
>
> --
> View this message in context: Long Running Spark Streaming getting slower
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Long Running Spark Streaming getting slower

2016-06-10 Thread john.simon
Hi all,

I'm running Spark Streaming with Kafka Direct Stream, but after
running a couple of days, the batch processing time almost doubles.
I didn't find any slowdown on JVM GC logs, but I did find that Spark
broadcast variable reading time increasing.
Initially it takes less than 10ms, but after 3 days it takes more than
60ms. It's really puzzling since I don't use broadcast variables at
all.

My application needs to run 24/7, so I hope there's something I'm
missing to correct this behavior.

FYI, we're running on AWS EMR with Spark version 1.6.1, in YARN client mode.
Attached spark application environment settings file.

--
John Simon


environment.txt (7K) 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Long-Running-Spark-Streaming-getting-slower-tp27138.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Spark Users,

I hope everyone here is doing great.

I am trying to read data from SAS through Spark SQL and write it into HDFS.
Initially, I started with a pure Java program; please find the program and
logs in the attached file sas_pure_java.txt. My program ran successfully
and it returned the data from SAS to Spark SQL. Please note the highlighted
part in the log.

My SAS dataset has 4 rows,

Program ran successfully. So my output is,

[2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)

[2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time=
0.045 secs (com.sas.rio.MVAResultSet:773)

1,'2016-01-01','2016-01-31'

2,'2016-02-01','2016-02-29'

3,'2016-03-01','2016-03-31'

4,'2016-04-01','2016-04-30'


Please find the full logs attached to this email in file sas_pure_java.txt.

___


Now I am trying to do the same via Spark SQL. Please find my program and
logs attached to this email in file sas_spark_sql.txt .

The connection to the SAS dataset is established successfully, but please note
the highlighted log below.

[2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)

[2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
Please find the full logs attached to this email in file sas_spark_sql.txt

I am using the same driver in both the pure Java and Spark SQL programs, but
the query generated in Spark SQL has quotes around the column names (highlighted
above).
So my resulting output for that query looks like this:

+-----+--------+------+
|  _c0|     _c1|   _c2|
+-----+--------+------+
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
+-----+--------+------+

Since both programs are using the same driver, com.sas.rio.MVADriver, the
expected output should be the same as my pure Java program's output, but
something else is happening behind the scenes.

Any insights on this issue? Thanks for your time.


Regards,

Ajay
Spark Code to read SAS dataset
-


package com.test.sas.connections;

import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;


public class SparkSasConnectionTest {

public static void main(String args[]) throws SQLException, 
ClassNotFoundException {

SparkConf sc = new 
SparkConf().setAppName("SASSparkJdbcTest").setMaster("local");
@SuppressWarnings("resource")
JavaSparkContext jsc = new JavaSparkContext(sc);
HiveContext hiveContext = new 
org.apache.spark.sql.hive.HiveContext(jsc.sc());

Properties props = new Properties();
props.setProperty("user", "Ajay");
props.setProperty("password", "Ajay");
props.setProperty("librefs", "sasLib '/export/home/Ajay'");
props.setProperty("usesspi", "none");
props.setProperty("encryptionPolicy", "required");
props.setProperty("encryptionAlgorithms", "AES");

Class.forName("com.sas.rio.MVADriver");

Map<String, String> options = new HashMap<String, String>();
options.put("driver", "com.sas.rio.MVADriver");

DataFrame jdbcDF = 
hiveContext.read().jdbc("jdbc:sasiom://remote1.system.com:8594","sasLib.run_control",
 props);
jdbcDF.show();
}

}





Spark Log
-

[2016-06-10 10:28:26,812] INFO Running Spark version 1.5.0 
(org.apache.spark.SparkContext:59)
[2016-06-10 10:28:27,024] WARN Unable to load native-hadoop library for your 
platform... using builtin-java classes where applicable 
(org.apache.hadoop.util.NativeCodeLoader:62)
[2016-06-10 10:28:27,588] INFO Changing view acls to: Ajay 
(org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:27,589] INFO Changing modify acls to: Ajay 
(org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:27,589] INFO SecurityManager: authentication disabled; ui 
acls disabled; users with view permissions: Set(Ajay); users with modify 
permissions: Set(AE10302) (org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:28,012] INFO Slf4jLogger started 
(akka.event.slf4j.Slf4jLogger:80)
[2016-06-10 10:28:28,039] INFO Starting remoting (Remoting:74)
[2016-06-10 10:28:28,153] INFO Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@10.0.0.9:49499] (Remoting:74)
[2016-06-10 10:28:28,158] INFO Successfully started service 

Re: word2vec: how to save an mllib model and reload it?

2016-06-10 Thread sharad82
I am having a problem serializing an ML Word2Vec model.

Am I doing something wrong?


http://stackoverflow.com/questions/37723308/spark-ml-word2vec-serialization-issues

  




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329p27137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
So that's a bit complicated - you might want to start with reading the code
for the existing algorithms and go from there. If your goal is to
contribute the algorithm to Spark you should probably take a look at the
JIRA as well as the contributing to Spark guide on the wiki. Also we have a
separate list (dev@) for people looking to work on Spark itself.
I'd also recommend a smaller first project building new functionality in
Spark as a good starting point rather than adding a new algorithm right
away, since you learn a lot in the process of making your first
contribution.

On Friday, June 10, 2016, Ram Krishna  wrote:

> Hi All,
>
> How to add new ML algo in Spark MLlib.
>
> On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna  > wrote:
>
>> Hi All,
>>
>> I am new to this field and I want to implement a new ML algo using Spark
>> MLlib. What is the procedure?
>>
>> --
>> Regards,
>> Ram Krishna KT
>>
>>
>>
>>
>>
>>
>
>
> --
> Regards,
> Ram Krishna KT
>
>
>
>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Kazuaki Ishizaki
Hi
Yin Huai's slide is available at
http://www.slideshare.net/databricks/deep-dive-into-catalyst-apache-spark-20s-optimizer

Kazuaki Ishizaki



From:   Takeshi Yamamuro 
To: Srinivasan Hariharan02 
Cc: "user@spark.apache.org" 
Date:   2016/06/10 18:09
Subject:Re: Catalyst optimizer cpu/Io cost



Hi,

There is no way to retrieve that information in Spark.
In fact, the current optimizer only considers the byte size of outputs in
LogicalPlan.
Related code can be found in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L90
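As a rough illustration, the size estimate Catalyst tracks can be inspected from a DataFrame. A hedged sketch against the 1.6-era API, with the table name as a placeholder:

// sizeInBytes is the statistic the 1.x optimizer keeps per logical plan node.
val df = sqlContext.table("some_table")
println(df.queryExecution.optimizedPlan.statistics.sizeInBytes)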

If you want to know more about Catalyst, you can check Yin Huai's
slide from Spark Summit 2016:
https://spark-summit.org/2016/speakers/yin-huai/
# Note: the slide is not available now, and it seems it will be in a few 
weeks.

// maropu


On Fri, Jun 10, 2016 at 3:29 PM, Srinivasan Hariharan02 <
srinivasan_...@infosys.com> wrote:
Hi,

How can I get the Spark SQL query CPU and IO cost after optimizing for the
best logical plan? Is there any API to retrieve this information? Could
anyone point me to the code where the CPU and IO cost is actually computed in
the Catalyst module?
 
Regards,
Srinivasan Hariharan
+91-9940395830
 
 
 



-- 
---
Takeshi Yamamuro




Java MongoDB Spark Stratio (Please give me a hint)

2016-06-10 Thread Umair Janjua
Hi, here is my code.
When I run this program it gets stuck at the
sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
line and then it does not proceed any further. Nothing happens after that. What
should I do? How can I debug it? I am stuck here. Any hint would be
appreciated.

-
JavaSparkContext sc = new JavaSparkContext("local[*]", "test
spark-mongodb java");
SQLContext sqlContext = new SQLContext(sc);

Map<String, String> options = new HashMap<String, String>();
options.put("host", "host:port");
options.put("database", "database");
options.put("collection", "collectionName");
options.put("credentials", "username,database,password");

System.out.println("Check1");
DataFrame df =
sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();

sqlContext.sql("SELECT * FROM collectionName");
System.out.println("Check2");
df.count();
System.out.println("Check DataFrame Count: " + df.count());
System.out.println("Check3");
df.registerTempTable("collectionName");
df.show();

-

The above code only prints up to Check1 and then it gets stuck and
nothing happens.

Cheers


RE: Saving Parquet files to S3

2016-06-10 Thread Ankur Jain
Thanks maropu.. It worked…

From: Takeshi Yamamuro [mailto:linguin@gmail.com]
Sent: 10 June 2016 11:47 AM
To: Ankur Jain
Cc: user@spark.apache.org
Subject: Re: Saving Parquet files to S3

Hi,

You'd better off `setting parquet.block.size`.

// maropu

On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann 
> wrote:
I don't believe there's any way to output files of a specific size. What you can 
do is partition your data into a number of partitions such that the amount of 
data they each contain is around 1 GB.

On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain 
> wrote:
Hello Team,

I want to write parquet files to AWS S3, but I want to size each file to 1 GB.
Can someone please guide me on how I can achieve the same?

I am using AWS EMR with spark 1.6.1.

Thanks,
Ankur




--
---
Takeshi Yamamuro


Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All,

How do I add a new ML algorithm to Spark MLlib?

On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna 
wrote:

> Hi All,
>
> I am new to this field, and I want to implement a new ML algorithm using Spark
> MLlib. What is the procedure?
>
> --
> Regards,
> Ram Krishna KT
>
>
>
>
>
>


-- 
Regards,
Ram Krishna KT


RE: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Thanks Mich for your reply. I am curious about one thing: Hive uses a CBO 
which takes CPU cost into account. Does the Hive optimizer have any advantage over 
the Spark Catalyst optimizer?

Regards,
Srinivasan Hariharan
+91-9940395830

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, June 10, 2016 3:27 PM
To: Srinivasan Hariharan02 
Cc: Takeshi Yamamuro ; user@spark.apache.org
Subject: Re: Catalyst optimizer cpu/Io cost

In an SMP system such as Oracle or Sybase, the CBO will take into account LIO, 
PIO and CPU costing, or use some empirical costing.

In a distributed system like Spark, with so many nodes, that may not be that easy, 
or its contribution to the Catalyst decision may be subject to variations that 
make it not worthwhile.

HTH


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 10 June 2016 at 10:45, Srinivasan Hariharan02 
> wrote:
Thanks Takeshi. Is there any reason for not using I/O and CPU cost in the Catalyst 
optimizer? Some SQL engines that leverage Apache Calcite have a cost-based planner, 
like VolcanoPlanner, which takes CPU and I/O cost into account for plan optimization.

Regards,
Srinivasan Hariharan
+91-9940395830

From: Takeshi Yamamuro 
[mailto:linguin@gmail.com]
Sent: Friday, June 10, 2016 2:38 PM
To: Srinivasan Hariharan02 
>
Cc: user@spark.apache.org
Subject: Re: Catalyst optimizer cpu/Io cost

Hi,

There is no way to retrieve that information in Spark.
In fact, the current optimizer only considers the byte size of the outputs of a 
LogicalPlan.
Related code can be found in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L90

If you want to know more about catalyst, you can check the Yin Huai's slide in 
spark summit 2016.
https://spark-summit.org/2016/speakers/yin-huai/
# Note: the slide is not available now, and it seems it will be in a few weeks.

// maropu


On Fri, Jun 10, 2016 at 3:29 PM, Srinivasan Hariharan02 
> wrote:
Hi,

How can I get the CPU and I/O cost of a Spark SQL query after it has been optimized 
into the best logical plan? Is there any API to retrieve this information? Could anyone 
point me to the code where the CPU and I/O costs are actually computed in the catalyst module?

Regards,
Srinivasan Hariharan
+91-9940395830






--
---
Takeshi Yamamuro



Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Mich Talebzadeh
In an SMP system such as Oracle or Sybase, the CBO will take into account
LIO, PIO and CPU costing, or use some empirical costing.

In a distributed system like Spark, with so many nodes, that may not be that
easy, or its contribution to the Catalyst decision may be subject to
variations that make it not worthwhile.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 10 June 2016 at 10:45, Srinivasan Hariharan02  wrote:

> Thanks Takeshi. Is there any reason for not using I/O and CPU cost in the Catalyst
> optimizer? Some SQL engines that leverage Apache Calcite have a cost-based planner,
> like VolcanoPlanner, which takes CPU and I/O cost into account for plan
> optimization.
>
>
>
> *Regards,*
>
> *Srinivasan Hariharan*
>
> *+91-9940395830 <%2B91-9940395830>*
>
>
>
> *From:* Takeshi Yamamuro [mailto:linguin@gmail.com]
> *Sent:* Friday, June 10, 2016 2:38 PM
> *To:* Srinivasan Hariharan02 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Catalyst optimizer cpu/Io cost
>
>
>
> Hi,
>
>
>
> There is no way to retrieve that information in Spark.
>
> In fact, the current optimizer only considers the byte size of the outputs of a
> LogicalPlan.
>
> Related code can be found in
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L90
>
>
>
> If you want to know more about catalyst, you can check the Yin Huai's
> slide in spark summit 2016.
>
> https://spark-summit.org/2016/speakers/yin-huai/
>
> # Note: the slide is not available now, and it seems it will be in a few
> weeks.
>
>
>
> // maropu
>
>
>
>
>
> On Fri, Jun 10, 2016 at 3:29 PM, Srinivasan Hariharan02 <
> srinivasan_...@infosys.com> wrote:
>
> Hi,
>
>
>
> How can I get the CPU and I/O cost of a Spark SQL query after it has been
> optimized into the best logical plan? Is there any API to retrieve this
> information? Could anyone point me to the code where the CPU and I/O costs are
> actually computed in the catalyst module?
>
>
>
> *Regards,*
>
> *Srinivasan Hariharan*
>
> *+91-9940395830 <%2B91-9940395830>*
>
>
>
>
>
>
>
>
>
>
>
> --
>
> ---
> Takeshi Yamamuro
>


Re: JavaDStream to Dataframe: Java

2016-06-10 Thread Alexander Krasheninnikov
Hello!
While operating on the JavaDStream you may use the transform() or foreach()
methods, which give you access to the underlying RDD.

JavaDStream<Row> dataFrameStream =
    ctx.textFileStream("source").transform(new Function2<JavaRDD<String>, Time, JavaRDD<Row>>() {
        @Override
        public JavaRDD<Row> call(JavaRDD<String> incomingRdd, Time batchTime) throws Exception {
            // Get an API for operating on DataFrames
            HiveContext hiveContext = new HiveContext(incomingRdd.context());
            // create a schema for the DataFrame (declare columns)
            StructType schema = null;
            // map the incoming data into an RDD of DataFrame rows
            JavaRDD<Row> rowsRdd = incomingRdd.map(rddMember -> new GenericRow(100));
            // DataFrame creation
            DataFrame df = hiveContext.createDataFrame(rowsRdd, schema);

            // here you may perform some operations on df, or return it as a stream

            return df.toJavaRDD();
        }
    });



On Fri, Jun 3, 2016 at 5:44 PM, Zakaria Hili  wrote:

> Hi,
> I m newbie in spark and I want to ask you a simple question.
> I have an JavaDStream which contains data selected from sql database.
> something like (id, user, score ...)
> and I want to convert the JavaDStream to a dataframe .
>
> how can I do this with java ?
> Thank you
> ᐧ
>


Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Asfandyar Ashraf Malik
Hi,
I did not notice that I put it twice.
I changed that and ran my program but it still gives the same error:

java.lang.NoSuchMethodError:
org.apache.spark.sql.catalyst.ScalaReflection$.typeOfObject()Lscala/PartialFunction;


Cheers



2016-06-10 11:47 GMT+02:00 Alonso Isidoro Roman :

> Why is the spark-mongodb_2.11 dependency written twice in pom.xml?
>
> Alonso Isidoro Roman
> about.me/alonso.isidoro.roman
>
> 
>
> 2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik  >:
>
>> Hi,
>> I am using Stratio library to get MongoDB to work with Spark but I get
>> the following error:
>>
>> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection
>>
>> This is my code.
>>
>> ---
>> public static void main(String[] args) {
>>
>>     JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java");
>>     SQLContext sqlContext = new SQLContext(sc);
>>
>>     Map<String, String> options = new HashMap<String, String>();
>>     options.put("host", "xyz.mongolab.com:59107");
>>     options.put("database", "heroku_app3525385");
>>     options.put("collection", "datalog");
>>     options.put("credentials", "*,,");
>>
>>     DataFrame df =
>> sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
>>     df.registerTempTable("datalog");
>>     df.show();
>>
>> }
>>
>> ---
>> My pom file is as follows:
>>
>> <dependencies>
>>     <dependency>
>>         <groupId>org.apache.spark</groupId>
>>         <artifactId>spark-core_2.11</artifactId>
>>         <version>${spark.version}</version>
>>     </dependency>
>>     <dependency>
>>         <groupId>org.apache.spark</groupId>
>>         <artifactId>spark-catalyst_2.11</artifactId>
>>         <version>${spark.version}</version>
>>     </dependency>
>>     <dependency>
>>         <groupId>org.apache.spark</groupId>
>>         <artifactId>spark-sql_2.11</artifactId>
>>         <version>${spark.version}</version>
>>     </dependency>
>>     <dependency>
>>         <groupId>com.stratio.datasource</groupId>
>>         <artifactId>spark-mongodb_2.11</artifactId>
>>         <version>0.10.3</version>
>>     </dependency>
>>     <dependency>
>>         <groupId>com.stratio.datasource</groupId>
>>         <artifactId>spark-mongodb_2.11</artifactId>
>>         <version>0.10.3</version>
>>         <type>jar</type>
>>     </dependency>
>> </dependencies>
>>
>>
>> Regards
>>
>
>


RE: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Thanks Takeshi. Is there any reason for not using I/O and CPU cost in the Catalyst 
optimizer? Some SQL engines that leverage Apache Calcite have a cost-based planner, 
like VolcanoPlanner, which takes CPU and I/O cost into account for plan optimization.

Regards,
Srinivasan Hariharan
+91-9940395830

From: Takeshi Yamamuro [mailto:linguin@gmail.com]
Sent: Friday, June 10, 2016 2:38 PM
To: Srinivasan Hariharan02 
Cc: user@spark.apache.org
Subject: Re: Catalyst optimizer cpu/Io cost

Hi,

There is no way to retrieve that information in Spark.
In fact, the current optimizer only considers the byte size of the outputs of a 
LogicalPlan.
Related code can be found in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L90

If you want to know more about catalyst, you can check the Yin Huai's slide in 
spark summit 2016.
https://spark-summit.org/2016/speakers/yin-huai/
# Note: the slide is not available now, and it seems it will be in a few weeks.

// maropu


On Fri, Jun 10, 2016 at 3:29 PM, Srinivasan Hariharan02 
> wrote:
Hi,

How can I get the CPU and I/O cost of a Spark SQL query after it has been optimized 
into the best logical plan? Is there any API to retrieve this information? Could anyone 
point me to the code where the CPU and I/O costs are actually computed in the catalyst module?

Regards,
Srinivasan Hariharan
+91-9940395830






--
---
Takeshi Yamamuro


Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Alonso Isidoro Roman
Why is the spark-mongodb_2.11 dependency written twice in pom.xml?

Alonso Isidoro Roman
about.me/alonso.isidoro.roman


2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik :

> Hi,
> I am using Stratio library to get MongoDB to work with Spark but I get the
> following error:
>
> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection
>
> This is my code.
>
> ---
> public static void main(String[] args) {
>
>     JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java");
>     SQLContext sqlContext = new SQLContext(sc);
>
>     Map<String, String> options = new HashMap<String, String>();
>     options.put("host", "xyz.mongolab.com:59107");
>     options.put("database", "heroku_app3525385");
>     options.put("collection", "datalog");
>     options.put("credentials", "*,,");
>
>     DataFrame df =
> sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
>     df.registerTempTable("datalog");
>     df.show();
>
> }
>
> ---
> My pom file is as follows:
>
> <dependencies>
>     <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-core_2.11</artifactId>
>         <version>${spark.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-catalyst_2.11</artifactId>
>         <version>${spark.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.spark</groupId>
>         <artifactId>spark-sql_2.11</artifactId>
>         <version>${spark.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>com.stratio.datasource</groupId>
>         <artifactId>spark-mongodb_2.11</artifactId>
>         <version>0.10.3</version>
>     </dependency>
>     <dependency>
>         <groupId>com.stratio.datasource</groupId>
>         <artifactId>spark-mongodb_2.11</artifactId>
>         <version>0.10.3</version>
>         <type>jar</type>
>     </dependency>
> </dependencies>
>
>
> Regards
>


Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Asfandyar Ashraf Malik
Hi,
I am using Stratio library to get MongoDB to work with Spark but I get the
following error:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection

This is my code.
---
public static void main(String[] args) {

    JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java");
    SQLContext sqlContext = new SQLContext(sc);

    Map<String, String> options = new HashMap<String, String>();
    options.put("host", "xyz.mongolab.com:59107");
    options.put("database", "heroku_app3525385");
    options.put("collection", "datalog");
    options.put("credentials", "*,,");

    DataFrame df =
        sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
    df.registerTempTable("datalog");
    df.show();

}
---
My pom file is as follows:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-catalyst_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.stratio.datasource</groupId>
        <artifactId>spark-mongodb_2.11</artifactId>
        <version>0.10.3</version>
    </dependency>
    <dependency>
        <groupId>com.stratio.datasource</groupId>
        <artifactId>spark-mongodb_2.11</artifactId>
        <version>0.10.3</version>
        <type>jar</type>
    </dependency>
</dependencies>


Regards


Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Takeshi Yamamuro
Hi,

There is no way to retrieve that information in Spark.
In fact, the current optimizer only considers the byte size of the outputs of a
LogicalPlan.
Related code can be found in
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L90

If you want to know more about catalyst, you can check the Yin Huai's slide
in spark summit 2016.
https://spark-summit.org/2016/speakers/yin-huai/
# Note: the slide is not available now, and it seems it will be in a few
weeks.

// maropu
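
As a quick way to see the size-only estimate described above, a minimal sketch (assuming Spark 1.6/2.0, where LogicalPlan statistics expose only sizeInBytes; `df` stands for any DataFrame) reads it off the optimized plan. This same estimate is what feeds the spark.sql.autoBroadcastJoinThreshold decision.

  // queryExecution gives access to the analyzed and optimized logical plans
  val optimized = df.queryExecution.optimizedPlan
  println(optimized.statistics.sizeInBytes)   // the only cost statistic Catalyst tracks here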


On Fri, Jun 10, 2016 at 3:29 PM, Srinivasan Hariharan02 <
srinivasan_...@infosys.com> wrote:

> Hi,
>
> How can I get the CPU and I/O cost of a Spark SQL query after it has been
> optimized into the best logical plan? Is there any API to retrieve this
> information? Could anyone point me to the code where the CPU and I/O costs are
> actually computed in the catalyst module?
>
> *Regards,*
> *Srinivasan Hariharan*
> *+91-9940395830 <%2B91-9940395830>*
>
>
>
>



-- 
---
Takeshi Yamamuro


Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi,

I think if we try to see why Query 2 is faster than Query 1, then all the
answers will be given without beating around the bush. That is the right
way to find out what is happening and why.


Regards,
Gourav

On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue  wrote:

> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta 
> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2.
>
>
> ---
> ENVIRONMENT:
> ---
>
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --
> QUERY1:
> --
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --
> QUERY 2:
> --
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
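
Following Gavin's broadcast-join hunch, a hedged sketch of how one might inspect the plans and nudge Spark toward broadcasting the small table B (Spark 1.6.1 APIs; the threshold value is only an illustration, not a recommendation):

  val q = sqlContext.sql(
    "select A.PK, B.FK from A left outer join B on (A.PK = B.FK) where B.FK is not null")
  q.explain(true)   // prints the logical and physical plans, showing the chosen join strategy

  // raise the broadcast threshold so the 2-column, 3M-row table B may be broadcast
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (512 * 1024 * 1024).toString)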


Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
Hi Ram,

Not super certain what you are looking to do. Are you looking to add a new
algorithm to Spark MLlib for streaming or use Spark MLlib on streaming data?

Cheers,

Holden

On Friday, June 10, 2016, Ram Krishna  wrote:

> Hi All,
>
> I am new to this field, and I want to implement a new ML algorithm using Spark
> MLlib. What is the procedure?
>
> --
> Regards,
> Ram Krishna KT
>
>
>
>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All,

I am new to this field, and I want to implement a new ML algorithm using Spark
MLlib. What is the procedure?

-- 
Regards,
Ram Krishna KT


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello,

Looks like you are hitting this:
https://issues.apache.org/jira/browse/HIVE-11940.

Thanks,
Bijay



On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh 
wrote:

> can you provide a code snippet of how you are populating the target table
> from the temp table?
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 9 June 2016 at 23:43, swetha kasireddy 
> wrote:
>
>> No, I am reading the data from HDFS, transforming it, registering the
>> data in a temp table using registerTempTable and then doing an insert
>> overwrite using Spark SQL's hiveContext.
>>
>> On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> how are you doing the insert? from an existing table?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 9 June 2016 at 21:16, Stephen Boesch  wrote:
>>>
 How many workers (/cpu cores) are assigned to this job?

 2016-06-09 13:01 GMT-07:00 SRK :

> Hi,
>
> How can I insert data into 2000 partitions (directories) of ORC/parquet at a
> time using Spark SQL? It does not seem to be performant when I try to insert
> 2000 directories of Parquet/ORC using Spark SQL. Did anyone face this
> issue?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>
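
For context, a minimal sketch of the flow described above (read from HDFS, registerTempTable, then INSERT OVERWRITE via HiveContext). The source path, target_table, part_col and column names are hypothetical, and dynamic partitioning is assumed to be enabled as shown.

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  hiveContext.setConf("hive.exec.dynamic.partition", "true")
  hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

  val df = hiveContext.read.parquet("hdfs:///input/path")   // hypothetical source
  df.registerTempTable("temp_table")
  hiveContext.sql(
    "INSERT OVERWRITE TABLE target_table PARTITION (part_col) " +
    "SELECT col1, col2, part_col FROM temp_table")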


Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-10 Thread Daniel Haviv
I'm using EC2 instances 

Thank you.
Daniel

> On 9 Jun 2016, at 16:49, Gourav Sengupta  wrote:
> 
> Hi,
> 
> are you using EC2 instances or local cluster behind firewall.
> 
> 
> Regards,
> Gourav Sengupta
> 
>> On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv 
>>  wrote:
>> Hi,
>> I'm trying to create a table on s3a but I keep hitting the following error:
>> Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
>> MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: 
>> Unable to load AWS credentials from any provider in the chain)
>>  
>> I tried setting the s3a keys using the configuration object but I might be 
>> hitting SPARK-11364 :
>> conf.set("fs.s3a.access.key", accessKey)
>> conf.set("fs.s3a.secret.key", secretKey)
>> conf.set("spark.hadoop.fs.s3a.access.key",accessKey)
>> conf.set("spark.hadoop.fs.s3a.secret.key",secretKey)
>> val sc = new SparkContext(conf)
>>  
>> I tried setting these properties in hdfs-site.xml but I'm still getting this 
>> error.
>> Finally I tried to set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY 
>> environment variables but with no luck.
>>  
>> Any ideas on how to resolve this issue ?
>>  
>> Thank you.
>> Daniel
>> 
>> Thank you.
>> Daniel
> 
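
One workaround sometimes used when the constructor-time conf is not picked up (an assumption, not verified against this particular Cloudera build) is to set the keys on the SparkContext's Hadoop configuration after the context is created, reusing the same accessKey/secretKey values as above:

  val sc = new SparkContext(conf)
  sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
  sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)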


RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-10 Thread Ravi Aggarwal
Hi Ted,
Thanks for the reply.

Here is the code
Btw – df.count is running fine on the dataframe generated from this default source. 
I think it is something in the combination of the join and the hbase data source that 
is creating the issue, but I am not entirely sure.
I have also dumped the physical plans of both approaches (s3a/s3a join and 
s3a/hbase join); in case you want them, let me know.

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat}
import org.apache.hadoop.hbase._
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.execution.datasources.{OutputWriterFactory, 
FileFormat}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.slf4j.LoggerFactory

class DefaultSource extends SchemaRelationProvider with FileFormat {

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, 
String], schema: StructType) = {
new HBaseRelation(schema, parameters)(sqlContext)
  }

  def inferSchema(sparkSession: SparkSession,
  options: Map[String, String],
  files: Seq[FileStatus]): Option[StructType] = ???

  def prepareWrite(sparkSession: SparkSession,
   job: Job,
   options: Map[String, String],
   dataSchema: StructType): OutputWriterFactory = ???
}

object HBaseConfigurationUtil {
  lazy val logger = LoggerFactory.getLogger("HBaseConfigurationUtil")
  val hbaseConfiguration = (tableName: String, hbaseQuorum: String) => {
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, tableName)
conf.set("hbase.mapred.outputtable", tableName)
conf.set("hbase.zookeeper.quorum", hbaseQuorum)
conf
  }
}

class HBaseRelation(val schema: StructType, parameters: Map[String, String])
   (@transient val sqlContext: SQLContext) extends BaseRelation 
with TableScan {

  import sqlContext.sparkContext

  override def buildScan(): RDD[Row] = {

val bcDataSchema = sparkContext.broadcast(schema)

val tableName = parameters.get("path") match {
  case Some(t) => t
  case _ => throw new RuntimeException("Table name (path) not provided in 
parameters")
}

val hbaseQuorum = parameters.get("hbaseQuorum") match {
  case Some(s: String) => s
  case _ => throw new RuntimeException("hbaseQuorum not provided in 
options")
}

val rdd = sparkContext.newAPIHadoopRDD(
  HBaseConfigurationUtil.hbaseConfiguration(tableName, hbaseQuorum),
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result]
)

val rowRdd = rdd
  .map(tuple => tuple._2)
  .map { record =>

  val cells: java.util.List[Cell] = record.listCells()

      val splitRec =
        cells.toArray.foldLeft(Array(CellUtil.cloneRow(cells.get(0)))) { (a, b) =>
          a :+ CellUtil.cloneValue(b.asInstanceOf[Cell])
        }

  val keyFieldName = bcDataSchema.value.fields.filter(e => 
e.metadata.contains("isPrimary") && e.metadata.getBoolean("isPrimary"))(0).name

  val schemaArr = cells.toArray.foldLeft(Array(keyFieldName)) {(a, b) => {
val fieldCell = b.asInstanceOf[Cell]
a :+ new 
String(fieldCell.getQualifierArray).substring(fieldCell.getQualifierOffset, 
fieldCell.getQualifierLength + fieldCell.getQualifierOffset)
  }
  }

  val res = Map(schemaArr.zip(splitRec).toArray: _*)

  val recordFields = res.map(value => {
val colDataType =
  try {
bcDataSchema.value.fields.filter(_.name == value._1)(0).dataType
  } catch {
case e: ArrayIndexOutOfBoundsException => throw new 
RuntimeException("Schema doesn't contain the fieldname")
  }
CatalystTypeConverters.convertToScala(
  Cast(Literal(value._2), colDataType).eval(),
  colDataType)
  }).toArray
  Row(recordFields: _*)
}

rowRdd
  }
}
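
For reference, a hypothetical usage sketch of the relation provider above (the package name, table name and quorum are assumptions): the schema is supplied explicitly, "path" carries the HBase table name and "hbaseQuorum" the ZooKeeper quorum.

  // `schema` is the StructType describing the HBase columns, with the row key
  // field marked via the "isPrimary" metadata flag expected by buildScan()
  val df = sqlContext.read
    .format("com.example.hbase")            // package containing this DefaultSource (assumption)
    .schema(schema)
    .option("hbaseQuorum", "zk-host:2181")
    .load("my_hbase_table")                 // becomes the "path" parameter, i.e. the table name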

Thanks
Ravi

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Thursday, June 9, 2016 7:56 PM
To: Ravi Aggarwal 
Cc: user 
Subject: Re: OutOfMemory when doing joins in spark 2.0 while same code runs 
fine in spark 1.5.2

bq. Read data from hbase using custom DefaultSource (implemented using 
TableScan)

Did you use the DefaultSource from hbase-spark module in hbase master branch ?
If you wrote your own, mind sharing related code ?

Thanks

On Thu, Jun 9, 2016 at 2:53 AM, raaggarw 
> wrote:
Hi,

I was trying to port my code from Spark 1.5.2 to Spark 2.0; however, I faced
some OutOfMemory issues. On drilling down I could see that the OOM is because of
the join, because removing 

[no subject]

2016-06-10 Thread pooja mehta
Hi,

How can I use a Scala UDF from the Beeline client?
With the spark shell, we register our UDF like this:
sqlContext.udf.register(...).
What is the way to use a UDF from the Beeline client?

Thanks
Pooja
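
For reference, a minimal sketch of the spark-shell style registration mentioned above (Spark 1.6; the UDF and table names are hypothetical). Over Beeline against the Spark Thrift Server, one commonly used route instead is to package the function as a Hive UDF jar and register it with ADD JAR / CREATE TEMPORARY FUNCTION.

  sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)
  sqlContext.sql("SELECT toUpper(name) FROM people").show()   // 'people' is a hypothetical table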


Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Hi,,

How can I get spark sql query cpu and Io cost after optimizing for the best 
logical plan. Is there any api to retrieve this information?. If anyone point 
me to the code where actually cpu and Io cost computed in catalyst module.

Regards,
Srinivasan Hariharan
+91-9940395830





Re: Saving Parquet files to S3

2016-06-10 Thread Takeshi Yamamuro
Hi,

You'd be better off setting `parquet.block.size`.

// maropu

On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann  wrote:

> I don't believe there's any way to output files of a specific size. What
> you can do is partition your data into a number of partitions such that the
> amount of data they each contain is around 1 GB.
>
> On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain  wrote:
>
>> Hello Team,
>>
>>
>>
>> I want to write parquet files to AWS S3, but I want to size each file
>> to 1 GB.
>>
>> Can someone please guide me on how I can achieve the same?
>>
>>
>>
>> I am using AWS EMR with spark 1.6.1.
>>
>>
>>
>> Thanks,
>>
>> Ankur
>>
>
>


-- 
---
Takeshi Yamamuro
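
Following Daniel Siegmann's suggestion above, a hedged sketch of sizing output files indirectly by choosing a partition count; `df` stands for the DataFrame being written, and the byte figures are assumptions to be replaced with an estimate of the real dataset size.

  val targetFileBytes = 1024L * 1024 * 1024            // ~1 GB per output file
  val estimatedTotalBytes = 50L * 1024 * 1024 * 1024   // rough estimate of the dataset size
  val numPartitions = math.max(1, (estimatedTotalBytes / targetFileBytes).toInt)

  df.repartition(numPartitions).write.parquet("s3a://my-bucket/output")   // hypothetical path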