Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
Hi,

This behaviour seems to be expected because you must ensure `b + zero() = b`.
In your case `b + null = null` breaks this rule.
This is the same in v1.6.1.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
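
As a point of comparison, here is a minimal sketch (in Scala; the (String, Int)
input type and the object name are assumptions) of an aggregator whose zero()
really is an identity for reduce/merge, so `b + zero() = b` holds. The 2.0 API
also expects the encoder methods:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums the Int part of (key, value) pairs; zero() = 0 is a true identity.
object SumAgg extends Aggregator[(String, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (String, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}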

// maropu


On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela  wrote:

> Sometimes, the BUF for the aggregator may depend on the actual input.. and
> while this passes the responsibility to handle null in merge/reduce to the
> developer, it sounds fine to me if he is the one who put null in zero()
> anyway.
> Now, it seems that the aggregation is skipped entirely when zero() =
> null. Not sure if that was the behaviour in 1.6
>
> Is this behaviour wanted ?
>
> Thanks,
> Amit
>
> Aggregator example:
>
> public static class Agg extends Aggregator<Tuple2<String, Integer>, Integer,
> Integer> {
>
>   @Override
>   public Integer zero() {
> return null;
>   }
>
>   @Override
>   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
> if (b == null) {
>   b = 0;
> }
> return b + a._2();
>   }
>
>   @Override
>   public Integer merge(Integer b1, Integer b2) {
> if (b1 == null) {
>   return b2;
> } else if (b2 == null) {
>   return b1;
> } else {
>   return b1 + b2;
> }
>   }
>
>


-- 
---
Takeshi Yamamuro


add multiple columns

2016-06-26 Thread pseudo oduesp
Hi, how can I add multiple columns to a data frame?

withColumn allows adding one column, but when I have multiple columns do I
have to loop over each one?

thanks


Re: add multiple columns

2016-06-26 Thread ndjido
Hi,

I'm afraid you have to loop... The update of the logical plan is getting faster
in Spark, though.

Cheers, 
Ardo.

Sent from my iPhone

> On 26 Jun 2016, at 14:20, pseudo oduesp  wrote:
> 
> Hi who i can add multiple columns to data frame 
> 
> withcolumns allow to add one columns but when you  have multiple  i have to 
> loop on eache columns ?
> 
> thanks 




alter table with hive context

2016-06-26 Thread pseudo oduesp
Hi,
How can I alter a table by adding new columns to it in HiveContext?


Re: add multiple columns

2016-06-26 Thread ayan guha
Can you share an example? You may want to write a SQL statement to add the
columns.
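
For instance, a rough sketch (the DataFrame `df` and the new column names and
values are made up here) of the two usual approaches, looping with withColumn
or building a single select:

import org.apache.spark.sql.functions.lit

// Hypothetical new columns to add, as (name, default value) pairs.
val newCols = Seq("col_a" -> 0, "col_b" -> 1)

// Option 1: loop, one withColumn call per new column.
val viaLoop = newCols.foldLeft(df) { case (acc, (name, value)) =>
  acc.withColumn(name, lit(value))
}

// Option 2: a single select that keeps the existing columns and appends the new ones.
val viaSelect = df.select(
  df.columns.map(df(_)) ++ newCols.map { case (name, value) => lit(value).as(name) }: _*
)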

On Sun, Jun 26, 2016 at 11:02 PM,  wrote:

> Hi guy!
>
> I'm afraid you have to loop...The update of the Logical Plan is getting
> faster on Spark.
>
> Cheers,
> Ardo.
>
> Sent from my iPhone
>
> > On 26 Jun 2016, at 14:20, pseudo oduesp  wrote:
> >
> > Hi who i can add multiple columns to data frame
> >
> > withcolumns allow to add one columns but when you  have multiple  i have
> to loop on eache columns ?
> >
> > thanks
>
>
>


-- 
Best Regards,
Ayan Guha


Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
No, TypedAggregateExpression that uses Aggregator#zero is different between
v2.0 and v1.6.
v2.0:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91
v1.6:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115

// maropu


On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela  wrote:

> This "if (value == null)" condition you point to exists in 1.6 branch as
> well, so that's probably not the reason.
>
> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro 
> wrote:
>
>> Whatever it is, this is expected; if an initial value is null, spark
>> codegen removes all the aggregates.
>> See:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199
>>
>> // maropu
>>
>> On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela  wrote:
>>
>>> Not sure about what's the rule in case of `b + null = null` but the same
>>> code works perfectly in 1.6.1, just tried it..
>>>
>>> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Hi,

 This behaviour seems to be expected because you must ensure `b + zero()
 = b`
 The your case `b + null = null` breaks this rule.
 This is the same with v1.6.1.
 See:
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57

 // maropu


 On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela 
 wrote:

> Sometimes, the BUF for the aggregator may depend on the actual input..
> and while this passes the responsibility to handle null in merge/reduce to
> the developer, it sounds fine to me if he is the one who put null in 
> zero()
> anyway.
> Now, it seems that the aggregation is skipped entirely when zero() =
> null. Not sure if that was the behaviour in 1.6
>
> Is this behaviour wanted ?
>
> Thanks,
> Amit
>
> Aggregator example:
>
> public static class Agg extends Aggregator<Tuple2<String, Integer>,
> Integer, Integer> {
>
>   @Override
>   public Integer zero() {
> return null;
>   }
>
>   @Override
>   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
> if (b == null) {
>   b = 0;
> }
> return b + a._2();
>   }
>
>   @Override
>   public Integer merge(Integer b1, Integer b2) {
> if (b1 == null) {
>   return b2;
> } else if (b2 == null) {
>   return b1;
> } else {
>   return b1 + b2;
> }
>   }
>
>


 --
 ---
 Takeshi Yamamuro

>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


-- 
---
Takeshi Yamamuro


Running of Continuous Aggregation example

2016-06-26 Thread Chang Lim
Has anyone been able to run the code from "The Future of Real-Time in Spark",
slide 24, "Continuous Aggregation"?

Specifically, the line: stream("jdbc:mysql//..."), 

Using Spark 2.0 preview build, I am getting the error when writing to MySQL:
Exception in thread "main" java.lang.UnsupportedOperationException: Data
source jdbc does not support streamed writing
at
org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:201)

My code:
val logsDF = sparkSession.read.format("json")
  .stream("file:///xxx/xxx/spark-2.0.0-preview-bin-hadoop2.4/examples/src/main/resources/people.json")
val logsDS = logsDF.as[Person]
logsDS.groupBy("name").sum("age").write.format("jdbc")
  .option("checkpointLocation", "/xxx/xxx/temp")
  .startStream("jdbc:mysql//localhost/test")
  }

Looking at the Spark DataSource.scala source code, it looks like only
ParquetFileFormat is supported. Am I missing something? Which data sources
support streamed writes? Is the example code referring to 2.0 features?

Thanks in advance for your help.

Chang







Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
I'm not sure what the rule is in the case of `b + null = null`, but the same
code works perfectly in 1.6.1; I just tried it.

On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro 
wrote:

> Hi,
>
> This behaviour seems to be expected because you must ensure `b + zero() =
> b`
> The your case `b + null = null` breaks this rule.
> This is the same with v1.6.1.
> See:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>
> // maropu
>
>
> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela  wrote:
>
>> Sometimes, the BUF for the aggregator may depend on the actual input..
>> and while this passes the responsibility to handle null in merge/reduce to
>> the developer, it sounds fine to me if he is the one who put null in zero()
>> anyway.
>> Now, it seems that the aggregation is skipped entirely when zero() =
>> null. Not sure if that was the behaviour in 1.6
>>
>> Is this behaviour wanted ?
>>
>> Thanks,
>> Amit
>>
>> Aggregator example:
>>
>> public static class Agg extends Aggregator<Tuple2<String, Integer>, Integer,
>> Integer> {
>>
>>   @Override
>>   public Integer zero() {
>> return null;
>>   }
>>
>>   @Override
>>   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
>> if (b == null) {
>>   b = 0;
>> }
>> return b + a._2();
>>   }
>>
>>   @Override
>>   public Integer merge(Integer b1, Integer b2) {
>> if (b1 == null) {
>>   return b2;
>> } else if (b2 == null) {
>>   return b1;
>> } else {
>>   return b1 + b2;
>> }
>>   }
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
Whatever the case, this is expected; if an initial value is null, Spark
codegen removes all the aggregates.
See:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199

// maropu

On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela  wrote:

> Not sure about what's the rule in case of `b + null = null` but the same
> code works perfectly in 1.6.1, just tried it..
>
> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro 
> wrote:
>
>> Hi,
>>
>> This behaviour seems to be expected because you must ensure `b + zero() =
>> b`
>> The your case `b + null = null` breaks this rule.
>> This is the same with v1.6.1.
>> See:
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>>
>> // maropu
>>
>>
>> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela  wrote:
>>
>>> Sometimes, the BUF for the aggregator may depend on the actual input..
>>> and while this passes the responsibility to handle null in merge/reduce to
>>> the developer, it sounds fine to me if he is the one who put null in zero()
>>> anyway.
>>> Now, it seems that the aggregation is skipped entirely when zero() =
>>> null. Not sure if that was the behaviour in 1.6
>>>
>>> Is this behaviour wanted ?
>>>
>>> Thanks,
>>> Amit
>>>
>>> Aggregator example:
>>>
>>> public static class Agg extends Aggregator<Tuple2<String, Integer>,
>>> Integer, Integer> {
>>>
>>>   @Override
>>>   public Integer zero() {
>>> return null;
>>>   }
>>>
>>>   @Override
>>>   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
>>> if (b == null) {
>>>   b = 0;
>>> }
>>> return b + a._2();
>>>   }
>>>
>>>   @Override
>>>   public Integer merge(Integer b1, Integer b2) {
>>> if (b1 == null) {
>>>   return b2;
>>> } else if (b2 == null) {
>>>   return b1;
>>> } else {
>>>   return b1 + b2;
>>> }
>>>   }
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


-- 
---
Takeshi Yamamuro


Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
This "if (value == null)" condition you point to exists in 1.6 branch as
well, so that's probably not the reason.

On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro 
wrote:

> Whatever it is, this is expected; if an initial value is null, spark
> codegen removes all the aggregates.
> See:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199
>
> // maropu
>
> On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela  wrote:
>
>> Not sure about what's the rule in case of `b + null = null` but the same
>> code works perfectly in 1.6.1, just tried it..
>>
>> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> This behaviour seems to be expected because you must ensure `b + zero()
>>> = b`
>>> The your case `b + null = null` breaks this rule.
>>> This is the same with v1.6.1.
>>> See:
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>>>
>>> // maropu
>>>
>>>
>>> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela  wrote:
>>>
 Sometimes, the BUF for the aggregator may depend on the actual input..
 and while this passes the responsibility to handle null in merge/reduce to
 the developer, it sounds fine to me if he is the one who put null in zero()
 anyway.
 Now, it seems that the aggregation is skipped entirely when zero() =
 null. Not sure if that was the behaviour in 1.6

 Is this behaviour wanted ?

 Thanks,
 Amit

 Aggregator example:

 public static class Agg extends Aggregator<Tuple2<String, Integer>,
 Integer, Integer> {

   @Override
   public Integer zero() {
 return null;
   }

   @Override
   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
 if (b == null) {
   b = 0;
 }
 return b + a._2();
   }

   @Override
   public Integer merge(Integer b1, Integer b2) {
 if (b1 == null) {
   return b2;
 } else if (b2 == null) {
   return b1;
 } else {
   return b1 + b2;
 }
   }


>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


RE: Logging trait in Spark 2.0

2016-06-26 Thread Paolo Patierno
Yes ... the same here ... I'd like to know the best way for adding logging in a 
custom receiver for Spark Streaming 2.0
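
A minimal sketch (the class name and messages are placeholders) of logging from
a custom receiver with plain SLF4J, which is already on Spark's classpath,
instead of the now-internal org.apache.spark.internal.Logging trait:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.slf4j.LoggerFactory

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  // Transient and lazy so the logger is recreated on the executor after deserialization.
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  def onStart(): Unit = {
    log.info("Starting custom receiver")
    // start the thread(s) that read data and call store(...)
  }

  def onStop(): Unit = {
    log.info("Stopping custom receiver")
  }
}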

Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
Microsoft Azure Advisor
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience

From: jonathaka...@gmail.com
Date: Fri, 24 Jun 2016 20:56:40 +
Subject: Re: Logging trait in Spark 2.0
To: yuzhih...@gmail.com; ppatie...@live.com
CC: user@spark.apache.org

Ted, how is that thread related to Paolo's question?
On Fri, Jun 24, 2016 at 1:50 PM Ted Yu  wrote:
See this related thread:
http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+

On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno  wrote:



Hi,

developing a Spark Streaming custom receiver I noticed that the Logging trait 
isn't accessible anymore in Spark 2.0.

trait Logging in package internal cannot be accessed in package 
org.apache.spark.internal

For developing a custom receiver, what is the preferred way of logging? Just
using a log4j dependency as in any other Java/Scala library/application?

Thanks,
Paolo

Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
Microsoft Azure Advisor
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience


  

Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Amit Sela
Sometimes the BUF for the aggregator may depend on the actual input, and
while this passes the responsibility for handling null in merge/reduce to the
developer, that seems fine to me since the developer is the one who put null
in zero() anyway.
Now it seems that the aggregation is skipped entirely when zero() = null.
I am not sure whether that was the behaviour in 1.6.

Is this behaviour intended?

Thanks,
Amit

Aggregator example:

public static class Agg extends Aggregator<Tuple2<String, Integer>, Integer, Integer> {
  // NOTE: the generic type parameters were stripped in the archive; the
  // Tuple2<String, Integer> input type is assumed from the use of a._2() below.
  // finish() and the encoder methods were not part of the original post.

  @Override
  public Integer zero() {
    return null;
  }

  @Override
  public Integer reduce(Integer b, Tuple2<String, Integer> a) {
    if (b == null) {
      b = 0;
    }
    return b + a._2();
  }

  @Override
  public Integer merge(Integer b1, Integer b2) {
    if (b1 == null) {
      return b2;
    } else if (b2 == null) {
      return b1;
    } else {
      return b1 + b2;
    }
  }
}


Re: Spark Task is not created

2016-06-26 Thread Ravindra
I have a lot of Spark tests, and the failure is not deterministic. It can
happen at any action that I run, but the logs given below are common. I
work around it by repartitioning, coalescing, etc. so that I don't get
"Submitting 2 missing tasks from ShuffleMapStage", basically ensuring that
there is only one task.

I doubt this has anything to do with some property.

In the UI I don't see any failure; just that a few tasks have completed and
the last one is yet to be created.

Thanks,

Ravi


On Sun, Jun 26, 2016, 11:33 Akhil Das  wrote:

> Would be good if you can paste the piece of code that you are executing.
>
> On Sun, Jun 26, 2016 at 11:21 AM, Ravindra 
> wrote:
>
>> Hi All,
>>
>> May be I need to just set some property or its a known issue. My spark
>> application hangs in test environment whenever I see following message -
>>
>> 16/06/26 11:13:34 INFO DAGScheduler: *Submitting 2 missing tasks from
>> ShuffleMapStage* 145 (MapPartitionsRDD[590] at rdd at
>> WriteDataFramesDecorator.scala:61)
>> 16/06/26 11:13:34 INFO TaskSchedulerImpl: Adding task set 145.0 with 2
>> tasks
>> 16/06/26 11:13:34 INFO TaskSetManager: Starting task 0.0 in stage 145.0
>> (TID 186, localhost, PROCESS_LOCAL, 2389 bytes)
>> 16/06/26 11:13:34 INFO Executor: Running task 0.0 in stage 145.0 (TID 186)
>> 16/06/26 11:13:34 INFO BlockManager: Found block rdd_575_0 locally
>> 16/06/26 11:13:34 INFO GenerateMutableProjection: Code generated in 3.796
>> ms
>> 16/06/26 11:13:34 INFO Executor: Finished task 0.0 in stage 145.0 (TID
>> 186). 2578 bytes result sent to driver
>> 16/06/26 11:13:34 INFO TaskSetManager: Finished task 0.0 in stage 145.0
>> (TID 186) in 24 ms on localhost (1/2)
>>
>> It happens with any action. The application works fine whenever I notice 
>> "*Submitting
>> 1 missing tasks from ShuffleMapStage". *For this I need to tweak the
>> plan like using repartition, coalesce etc but this also doesn't help
>> always.
>>
>> Some of the Spark properties are as given below -
>>
>> Name                                  Value
>> spark.app.id                          local-1466914377931
>> spark.app.name                        SparkTest
>> spark.cores.max                       3
>> spark.default.parallelism             1
>> spark.driver.allowMultipleContexts    true
>> spark.executor.id                     driver
>> spark.externalBlockStore.folderName   spark-050049bd-c058-4035-bc3d-2e73a08e8d0c
>> spark.master                          local[2]
>> spark.scheduler.mode                  FIFO
>> spark.ui.enabled                      true
>>
>> Thanks,
>> Ravi.
>>
>>
>
>
> --
> Cheers!
>
>


Re: alter table with hive context

2016-06-26 Thread Mich Talebzadeh
-- create the HiveContext

scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6387fb09

-- use the test database

scala> HiveContext.sql("use test")
res8: org.apache.spark.sql.DataFrame = [result: string]

-- create a table

scala> var sqltext: String = ""
sqltext: String = ""

scala> sqltext =
     |   """
     |   CREATE TABLE IF NOT EXISTS test.sales2
     |   (
     |     PROD_ID        bigint       ,
     |     CUST_ID        bigint       ,
     |     TIME_ID        timestamp    ,
     |     CHANNEL_ID     bigint       ,
     |     PROMO_ID       bigint       ,
     |     QUANTITY_SOLD  decimal(10)  ,
     |     AMOUNT_SOLD    decimal(10)
     |   )
     |   CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
     |   STORED AS ORC
     |   TBLPROPERTIES (
     |     "orc.compress"="SNAPPY",
     |     "orc.create.index"="true",
     |     "orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
     |     "orc.bloom.filter.fpp"="0.05",
     |     "orc.stripe.size"="268435456",
     |     "orc.row.index.stride"="1")
     |   """

scala> HiveContext.sql(sqltext)
res12: org.apache.spark.sql.DataFrame = [result: string]

-- show this table's schema

scala> HiveContext.sql("desc sales2").show
+-------------+-------------+-------+
|     col_name|    data_type|comment|
+-------------+-------------+-------+
|      prod_id|       bigint|   null|
|      cust_id|       bigint|   null|
|      time_id|    timestamp|   null|
|   channel_id|       bigint|   null|
|     promo_id|       bigint|   null|
|quantity_sold|decimal(10,0)|   null|
|  amount_sold|decimal(10,0)|   null|
+-------------+-------------+-------+

-- add a new column to this table

scala> HiveContext.sql("ALTER TABLE test.sales2 add columns (new_col varchar(30))")
res17: org.apache.spark.sql.DataFrame = [result: string]

scala> HiveContext.sql("desc sales2").show
+-------------+-------------+-------+
|     col_name|    data_type|comment|
+-------------+-------------+-------+
|      prod_id|       bigint|   null|
|      cust_id|       bigint|   null|
|      time_id|    timestamp|   null|
|   channel_id|       bigint|   null|
|     promo_id|       bigint|   null|
|quantity_sold|decimal(10,0)|   null|
|  amount_sold|decimal(10,0)|   null|
|      new_col|  varchar(30)|   null|
+-------------+-------------+-------+


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 26 June 2016 at 13:34, pseudo oduesp  wrote:

> Hi,
> how i can alter table by adiing new columns  to table in hivecontext ?
>
>


Running JavaBased Implementation of StreamingKmeans Spark

2016-06-26 Thread Biplob Biswas
Hi,

Something is wrong with my Spark subscription, so I can't see the responses
properly on Nabble. I subscribed from a different id; hopefully that solves it,
and I am putting my question here again.

I implemented the streaming k-means example provided on the Spark website, but
in Java.
The full implementation is here,

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like the one below:

---
Time: 1466176935000 ms
---

Also, I have two directories:
"D:\spark\streaming example\Data Sets\training"
"D:\spark\streaming example\Data Sets\test"

and inside these directories I have one file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points
and the test data 60 data points.

I have already tried adding the data files after starting the code, but there
is still no output. I am also not getting any exception, so it is hard for me
to debug.

I am very new to the spark systems and any help is highly appreciated.

Thank you so much
Biplob Biswas


What is the explanation of "ConvertToUnsafe" in "Physical Plan"

2016-06-26 Thread Mich Talebzadeh
Hi,

In Spark's Physical Plan what is the explanation for ConvertToUnsafe?

Example:

scala> sorted.filter($"prod_id" ===13).explain
== Physical Plan ==
Filter (prod_id#10L = 13)
+- Sort [prod_id#10L ASC,cust_id#11L ASC,time_id#12 ASC,channel_id#13L
ASC,promo_id#14L ASC], true, 0
   +- ConvertToUnsafe
  +- Exchange rangepartitioning(prod_id#10L ASC,cust_id#11L
ASC,time_id#12 ASC,channel_id#13L ASC,promo_id#14L ASC,200), None
 +- HiveTableScan
[prod_id#10L,cust_id#11L,time_id#12,channel_id#13L,promo_id#14L],
MetastoreRelation oraclehadoop, sales2, None


My inclination is that it is a temporary construct, like a tempTable, created
as part of the physical plan?


Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark 1.6.1: Unexpected partition behavior?

2016-06-26 Thread Randy Gelhausen
Sorry, please ignore the above.

I now see that I called coalesce on a different reference than the one I used
to register the table.
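
For the archive, the fix amounts to registering and caching the coalesced
reference itself, roughly:

val singlePartition = enriched_web_logs.coalesce(1)

singlePartition.write.format("parquet").mode("overwrite").save(bucket + "derived/enriched_web_logs")
singlePartition.registerTempTable("enriched_web_logs")
sqlContext.cacheTable("enriched_web_logs")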

On Sun, Jun 26, 2016 at 6:34 PM, Randy Gelhausen  wrote:

> 
> val enriched_web_logs = sqlContext.sql("""
> select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
> source_host, log
> from web_logs
> left outer join (select distinct node, address from nodes) b on source_ip
> = address
> """)
>
> enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
> enriched_web_logs.registerTempTable("enriched_web_logs")
> sqlContext.cacheTable("enriched_web_logs")
> 
>
> There are only 524 records in the resulting table, and I have explicitly
> attempted to coalesce into 1 partition.
>
> Yet my Spark UI shows 200 (mostly empty) partitions:
> RDD Name: In-memory table enriched_web_logs
> Storage Level: Memory Deserialized 1x Replicated
> Cached Partitions: 200
> Fraction Cached: 100%
> Size in Memory: 22.0 KB
> Size in ExternalBlockStore: 0.0 B
> Size on Disk: 0.0 B
>
> Why would there be 200 partitions despite the coalesce call?
>


Spark 1.6.1: Unexpected partition behavior?

2016-06-26 Thread Randy Gelhausen

val enriched_web_logs = sqlContext.sql("""
select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
source_host, log
from web_logs
left outer join (select distinct node, address from nodes) b on source_ip =
address
""")
enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
enriched_web_logs.registerTempTable("enriched_web_logs")
sqlContext.cacheTable("enriched_web_logs")


There are only 524 records in the resulting table, and I have explicitly
attempted to coalesce into 1 partition.

Yet my Spark UI shows 200 (mostly empty) partitions:
RDD NameStorage LevelCached PartitionsFraction CachedSize in MemorySize in
ExternalBlockStoreSize on Disk
In-memory table enriched_web_logs
 Memory
Deserialized 1x Replicated 200 100% 22.0 KB 0.0 B 0.0 BWhy would there be
200 partitions despite the coalesce call?


Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread sychungd
Hi,

Thanks for the reply. It's a Java web service residing in a JBoss container.

HY Chung
Best regards,

S.Y. Chung 鍾學毅
F14MITD
Taiwan Semiconductor Manufacturing Company, Ltd.
Tel: 06-5056688 Ext: 734-6325


From: Mich Talebzadeh
Date: 2016/06/24 08:05 PM
To: sychu...@tsmc.com
Cc: "user @spark"
Subject: [Spam][SMG] Re: problem running spark with yarn-client not using spark-submit




Hi,
Trying to run spark with yarn-client not using spark-submit here

what are you using to submit the job? spark-shell, spark-sql  or anything
else


Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com




On 24 June 2016 at 12:01,  wrote:

  Hello guys,

  Trying to run spark with yarn-client not using spark-submit here but the
  jobs kept failed while AM launching executor.
  The error collected by yarn like below.
  Looks like some environment setting is missing?
  Could someone help me out with this.

  Thanks  in advance!
  HY Chung

  Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
  MaxPermSize=256m; support was removed in 8.0
  Exception in thread "main" java.lang.NoClassDefFoundError:
  org/apache/spark/Logging
                   at java.lang.ClassLoader.defineClass1(Native Method)
                   at java.lang.ClassLoader.defineClass
  (ClassLoader.java:760)
                   at java.security.SecureClassLoader.defineClass
  (SecureClassLoader.java:142)
                   at java.net.URLClassLoader.defineClass
  (URLClassLoader.java:455)
                   at java.net.URLClassLoader.access$100
  (URLClassLoader.java:73)
                   at java.net.URLClassLoader$1.run
  (URLClassLoader.java:367)
                   at java.net.URLClassLoader$1.run
  (URLClassLoader.java:361)
                   at java.security.AccessController.doPrivileged(Native
  Method)
                   at java.net.URLClassLoader.findClass
  (URLClassLoader.java:360)
                   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
                   at sun.misc.Launcher$AppClassLoader.loadClass
  (Launcher.java:308)
                   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
                   at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
  (ApplicationMaster.scala:674)
                   at org.apache.spark.deploy.yarn.ExecutorLauncher.main
  (ApplicationMaster.scala)
  Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
                   at 

Re: problem running spark with yarn-client not using spark-submit

2016-06-26 Thread Saisai Shao
It means several jars are missing from the YARN container environment. If you
want to submit your application through some way other than spark-submit, you
have to take care of all the environment setup yourself. Since we don't know
the implementation of your Java web service, it is hard to provide specific
advice. Basically, the Spark jar is missing on the AM side, so you have to put
the Spark jars into the distributed cache ahead of time.
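
One option, sketched below with placeholder paths and class names, is to drive
the submission from the web service through SparkLauncher, which sets up the
YARN environment much as spark-submit does; alternatively, stage the Spark
assembly on HDFS and point spark.yarn.jar (spark.yarn.jars in 2.0) at it:

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                                   // assumed install location
  .setAppResource("/path/to/your-application.jar")              // placeholder
  .setMainClass("com.example.YourSparkJob")                     // placeholder
  .setMaster("yarn-client")
  .setConf("spark.yarn.jar", "hdfs:///libs/spark-assembly.jar") // pre-staged Spark assembly
  .launch()

process.waitFor()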

On Mon, Jun 27, 2016 at 11:19 AM,  wrote:

> Hi,
>
> Thanks for reply. It's a java web service resides in a jboss container.
>
> HY Chung
> Best regards,
>
> S.Y. Chung 鍾學毅
> F14MITD
> Taiwan Semiconductor Manufacturing Company, Ltd.
> Tel: 06-5056688 Ext: 734-6325
>
>
> From: Mich Talebzadeh
> Date: 2016/06/24 08:05 PM
> To: sychu...@tsmc.com
> Cc: "user @spark"
> Subject: [Spam][SMG] Re: problem running spark with yarn-client not using spark-submit
>
>
>
>
> Hi,
> Trying to run spark with yarn-client not using spark-submit here
>
> what are you using to submit the job? spark-shell, spark-sql  or anything
> else
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 24 June 2016 at 12:01,  wrote:
>
>   Hello guys,
>
>   Trying to run spark with yarn-client not using spark-submit here but the
>   jobs kept failed while AM launching executor.
>   The error collected by yarn like below.
>   Looks like some environment setting is missing?
>   Could someone help me out with this.
>
>   Thanks  in advance!
>   HY Chung
>
>   Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
>   MaxPermSize=256m; support was removed in 8.0
>   Exception in thread "main" java.lang.NoClassDefFoundError:
>   org/apache/spark/Logging
>at java.lang.ClassLoader.defineClass1(Native Method)
>at java.lang.ClassLoader.defineClass
>   (ClassLoader.java:760)
>at java.security.SecureClassLoader.defineClass
>   (SecureClassLoader.java:142)
>at java.net.URLClassLoader.defineClass
>   (URLClassLoader.java:455)
>at java.net.URLClassLoader.access$100
>   (URLClassLoader.java:73)
>at java.net.URLClassLoader$1.run
>   (URLClassLoader.java:367)
>at java.net.URLClassLoader$1.run
>   (URLClassLoader.java:361)
>at java.security.AccessController.doPrivileged(Native
>   Method)
>at java.net.URLClassLoader.findClass
>   (URLClassLoader.java:360)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>at sun.misc.Launcher$AppClassLoader.loadClass
>   (Launcher.java:308)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
>   (ApplicationMaster.scala:674)
>at org.apache.spark.deploy.yarn.ExecutorLauncher.main
>   (ApplicationMaster.scala)
>   Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
>at java.net.URLClassLoader$1.run
>   (URLClassLoader.java:372)
>at java.net.URLClassLoader$1.run
>   (URLClassLoader.java:361)
>  

How to convert a Random Forest model built in R to a similar model in Spark

2016-06-26 Thread Neha Mehta
Hi All,

Requesting help with the problem mentioned in the mail below. I have an
existing random forest model in R which needs to be deployed on Spark. I am
trying to recreate the model in Spark but am facing the problem described below.

Thanks,
Neha
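
For reference, a hedged sketch of the rough Spark ML counterparts of the R
parameters quoted below (mtry=3, ntree=500, importance=TRUE, maxnodes=NULL);
there is no exact one-to-one mapping, since Spark exposes featureSubsetStrategy
and maxDepth rather than mtry and maxnodes, and the column names here are assumed:

import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("label")                   // assumed column names
  .setFeaturesCol("features")
  .setNumTrees(500)                       // ntree = 500
  .setFeatureSubsetStrategy("onethird")   // closest analogue of mtry for regression
  .setMaxDepth(30)                        // maxnodes = NULL ~ grow deep trees; Spark caps depth at 30

val model = rf.fit(trainingDF)            // trainingDF assumed to have label/features columns
println(model.featureImportances)         // roughly what importance=TRUE reports in R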

On Jun 24, 2016 5:10 PM, wrote:
>
> Hi Sun,
>
> I am trying to build a model in Spark. Here are the parameters which were
used in R for creating the model, I am unable to figure out how to specify
a similar input to the random forest regressor in Spark so that I get a
similar model in Spark.
>
> https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
>
> mytry=3
>
> ntree=500
>
> importance=TRUE
>
> maxnodes = NULL
>
> On May 31, 2016 7:03 AM, "Sun Rui"  wrote:
>>
>> I mean train random forest model (not using R) and use it for prediction
together using Spark ML.
>>
>>> On May 30, 2016, at 20:15, Neha Mehta  wrote:
>>>
>>> Thanks Sujeet.. will try it out.
>>>
>>> Hi Sun,
>>>
>>> Can you please tell me what did you mean by "Maybe you can try using
the existing random forest model" ? did you mean creating the model again
using Spark MLLIB?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>>

 From: sujeet jog 
 Date: Mon, May 30, 2016 at 4:52 PM
 Subject: Re: Can we use existing R model in Spark
 To: Sun Rui 
 Cc: Neha Mehta , user 


 Try to invoke a R script from Spark using rdd pipe method , get the
work done & and receive the model back in RDD.


 for ex :-
 .   rdd.pipe("")


 On Mon, May 30, 2016 at 3:57 PM, Sun Rui  wrote:
>
> Unfortunately no. Spark does not support loading external modes (for
examples, PMML) for now.
> Maybe you can try using the existing random forest model in Spark.
>
>> On May 30, 2016, at 18:21, Neha Mehta  wrote:
>>
>> Hi,
>>
>> I have an existing random forest model created using R. I want to
use that to predict values on Spark. Is it possible to do the same? If yes,
then how?
>>
>> Thanks & Regards,
>> Neha
>
>


>>>
>>


Unsubscribe

2016-06-26 Thread kchen
Unsubscribe

Difference between Dataframe and RDD Persisting

2016-06-26 Thread Brandon White
What is the difference between persisting a DataFrame and an RDD? When I
persist my RDD, the UI says it takes 50 GB or more of memory. When I persist
my DataFrame, the UI says it takes 9 GB or less.

Does the DataFrame not persist the actual content? Is it better / faster to
persist an RDD when doing a lot of filter, map, and collect operations?
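
For what it's worth, a small sketch of the comparison (rdd and df assumed to
hold the same data); DataFrame persistence uses Spark SQL's compressed,
columnar in-memory format rather than deserialized Java objects, which is
usually why the reported size is much smaller:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)   // caches deserialized Java objects
df.persist(StorageLevel.MEMORY_ONLY)    // caches compressed columnar binary rows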


Re: Spark 2.0 Streaming and Event Time

2016-06-26 Thread Chang Lim
Here is an update to my question:
=

Tathagata Das 
Jun 9
to me

Event time is part of windowed aggregation API. See my slides -
https://www.slideshare.net/mobile/databricks/a-deep-dive-into-structured-streaming

Let me know if it helps you to find it. Keeping it short as I am on the
mobile.
==

Chang Lim 
Jun 9
to Tathagata 
Hi TD,

Thanks for the reply. But I was thinking of "sorting the events by logical
time", more like the "reorder" operation the Microsoft presenter introduced in
her slides yesterday.

The "group by" is aggregation but does not help in processing events based
on event time ordering.
=
Tathagata Das
Jun 9
to me 
Aah that's something still out of scope right now.


Chang Lim 
Jun 9

to Tathagata 
Wonder if we can get Microsoft to contribute "reorder" back to Spark :)

Thanks for your excellent work in Spark.

===
Michael Armbrust 
Jun 10
to user, me 
There is no special setting for event time (though we will be adding one for
setting a watermark in 2.1 to allow us to reduce the amount of state that
needs to be kept around). Just window/groupBy on the column that is your
event time.
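
A minimal sketch of that (the streaming DataFrame `events` and its columns are
assumed):

import org.apache.spark.sql.functions.{col, window}

val counts = events
  .groupBy(window(col("eventTime"), "10 minutes"), col("deviceId"))
  .count()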


Chang Lim 
Jun 10

to Michael, user 
Yes, now I realize that. I did exchange emails with TD on this topic. The
"reorder" function from the Microsoft presentation at Spark Summit would be a
good addition to Spark. Is this feature on the roadmap?


