Re: how to set the assignee in JIRA please?

2017-07-25 Thread 萝卜丝炒饭
I agree that old PRs should not be closed without a good reason.
As you suggested, reviewing them is the way to get them closed.


Thanks.




 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/26 12:15:45
To: "萝卜丝炒饭"<1427357...@qq.com>;
Cc: "user @spark";
Subject: Re: how to set the assignee in JIRA please?


That one is waiting for a review, as you can see. There have been a few discussions about 
this. I am personally against closing a PR only because it is old.


I have periodically made PRs to close other inactive PRs (e.g., ones not responsive 
to review comments or Jenkins failures).


So, I guess most such PRs are simply waiting for a review.




How about drawing attention to the PRs you see value in, with a proper 
explanation and investigation, or trying to review them yourself, rather than 
just suggesting they be closed?


I think anyone in this community can review.







2017-07-26 11:58 GMT+09:00 萝卜丝炒饭 <1427357...@qq.com>:
Hi all,


I found that some PRs were created a year ago, and the last comments were made several 
months ago.
No one has closed or rejected them.
Take 6880, for example: is it just left like this?




 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:25:28
To: "萝卜丝炒饭"<1427357...@qq.com>;
Cc: "user @spark";"Marcelo Vanzin";
Subject: Re: how to set the assignee in JIRA please?




I think that's described in the link I used - 
http://spark.apache.org/contributing.html.

On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:
Another issue about contribution.


After a pull request is created, what should the creator do next, please?
Who will close it, please?


 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:15:49
To: "Marcelo Vanzin";
Cc: "user";"萝卜丝炒饭"<1427357...@qq.com>;

Subject: Re: how to set the assignee in JIRA please?


I see. In any event, it sounds like it is not required in order to work on an issue - 
http://spark.apache.org/contributing.html .
"... There is no need to be the Assignee of the JIRA to work on it, though you 
are welcome to comment that you have begun work.."

I was just wondering out of curiosity. It should not be a big deal 
anyway.



Thanks for the details.








2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
 > However, I see some JIRAs are assigned to someone time to time. Were those
 > mistakes or would you mind if I ask when someone is assigned?
 
 I'm not sure if there are any guidelines of when to assign; since
 there has been an agreement that bugs should remain unassigned I don't
 think I've personally done it, although I have seen others do it. In
 general I'd say it's ok if there's a good justification for it (e.g.
 "this is a large change and this person who is an active contributor
 will work on it"), but in the general case should be avoided.
 
 I agree it's a little confusing, especially comparing to other
 projects, but it's how it's been done for a couple of years at least
 (or at least what I have understood).
 
 
 --
 Marcelo

Re: how to set the assignee in JIRA please?

2017-07-25 Thread Hyukjin Kwon
That one is waiting for a review, as you can see. There have been a few discussions
about this. I am personally against closing a PR only because it is old.

I have periodically made PRs to close other inactive PRs (e.g., ones not
responsive to review comments or Jenkins failures).

So, I guess most such PRs are simply waiting for a review.


How about drawing attention to the PRs you see value in, with a proper
explanation and investigation, or trying to review them yourself, rather than
just suggesting they be closed?

I think anyone in this community can review.



2017-07-26 11:58 GMT+09:00 萝卜丝炒饭 <1427357...@qq.com>:

> Hi all,
>
> I found that some PRs were created a year ago, and the last comments were made
> several months ago.
> No one has closed or rejected them.
> Take 6880, for example: is it just left like this?
>
>
> ---Original---
> *From:* "Hyukjin Kwon"
> *Date:* 2017/7/25 09:25:28
> *To:* "萝卜丝炒饭"<1427357...@qq.com>;
> *Cc:* "user @spark";"Marcelo Vanzin"<
> van...@cloudera.com>;
> *Subject:* Re: how to set the assignee in JIRA please?
>
> I think that's described in the link I used - http://spark.apache.org/cont
> ributing.html.
>
> On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:
>
> Another issue about contribution.
>
> After a pull request is created, what should creator do next please?
> Who will close it please?
>
> ---Original---
> *From:* "Hyukjin Kwon"
> *Date:* 2017/7/25 09:15:49
> *To:* "Marcelo Vanzin";
> *Cc:* "user";"萝卜丝炒饭"<1427357...@qq.com>;
> *Subject:* Re: how to set the assignee in JIRA please?
>
> I see. In any event, it sounds not required to work on an issue -
> http://spark.apache.org/contributing.html .
>
> "... There is no need to be the Assignee of the JIRA to work on it, though
> you are welcome to comment that you have begun work.."
>
> and I was just wondering out of my curiosity. It should be not a big deal
> anyway.
>
>
> Thanks for the details.
>
>
>
> 2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :
>
>> On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon 
>> wrote:
>> > However, I see some JIRAs are assigned to someone time to time. Were
>> those
>> > mistakes or would you mind if I ask when someone is assigned?
>>
>> I'm not sure if there are any guidelines of when to assign; since
>> there has been an agreement that bugs should remain unassigned I don't
>> think I've personally done it, although I have seen others do it. In
>> general I'd say it's ok if there's a good justification for it (e.g.
>> "this is a large change and this person who is an active contributor
>> will work on it"), but in the general case should be avoided.
>>
>> I agree it's a little confusing, especially comparing to other
>> projects, but it's how it's been done for a couple of years at least
>> (or at least what I have understood).
>>
>>
>> --
>> Marcelo
>>
>
>
>


Re: how to set the assignee in JIRA please?

2017-07-25 Thread 萝卜丝炒饭
Hi all,


I found that some PRs were created a year ago, and the last comments were made several 
months ago.
No one has closed or rejected them.
Take 6880, for example: is it just left like this?




 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:25:28
To: "萝卜丝炒饭"<1427357...@qq.com>;
Cc: "user @spark";"Marcelo Vanzin";
Subject: Re: how to set the assignee in JIRA please?


I think that's described in the link I used - 
http://spark.apache.org/contributing.html.

On 25 Jul 2017 10:22 am, "萝卜丝炒饭" <1427357...@qq.com> wrote:
Another issue about contribution.


After a pull request is created, what should the creator do next, please?
Who will close it, please?


 
---Original---
From: "Hyukjin Kwon"
Date: 2017/7/25 09:15:49
To: "Marcelo Vanzin";
Cc: "user";"萝卜丝炒饭"<1427357...@qq.com>;

Subject: Re: how to set the assignee in JIRA please?


I see. In any event, it sounds like it is not required in order to work on an issue - 
http://spark.apache.org/contributing.html .
"... There is no need to be the Assignee of the JIRA to work on it, though you 
are welcome to comment that you have begun work.."

I was just wondering out of curiosity. It should not be a big deal 
anyway.



Thanks for the details.








2017-07-25 10:09 GMT+09:00 Marcelo Vanzin :
On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon  wrote:
 > However, I see some JIRAs are assigned to someone time to time. Were those
 > mistakes or would you mind if I ask when someone is assigned?
 
 I'm not sure if there are any guidelines of when to assign; since
 there has been an agreement that bugs should remain unassigned I don't
 think I've personally done it, although I have seen others do it. In
 general I'd say it's ok if there's a good justification for it (e.g.
 "this is a large change and this person who is an active contributor
 will work on it"), but in the general case should be avoided.
 
 I agree it's a little confusing, especially comparing to other
 projects, but it's how it's been done for a couple of years at least
 (or at least what I have understood).
 
 
 --
 Marcelo

Need some help around a Spark Error

2017-07-25 Thread Debabrata Ghosh
Hi,
  While executing a SparkSQL query, I am hitting the
following error. I wonder if you could please help me with a possible cause
and resolution. Here is the stack trace:

07/25/2017 02:41:58 PM - DataPrep.py 323 - __main__ - ERROR - An error
occurred while calling o49.sql.

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 12.2 failed 4 times, most recent failure: Lost task 0.3 in stage
12.2 (TID 2242, bicservices.hdp.com): ExecutorLostFailure (executor 25
exited caused by one of the running tasks) Reason: Slave lost

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1946)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:84)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:201)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)

at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:276)

at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)

at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)

at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)

at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)

at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)

at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)

at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)

at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)

at py4j.Gateway.invoke(Gateway.java:259)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

Cheers,

Debu


[SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-25 Thread Priyank Shrivastava
I am trying to write key-values to redis using a DataStreamWriter object
using the pyspark structured streaming APIs. I am using Spark 2.2.

Since the Foreach sink is not supported for Python, I am trying to find some
alternatives.

One alternative is to write a separate Scala module only to push data into
redis using foreach; ForeachWriter is supported in Scala. BUT this doesn't
seem like an efficient approach and adds deployment overhead because now I
will have to support Scala in my app.
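
For reference, a minimal sketch of what that Scala module might look like (the
Jedis client, the host/port, and the "key"/"value" column names below are my
assumptions, not working code):

import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

// Sketch of a Scala ForeachWriter that pushes key/value rows into Redis.
class RedisForeachWriter(host: String, port: Int) extends ForeachWriter[Row] {
  @transient private var jedis: Jedis = _

  override def open(partitionId: Long, version: Long): Boolean = {
    jedis = new Jedis(host, port)   // one connection per partition/epoch
    true
  }

  override def process(row: Row): Unit =
    jedis.set(row.getAs[String]("key"), row.getAs[String]("value"))

  override def close(errorOrNull: Throwable): Unit =
    if (jedis != null) jedis.close()
}

// Usage sketch:
// df.writeStream.foreach(new RedisForeachWriter("localhost", 6379)).start()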

Another approach is obviously to use Scala instead of python, which is fine
but I want to make sure that I absolutely cannot use python for this
problem before I take this path.

Would appreciate some feedback and alternative design approaches for this
problem.

Thanks.


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
Cool~ thanks Kang! I will check and let you know.
Sorry for the delay, as there is an urgent customer issue today.

Best
Martin

2017-07-24 22:15 GMT-07:00 周康 :

> * If the file exists but is a directory rather than a regular file, does
> * not exist but cannot be created, or cannot be opened for any other
> * reason then a FileNotFoundException is thrown.
>
> After looking into FileOutputStream I saw this annotation. So you can check the 
> executor node first (maybe no permission, no space, or not enough file 
> descriptors).
>
>
> 2017-07-25 13:05 GMT+08:00 周康 :
>
>> You can also check whether there is enough space left on the executor node to
>> store the shuffle files.
>>
>> 2017-07-25 13:01 GMT+08:00 周康 :
>>
>>> First, Spark will handle task failures, so if the job ended normally this error
>>> can be ignored.
>>> Second, when using BypassMergeSortShuffleWriter, it first writes the
>>> data file and then writes an index file.
>>> You can check for "Failed to delete temporary index file at" or "fail to
>>> rename file" in the related executor node's log file.
>>>
>>> 2017-07-25 0:33 GMT+08:00 Martin Peng :
>>>
 Is there anyone who can shed some light on this issue?

 Thanks
 Martin

 2017-07-21 18:58 GMT-07:00 Martin Peng :

> Hi,
>
> I have several Spark jobs, including both batch jobs and streaming jobs, to
> process the system logs and analyze them. We are using Kafka as the pipeline
> to connect the jobs.
>
> Once we upgraded to Spark 2.1.0 + Spark Kafka Streaming 010, I found that some
> of the jobs (both batch and streaming) throw the exceptions below randomly
> (either after several hours of running or after just 20 minutes). Can anyone
> give me some suggestions about how to figure out the real root cause?
> (Looks like the google results are not very useful...)
>
> Thanks,
> Martin
>
> 00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task
> 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0):
> java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/201
> 60924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a
> -4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-
> 4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-c356
> 43e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6
> (No such file or directory)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native
> Method)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.open(
> FileOutputStream.java:270)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
> 00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
> 00:30:04,510 WARN  - at org.apache.spark.shuffle.Index
> ShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlo
> ckResolver.scala:144)
> 00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.
> BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWri
> ter.java:128)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
> ffleMapTask.runTask(ShuffleMapTask.scala:96)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Shu
> ffleMapTask.runTask(ShuffleMapTask.scala:53)
> 00:30:04,510 WARN  - at org.apache.spark.scheduler.Tas
> k.run(Task.scala:99)
> 00:30:04,510 WARN  - at org.apache.spark.executor.Exec
> utor$TaskRunner.run(Executor.scala:282)
> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
> lExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 00:30:04,510 WARN  - at java.util.concurrent.ThreadPoo
> lExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)
>
> 00:30:04,580 INFO  - Driver stacktrace:
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAn
> dIndependentStages(DAGScheduler.scala:1435)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
> Scheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> 00:30:04,580 INFO  - scala.collection.mutable.Resiz
> ableArray$class.foreach(ResizableArray.scala:59)
> 00:30:04,580 INFO  - scala.collection.mutable.Array
> Buffer.foreach(ArrayBuffer.scala:48)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
> Scheduler.abortStage(DAGScheduler.scala:1422)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
> Scheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> 00:30:04,580 INFO  - org.apache.spark.scheduler.DAG
> 

Re: some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread Sathish Kumaran Vairavelu
Just a thought. SQL itself is a DSL. Why DSL on top of another DSL?
On Tue, Jul 25, 2017 at 4:47 AM kant kodali  wrote:

> Hi All,
>
> I am thinking to express Spark SQL using JSON in the following the way.
>
> For Example:
>
> *Query using Spark DSL*
>
> DS.filter(col("name").equalTo("john"))
> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
> hours"), df1.col("hourlyPay"))
> .agg(sum("hourlyPay").as("total"));
>
>
> *Query using JSON*
>
>
>
> [inline image of the JSON query omitted]
> The Goal is to design a DSL in JSON such that users can and express SPARK
> SQL queries in JSON so users can send Spark SQL queries over rest and get
> the results out. Now, I am sure there are BI tools and notebooks like
> Zeppelin that can accomplish the desired behavior however I believe there
> maybe group of users who don't want to use those BI tools or notebooks
> instead they want all the communication from front end to back end using
> API's.
>
> Also another goal would be the DSL design in JSON should closely mimic the
> underlying Spark SQL DSL.
>
> Please feel free to provide some feedback or criticize to whatever extent
> you like!
>
> Thanks!
>
>
>


Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Burak Yavuz
I think Kant meant time windowing functions. You can use

`window(TIMESTAMP, '24 hours', '24 hours')`
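
As a rough sketch, the whole query as a raw SQL string could look like this
(the "events" view name is an assumption; df1 and the column names come from
the DSL example quoted below):

df1.createOrReplaceTempView("events")

val result = spark.sql("""
  SELECT window(`TIMESTAMP`, '24 hours', '24 hours') AS time_window,
         hourlyPay,
         sum(hourlyPay) AS total
  FROM events
  WHERE name = 'john'
  GROUP BY window(`TIMESTAMP`, '24 hours', '24 hours'), hourlyPay
""")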

On Tue, Jul 25, 2017 at 9:26 AM, Keith Chapman 
wrote:

> Here is an example of a window lead function,
>
> select *, lead(someColumn1) over ( partition by someColumn2 order by
> someColumn13 asc nulls first) as someName  from someTable
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Tue, Jul 25, 2017 at 9:15 AM, kant kodali  wrote:
>
>> How do I Specify windowInterval and slideInteval using raw sql string?
>>
>> On Tue, Jul 25, 2017 at 8:52 AM, Keith Chapman 
>> wrote:
>>
>>> You could issue a raw sql query to spark, there is no particular
>>> advantage or disadvantage of doing so. Spark would build a logical plan
>>> from the raw sql (or DSL) and optimize on that. Ideally you would end up
>>> with the same physical plan, irrespective of it been written in raw sql /
>>> DSL.
>>>
>>> Regards,
>>> Keith.
>>>
>>> http://keith-chapman.com
>>>
>>> On Tue, Jul 25, 2017 at 12:50 AM, kant kodali 
>>> wrote:
>>>
 HI All,

 I just want to run some spark structured streaming Job similar to this

 DS.filter(col("name").equalTo("john"))
 .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
 hours"), df1.col("hourlyPay"))
 .agg(sum("hourlyPay").as("total"));


 I am wondering if I can express the above query in raw sql string?

 If so how would that look like and what are some of the disadvantages of 
 using raw sql query vs spark DSL?


 Thanks!


>>>
>>
>


Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: 
http://spark.apache.org/third-party-projects.html

Matei

> On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft  
> wrote:
> 
> There’s a number of real-world open source Spark applications in the sciences:
> 
> genomics:
> 
> github.com/bigdatagenomics/adam <— core is scala, has py/r wrappers
> https://github.com/broadinstitute/gatk <— core is java
> https://github.com/hail-is/hail <— core is scala, mostly used through python 
> wrappers
> 
> neuroscience:
> 
> https://github.com/thunder-project/thunder#using-with-spark <— pyspark
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
>> On Jul 25, 2017, at 8:09 AM, Jörn Franke  wrote:
>> 
>> Continuous integration (Travis, jenkins) and reporting on unit tests, 
>> integration tests etc for each source code version.
>> 
>> On 25. Jul 2017, at 16:58, Adaryl Wakefield  
>> wrote:
>> 
>>> ci+reporting? I’ve never heard of that term before. What is that?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>  
>>>  
>>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>>> Sent: Tuesday, July 25, 2017 8:31 AM
>>> To: Adaryl Wakefield 
>>> Cc: user@spark.apache.org
>>> Subject: Re: real world spark code
>>>  
>>> Look for the ones that have unit and integration tests as well as a 
>>> ci+reporting on code quality.
>>>  
>>> All the others are just toy examples. Well should be :)
>>> 
>>> On 25. Jul 2017, at 01:08, Adaryl Wakefield  
>>> wrote:
>>> 
>>> Anybody know of publicly available GitHub repos of real world Spark 
>>> applications written in scala?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
Here is an example of a window lead function,

select *, lead(someColumn1) over ( partition by someColumn2 order by
someColumn13 asc nulls first) as someName  from someTable
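
The same lead() expression in the DataFrame DSL would be roughly the following
(a sketch; the DataFrame and column names just mirror the SQL above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// lead() over a window partitioned by someColumn2, ordered by someColumn13
// with nulls first
val w = Window.partitionBy("someColumn2")
              .orderBy(col("someColumn13").asc_nulls_first)
val result = someTable.withColumn("someName", lead("someColumn1", 1).over(w))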

Regards,
Keith.

http://keith-chapman.com

On Tue, Jul 25, 2017 at 9:15 AM, kant kodali  wrote:

> How do I Specify windowInterval and slideInteval using raw sql string?
>
> On Tue, Jul 25, 2017 at 8:52 AM, Keith Chapman 
> wrote:
>
>> You could issue a raw sql query to spark, there is no particular
>> advantage or disadvantage of doing so. Spark would build a logical plan
>> from the raw sql (or DSL) and optimize on that. Ideally you would end up
>> with the same physical plan, irrespective of it been written in raw sql /
>> DSL.
>>
>> Regards,
>> Keith.
>>
>> http://keith-chapman.com
>>
>> On Tue, Jul 25, 2017 at 12:50 AM, kant kodali  wrote:
>>
>>> HI All,
>>>
>>> I just want to run some spark structured streaming Job similar to this
>>>
>>> DS.filter(col("name").equalTo("john"))
>>> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
>>> hours"), df1.col("hourlyPay"))
>>> .agg(sum("hourlyPay").as("total"));
>>>
>>>
>>> I am wondering if I can express the above query in raw sql string?
>>>
>>> If so how would that look like and what are some of the disadvantages of 
>>> using raw sql query vs spark DSL?
>>>
>>>
>>> Thanks!
>>>
>>>
>>
>


Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread kant kodali
How do I specify the windowInterval and slideInterval using a raw SQL string?

On Tue, Jul 25, 2017 at 8:52 AM, Keith Chapman 
wrote:

> You could issue a raw sql query to spark, there is no particular advantage
> or disadvantage of doing so. Spark would build a logical plan from the raw
> sql (or DSL) and optimize on that. Ideally you would end up with the same
> physical plan, irrespective of it been written in raw sql / DSL.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Tue, Jul 25, 2017 at 12:50 AM, kant kodali  wrote:
>
>> HI All,
>>
>> I just want to run some spark structured streaming Job similar to this
>>
>> DS.filter(col("name").equalTo("john"))
>> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
>> hours"), df1.col("hourlyPay"))
>> .agg(sum("hourlyPay").as("total"));
>>
>>
>> I am wondering if I can express the above query in raw sql string?
>>
>> If so how would that look like and what are some of the disadvantages of 
>> using raw sql query vs spark DSL?
>>
>>
>> Thanks!
>>
>>
>


Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
You could issue a raw SQL query to Spark; there is no particular advantage
or disadvantage to doing so. Spark will build a logical plan from the raw
SQL (or DSL) and optimize on that. Ideally you would end up with the same
physical plan, irrespective of whether it was written in raw SQL or the DSL.
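
A quick way to confirm this is to compare the physical plans yourself (a
sketch; "events" is just an assumed view name and df an assumed DataFrame):

import org.apache.spark.sql.functions._

df.createOrReplaceTempView("events")
// Both explain() calls should print essentially the same physical plan.
spark.sql("SELECT name, sum(hourlyPay) AS total FROM events GROUP BY name").explain()
df.groupBy("name").agg(sum("hourlyPay").as("total")).explain()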

Regards,
Keith.

http://keith-chapman.com

On Tue, Jul 25, 2017 at 12:50 AM, kant kodali  wrote:

> HI All,
>
> I just want to run some spark structured streaming Job similar to this
>
> DS.filter(col("name").equalTo("john"))
> .groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours", "24 
> hours"), df1.col("hourlyPay"))
> .agg(sum("hourlyPay").as("total"));
>
>
> I am wondering if I can express the above query in raw sql string?
>
> If so how would that look like and what are some of the disadvantages of 
> using raw sql query vs spark DSL?
>
>
> Thanks!
>
>


Re: real world spark code

2017-07-25 Thread Frank Austin Nothaft
There’s a number of real-world open source Spark applications in the sciences:

genomics:

github.com/bigdatagenomics/adam <— core is scala, has py/r wrappers
https://github.com/broadinstitute/gatk <— core is java
https://github.com/hail-is/hail <— core is scala, mostly used through python wrappers

neuroscience:

https://github.com/thunder-project/thunder#using-with-spark <— pyspark

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

> On Jul 25, 2017, at 8:09 AM, Jörn Franke  wrote:
> 
> Continuous integration (Travis, jenkins) and reporting on unit tests, 
> integration tests etc for each source code version.
> 
> On 25. Jul 2017, at 16:58, Adaryl Wakefield  > wrote:
> 
>> ci+reporting? I’ve never heard of that term before. What is that?
>>  
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>> www.massstreet.net 
>> www.linkedin.com/in/bobwakefieldmba 
>> 
>> Twitter: @BobLovesData 
>>  
>>  
>> From: Jörn Franke [mailto:jornfra...@gmail.com 
>> ] 
>> Sent: Tuesday, July 25, 2017 8:31 AM
>> To: Adaryl Wakefield > >
>> Cc: user@spark.apache.org 
>> Subject: Re: real world spark code
>>  
>> Look for the ones that have unit and integration tests as well as a 
>> ci+reporting on code quality.
>>  
>> All the others are just toy examples. Well should be :)
>> 
>> On 25. Jul 2017, at 01:08, Adaryl Wakefield > > wrote:
>> 
>> Anybody know of publicly available GitHub repos of real world Spark 
>> applications written in scala?
>>  
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>> www.massstreet.net 
>> www.linkedin.com/in/bobwakefieldmba 
>> 
>> Twitter: @BobLovesData 


Re: real world spark code

2017-07-25 Thread Jörn Franke
Continuous integration (Travis, jenkins) and reporting on unit tests, 
integration tests etc for each source code version.

> On 25. Jul 2017, at 16:58, Adaryl Wakefield  
> wrote:
> 
> ci+reporting? I’ve never heard of that term before. What is that?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
>  
> From: Jörn Franke [mailto:jornfra...@gmail.com] 
> Sent: Tuesday, July 25, 2017 8:31 AM
> To: Adaryl Wakefield 
> Cc: user@spark.apache.org
> Subject: Re: real world spark code
>  
> Look for the ones that have unit and integration tests as well as a 
> ci+reporting on code quality.
>  
> All the others are just toy examples. Well should be :)
> 
> On 25. Jul 2017, at 01:08, Adaryl Wakefield  
> wrote:
> 
> Anybody know of publicly available GitHub repos of real world Spark 
> applications written in scala?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
>  


Re: How to list only errors for a stage

2017-07-25 Thread jeff saremi
Thank you. That helps


From: 周康 
Sent: Monday, July 24, 2017 8:04:51 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How to list only errors for a stage

Maybe you can click the header of the Status column in the Tasks section; then failed 
tasks will appear first.

2017-07-25 10:02 GMT+08:00 jeff saremi 
>:

On the Spark status UI you can click Stages on the menu and see Active (and 
completed stages). For the active stage, you can see Succeeded/Total and a 
count of failed ones in paranthesis.

I'm looking for a way to go straight to the failed tasks and list the errors. 
Currently I must go into the details of that stage, then scroll down to the Tasks 
section, change the number of records per page so I can see everything, and 
click Go. There is no way that I can just filter the ones with errors.

thanks

jeff




RE: real world spark code

2017-07-25 Thread Adaryl Wakefield
ci+reporting? I’ve never heard of that term before. What is that?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, July 25, 2017 8:31 AM
To: Adaryl Wakefield 
Cc: user@spark.apache.org
Subject: Re: real world spark code

Look for the ones that have unit and integration tests as well as a 
ci+reporting on code quality.

All the others are just toy examples. Well should be :)

On 25. Jul 2017, at 01:08, Adaryl Wakefield 
> wrote:
Anybody know of publicly available GitHub repos of real world Spark 
applications written in scala?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Thanks Xiayun Sun and Robin East for your inputs. It makes sense to me.

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Jul 25, 2017 at 9:55 AM, Xiayun Sun  wrote:

> I'm guessing by "part files" you mean files like part-r-0. These are
> actually different from hadoop "block size", which is the value actually
> used in partitions.
>
> Looks like your hdfs block size is the default 128mb: 258.2GB in 500 part
> files -> around 528mb per part file -> each part file would take a little
> more than 4 blocks -> total that would be around 2000 blocks.
>
> You cannot set partitions fewer than blocks, that's why 500 does not work
> (spark doc here
> )
>
> The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. By default, Spark creates
>> one partition for each block of the file (blocks being 128MB by default in
>> HDFS), but you can also ask for a higher number of partitions by passing a
>> larger value. Note that you cannot have fewer partitions than blocks.
>
>
>
> Now as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
> InputFormat.getSplits, and the desired number of partitions is at most a hint,
> so the final result can be a bit different.
>
> On 25 July 2017 at 19:54, Gokula Krishnan D  wrote:
>
>> Excuse for the too many mails on this post.
>>
>> found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D 
>> wrote:
>>
>>> In addition to that,
>>>
>>> tried to read the same file with 3000 partitions but it used 3070
>>> partitions. And took more time than previous please refer the attachment.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan* (Gokul)*
>>>
>>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D 
>>> wrote:
>>>
 Hello All,

 I have a HDFS file with approx. *1.5 Billion records* with 500 Part
 files (258.2GB Size) and when I tried to execute the following I could see
 that it used 2290 tasks but it supposed to be 500 as like HDFS File, isn't
 it?

 val inputFile = 
 val inputRdd = sc.textFile(inputFile)
 inputRdd.count()

 I was hoping that I can do the same with the fewer partitions so tried
 the following

 val inputFile = 
 val inputrddnqew = sc.textFile(inputFile,500)
 inputRddNew.count()

 But still it used 2290 tasks.

 As per scala doc, it supposed use as like the HDFS file i.e 500.

 It would be great if you could throw some insight on this.

 Thanks & Regards,
 Gokula Krishnan* (Gokul)*

>>>
>>>
>>
>
>
> --
> Xiayun Sun
>
> Home is behind, the world ahead,
> and there are many paths to tread
> through shadows to the edge of night,
> until the stars are all alight.
>


Re: Nested JSON Handling in Spark 2.1

2017-07-25 Thread Patrick
Hi,

I would appreciate some suggestions on how to achieve top-level struct
treatment for nested JSON when stored in Parquet format, or any other
solutions for best performance using Spark 2.1.
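
For context, the kind of explicit flattening I am considering before the
Parquet write looks roughly like this (a sketch only; the column names follow
the inferred schema quoted below and the output path is a placeholder):

import org.apache.spark.sql.functions._

// Explode the nested arrays and promote the nested struct to a top-level column.
val flattened = df
  .select(col("header.deviceId").as("deviceId"),
          col("header.sessionId").as("sessionId"),
          explode(col("payload.deviceObjects")).as("deviceObject"))
  .select(col("deviceId"), col("sessionId"),
          explode(col("deviceObject.additionalPayload")).as("ap"))
  .select(col("deviceId"), col("sessionId"), col("ap.data.a").as("a"))

flattened.write.parquet("/path/to/flattened")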

Thanks in advance


On Mon, Jul 24, 2017 at 4:11 PM, Patrick  wrote:

> To avoid confusion, the query I am referring to above is over some numeric
> element inside *a: struct (nullable = true).*
>
> On Mon, Jul 24, 2017 at 4:04 PM, Patrick  wrote:
>
>> Hi,
>>
>> On reading a complex JSON, Spark infers schema as following:
>>
>> root
>>  |-- header: struct (nullable = true)
>>  ||-- deviceId: string (nullable = true)
>>  ||-- sessionId: string (nullable = true)
>>  |-- payload: struct (nullable = true)
>>  ||-- deviceObjects: array (nullable = true)
>>  |||-- element: struct (containsNull = true)
>>  ||||-- additionalPayload: array (nullable = true)
>>  |||||-- element: struct (containsNull = true)
>>  ||||||-- data: struct (nullable = true)
>>  |||||||-- *a: struct (nullable = true)*
>>  ||||||||-- address: string (nullable = true)
>>
>> When we save the above Json in parquet using Spark sql we get only two
>> top level columns "header" and "payload" in parquet.
>>
>> So now we want to do a mean calculation on element  *a: struct (nullable
>> = true)*
>>
>> With reference to the Databricks blog for handling complex JSON
>> https://databricks.com/blog/2017/02/23/working-complex-data-
>> formats-structured-streaming-apache-spark-2-1.html
>>
>> *"when using Parquet, all struct columns will receive the same treatment
>> as top-level columns. Therefore, if you have filters on a nested field, you
>> will get the same benefits as a top-level column."*
>>
>> Referring to the above statement, will parquet treat *a: struct
>> (nullable = true)* as top-level column struct and SQL query on the
>> Dataset will be optimized?
>>
>> If not, do we need to externally impose the schema by exploding the
>> complex type before writing to parquet in order to get top-level column
>> benefit? What we can do with Spark 2.1, to extract the best performance
>> over such nested structure like *a: struct (nullable = true).*
>>
>> Thanks
>>
>>
>
>


Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Xiayun Sun
I'm guessing by "part files" you mean files like part-r-0. These are
actually different from hadoop "block size", which is the value actually
used in partitions.

Looks like your hdfs block size is the default 128mb: 258.2GB in 500 part
files -> around 528mb per part file -> each part file would take a little
more than 4 blocks -> total that would be around 2000 blocks.

You cannot set fewer partitions than blocks; that's why 500 does not work
(from the Spark doc):

The textFile method also takes an optional second argument for controlling
> the number of partitions of the file. By default, Spark creates one
> partition for each block of the file (blocks being 128MB by default in
> HDFS), but you can also ask for a higher number of partitions by passing a
> larger value. Note that you cannot have fewer partitions than blocks.



Now as to why 3000 gives you 3070 partitions: Spark uses Hadoop's
InputFormat.getSplits, and the desired number of partitions is at most a hint,
so the final result can be a bit different.
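
If the goal is simply fewer tasks, one option is to reduce the partition count
after the read (a sketch; the path is a placeholder). minPartitions is only a
hint for the minimum and cannot go below the number of blocks:

val inputRdd = sc.textFile("hdfs:///path/to/input")   // ~2290 block-based partitions
val fewer = inputRdd.coalesce(500)                     // merge partitions, no shuffle
fewer.count()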

On 25 July 2017 at 19:54, Gokula Krishnan D  wrote:

> Excuse for the too many mails on this post.
>
> found a similar issue https://stackoverflow.com/questions/24671755/how-to-
> partition-a-rdd
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D 
> wrote:
>
>> In addition to that,
>>
>> tried to read the same file with 3000 partitions but it used 3070
>> partitions. And took more time than previous please refer the attachment.
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D 
>> wrote:
>>
>>> Hello All,
>>>
>>> I have a HDFS file with approx. *1.5 Billion records* with 500 Part
>>> files (258.2GB Size) and when I tried to execute the following I could see
>>> that it used 2290 tasks but it supposed to be 500 as like HDFS File, isn't
>>> it?
>>>
>>> val inputFile = 
>>> val inputRdd = sc.textFile(inputFile)
>>> inputRdd.count()
>>>
>>> I was hoping that I can do the same with the fewer partitions so tried
>>> the following
>>>
>>> val inputFile = 
>>> val inputrddnqew = sc.textFile(inputFile,500)
>>> inputRddNew.count()
>>>
>>> But still it used 2290 tasks.
>>>
>>> As per scala doc, it supposed use as like the HDFS file i.e 500.
>>>
>>> It would be great if you could throw some insight on this.
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan* (Gokul)*
>>>
>>
>>
>


-- 
Xiayun Sun

Home is behind, the world ahead,
and there are many paths to tread
through shadows to the edge of night,
until the stars are all alight.


Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
sc.textFile will use the Hadoop TextInputFormat (I believe), this will use the 
Hadoop block size to read records from HDFS. Most likely the block size is 
128MB. Not sure you can do anything about the number of tasks generated to read 
from HDFS.
---
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action 






> On 25 Jul 2017, at 13:21, Gokula Krishnan D  wrote:
> 
> In addition to that, 
> 
> tried to read the same file with 3000 partitions but it used 3070 partitions. 
> And took more time than previous please refer the attachment. 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 
> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D  > wrote:
> Hello All, 
> 
> I have a HDFS file with approx. 1.5 Billion records with 500 Part files 
> (258.2GB Size) and when I tried to execute the following I could see that it 
> used 2290 tasks but it supposed to be 500 as like HDFS File, isn't it?
> 
> val inputFile = 
> val inputRdd = sc.textFile(inputFile)
> inputRdd.count()
> 
> I was hoping that I can do the same with the fewer partitions so tried the 
> following 
> 
> val inputFile = 
> val inputrddnqew = sc.textFile(inputFile,500)
> inputRddNew.count()
> 
> But still it used 2290 tasks.
> 
> As per scala doc, it supposed use as like the HDFS file i.e 500. 
> 
> It would be great if you could throw some insight on this. 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 


Re: Spark Data Frame Writer - Range Partitioning

2017-07-25 Thread Jain, Nishit
But wouldn’t a partitioning column partition the data only in the Spark RDD? Would it 
also partition the data on disk when it is written (dividing the data into folders)?

From: ayan guha >
Date: Friday, July 21, 2017 at 3:25 PM
To: "Jain, Nishit" >, 
"user@spark.apache.org" 
>
Subject: Re: Spark Data Frame Writer - Range Partitioning

How about creating a partition column and using it?

On Sat, 22 Jul 2017 at 2:47 am, Jain, Nishit 
> wrote:

Is it possible to have Spark Data Frame Writer write based on RangePartioning?

For Ex -

I have 10 distinct values for column_a, say 1 to 10.

df.write
.partitionBy("column_a")


Above code by default will create 10 folders .. column_a=1,column_a=2 
...column_a=10

I want to see if it is possible to have these partitions based on bucket - 
col_a=1to5, col_a=5-10 .. or something like that? Then also have query engine 
respect it

Thanks,

Nishit

--
Best Regards,
Ayan Guha
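
A sketch of that suggestion: since partitionBy on the DataFrameWriter does lay
the data out on disk as one folder per value, deriving an explicit bucket
column gives the coarser folder layout asked about (the bucket boundaries and
the output path below are assumptions):

import org.apache.spark.sql.functions._

// Derive a coarse bucket from column_a, then partition the written output by it.
val bucketed = df.withColumn("column_a_bucket",
  when(col("column_a") <= 5, "1to5").otherwise("6to10"))

// Produces folders column_a_bucket=1to5/ and column_a_bucket=6to10/ on disk.
bucketed.write.partitionBy("column_a_bucket").parquet("/path/to/output")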


Re: real world spark code

2017-07-25 Thread Jörn Franke
Look for the ones that have unit and integration tests as well as a 
ci+reporting on code quality.

All the others are just toy examples. Well should be :)

> On 25. Jul 2017, at 01:08, Adaryl Wakefield  
> wrote:
> 
> Anybody know of publicly available GitHub repos of real world Spark 
> applications written in scala?
>  
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
> www.massstreet.net
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>  
>  


Re: real world spark code

2017-07-25 Thread Xiayun Sun
usually I look in github repos of those big name companies that I know are
actively doing machine learning.

For example, here are two spark-related repos from soundcloud:
- https://github.com/soundcloud/spark-pagerank
- https://github.com/soundcloud/cosine-lsh-join-spark

On 25 July 2017 at 06:08, Adaryl Wakefield 
wrote:

> Anybody know of publicly available GitHub repos of real world Spark
> applications written in scala?
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData 
>
>
>
>
>



-- 
Xiayun Sun

Home is behind, the world ahead,
and there are many paths to tread
through shadows to the edge of night,
until the stars are all alight.


Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Excuse the many mails on this post.

found a similar issue
https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D 
wrote:

> In addition to that,
>
> tried to read the same file with 3000 partitions but it used 3070
> partitions. And took more time than previous please refer the attachment.
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D 
> wrote:
>
>> Hello All,
>>
>> I have a HDFS file with approx. *1.5 Billion records* with 500 Part
>> files (258.2GB Size) and when I tried to execute the following I could see
>> that it used 2290 tasks but it supposed to be 500 as like HDFS File, isn't
>> it?
>>
>> val inputFile = 
>> val inputRdd = sc.textFile(inputFile)
>> inputRdd.count()
>>
>> I was hoping that I can do the same with the fewer partitions so tried
>> the following
>>
>> val inputFile = 
>> val inputrddnqew = sc.textFile(inputFile,500)
>> inputRddNew.count()
>>
>> But still it used 2290 tasks.
>>
>> As per scala doc, it supposed use as like the HDFS file i.e 500.
>>
>> It would be great if you could throw some insight on this.
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>>
>
>


Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
In addition to that,

I tried to read the same file with 3000 partitions, but it used 3070
partitions and took more time than before; please refer to the attachment.

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D 
wrote:

> Hello All,
>
> I have a HDFS file with approx. *1.5 Billion records* with 500 Part files
> (258.2GB Size) and when I tried to execute the following I could see that
> it used 2290 tasks but it supposed to be 500 as like HDFS File, isn't it?
>
> val inputFile = 
> val inputRdd = sc.textFile(inputFile)
> inputRdd.count()
>
> I was hoping that I can do the same with the fewer partitions so tried the
> following
>
> val inputFile = 
> val inputrddnqew = sc.textFile(inputFile,500)
> inputRddNew.count()
>
> But still it used 2290 tasks.
>
> As per scala doc, it supposed use as like the HDFS file i.e 500.
>
> It would be great if you could throw some insight on this.
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

[Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Hello All,

I have an HDFS file with approx. *1.5 billion records* in 500 part files
(258.2GB in size), and when I tried to execute the following I could see that
it used 2290 tasks, but it was supposed to be 500, like the HDFS file, wasn't it?

val inputFile = 
val inputRdd = sc.textFile(inputFile)
inputRdd.count()

I was hoping that I could do the same with fewer partitions, so I tried the
following:

val inputFile = 
val inputrddnqew = sc.textFile(inputFile,500)
inputRddNew.count()

But still it used 2290 tasks.

As per the Scala doc, it is supposed to use the same number as the HDFS file, i.e., 500.

It would be great if you could throw some insight on this.

Thanks & Regards,
Gokula Krishnan* (Gokul)*

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

some Ideas on expressing Spark SQL using JSON

2017-07-25 Thread kant kodali
Hi All,

I am thinking of expressing Spark SQL using JSON in the following way.

For Example:

*Query using Spark DSL*

DS.filter(col("name").equalTo("john"))
.groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours",
"24 hours"), df1.col("hourlyPay"))
.agg(sum("hourlyPay").as("total"));


*Query using JSON*

[inline image of the JSON query omitted]

The goal is to design a DSL in JSON such that users can express Spark
SQL queries in JSON, so they can send Spark SQL queries over REST and get
the results back. Now, I am sure there are BI tools and notebooks like
Zeppelin that can accomplish the desired behavior; however, I believe there
may be a group of users who don't want to use those BI tools or notebooks;
instead, they want all the communication from front end to back end to go
through APIs.
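
As a very rough sketch of the idea, the backend could parse the incoming JSON
into a small spec and map it onto the DSL (the QuerySpec shape below is purely
illustrative, not a proposed schema):

import org.apache.spark.sql.{DataFrame, functions => F}

// Purely illustrative spec that a JSON payload might be parsed into.
case class QuerySpec(filterCol: String, filterValue: String,
                     windowCol: String, windowDuration: String,
                     groupCol: String, aggCol: String, alias: String)

// Map the parsed spec onto the equivalent DataFrame DSL calls.
def applySpec(ds: DataFrame, spec: QuerySpec): DataFrame =
  ds.filter(F.col(spec.filterCol).equalTo(spec.filterValue))
    .groupBy(F.window(F.col(spec.windowCol), spec.windowDuration, spec.windowDuration),
             F.col(spec.groupCol))
    .agg(F.sum(spec.aggCol).as(spec.alias))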

Another goal is that the JSON DSL design should closely mimic the
underlying Spark SQL DSL.

Please feel free to provide feedback or criticism to whatever extent
you like!

Thanks!


Re: ClassNotFoundException for Workers

2017-07-25 Thread 周康
Ensure com.amazonaws.services.s3.AmazonS3ClientBuilder is on your classpath,
which includes your application jar and the jars attached to the executors.

2017-07-20 6:12 GMT+08:00 Noppanit Charassinvichai :

> I have this spark job which is using S3 client in mapPartition. And I get
> this error
>
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most
> recent failure: Lost task 0.3 in stage 3.0 (TID 74,
> ip-10-90-78-177.ec2.internal, executor 11): java.lang.NoClassDefFoundError:
> Could not initialize class com.amazonaws.services.s3.AmazonS3ClientBuilder
> +details
> Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most
> recent failure: Lost task 0.3 in stage 3.0 (TID 74,
> ip-10-90-78-177.ec2.internal, executor 11): java.lang.NoClassDefFoundError:
> Could not initialize class com.amazonaws.services.s3.AmazonS3ClientBuilder
> at SparrowOrc$$anonfun$1.apply(sparrowOrc.scala:49)
> at SparrowOrc$$anonfun$1.apply(sparrowOrc.scala:46)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:796)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$
> anonfun$apply$23.apply(RDD.scala:796)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:282)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> This is my code
> val jsonRows = sqs.mapPartitions(partitions => {
>   val s3Client = AmazonS3ClientBuilder.standard().withCredentials(new
> DefaultCredentialsProvider).build()
>
>   val txfm = new LogLine2Json
>   val log = Logger.getLogger("parseLog")
>
>   partitions.flatMap(messages => {
> val sqsMsg = Json.parse(messages)
> val bucketName = Json.stringify(sqsMsg("
> Records")(0)("s3")("bucket")("name")).replace("\"", "")
> val key = 
> Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"",
> "")
> val obj = s3Client.getObject(new GetObjectRequest(bucketName, key))
> val stream = obj.getObjectContent()
>
> scala.io.Source.fromInputStream(stream).getLines().map(line => {
>   try {
> txfm.parseLine(line)
>   }
>   catch {
> case e: Throwable => {
>   log.info(line); "{}";
> }
>   }
> }).filter(line => line != "{}")
>   })
> })
>
> This is my build.sbt
>
> name := "sparrow-to-orc"
>
> version := "0.1"
>
> scalaVersion := "2.11.8"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.1.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0" %
> "provided"
>
> libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" %
> "provided"
> libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3" %
> "provided"
> libraryDependencies += "com.cn" %% "sparrow-clf-parser" % "1.1-SNAPSHOT"
>
> libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.155"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.155"
> libraryDependencies += "com.amazonaws" % "aws-java-sdk-core" % "1.11.155"
>
> libraryDependencies += "com.github.seratch" %% "awscala" % "0.6.+"
> libraryDependencies += "com.typesafe.play" %% "play-json" % "2.6.0"
> dependencyOverrides ++= Set("com.fasterxml.jackson.core" %
> "jackson-databind" % "2.6.0")
>
>
>
> assemblyMergeStrategy in assembly := {
>   case PathList("org","aopalliance", xs @ _*) => MergeStrategy.last
>   case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
>   case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
>   case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
>   case PathList("org", "apache", xs @ _*) => MergeStrategy.last
>   case PathList("com", "google", xs @ _*) => MergeStrategy.last
>   case PathList("com", "esotericsoftware", xs @ _*) => 

What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread kant kodali
Hi All,

I just want to run some spark structured streaming Job similar to this

DS.filter(col("name").equalTo("john"))
.groupBy(functions.window(df1.col("TIMESTAMP"), "24 hours",
"24 hours"), df1.col("hourlyPay"))
.agg(sum("hourlyPay").as("total"));


I am wondering if I can express the above query as a raw SQL string?

If so, what would that look like, and what are some of the disadvantages
of using a raw SQL query vs. the Spark DSL?


Thanks!