Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Nicholas Chammas
I don't have a dog in this race, but: Would it be OK to ship 3.0 with some
release notes and/or prominent documentation calling out this issue, and
then fix it in 3.0.1?

On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim 
wrote:

> I'd say SPARK-31257 is an open blocker, because the change in the upcoming
> Spark 3.0 makes CREATE TABLE ambiguous, and once it's shipped it will be
> harder to correct again.
>
> On Sun, Mar 29, 2020 at 4:53 AM Reynold Xin  wrote:
>
>> Let's start cutting RC next week.
>>
>>
>> On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen  wrote:
>>
>>> I'm also curious - there are no open blockers for 3.0, but I know a few
>>> open items to revert changes are still floating around. What is the status
>>> there? From my field of view I'm not aware of other blocking issues.
>>>
>>> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Now the end of March is just around the corner. I'm not qualified to
 say (and honestly don't know) where we are, but if we were intended to be
 in blocker mode, it doesn't seem to be working; lots of development is
 still happening, and priority/urgency doesn't seem to be applied to the
 order of reviewing.

 How about listing (or linking to an epic, or labelling) the JIRA issues/PRs
 which are blockers (either by priority or technically) for the Spark 3.0
 release, and making it clear that we should try to review these blockers
 first? A GitHub PR label may help here to filter out other PRs and
 concentrate on these.

 Thanks,
 Jungtaek Lim (HeartSaVioR)


 On Wed, Mar 25, 2020 at 1:52 PM Xiao Li  wrote:

> Let us try to finish the remaining major blockers in the next few
> days. For example, https://issues.apache.org/jira/browse/SPARK-31085
>
> +1 to cut the RC even if we still have the blockers that will fail the
> RCs.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thanks,
>> Dongjoon.
>>
>> On Tue, Mar 24, 2020 at 14:49 Reynold Xin 
>> wrote:
>>
>>> I actually think we should start cutting RCs. We can cut RCs even
>>> with blockers.
>>>
>>>
>>> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 First of all, always "Community Over Code"!
 I wish you the best health and happiness.

 As we know, we are still in the QA period and haven't reached the RC
 stage. It seems that we need to bring the website up to date once more.

 https://spark.apache.org/versioning-policy.html

 If possible, it would be really great if we could get the `3.0.0` release
 manager's official `branch-3.0` assessment, because we have only one week
 before the end of March.

 Could you, the 3.0.0 release manager, share your thoughts and update
 the website, please?

 Bests
 Dongjoon.

>>>
>>>
>
> --
> 
>

>>


Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Jungtaek Lim
I'd say SPARK-31257 is an open blocker, because the change in the upcoming
Spark 3.0 makes CREATE TABLE ambiguous, and once it's shipped it will be
harder to correct again.
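
For context, a minimal sketch of the kind of statement at issue, assuming the
ambiguity concerns CREATE TABLE statements with no explicit provider clause
(this is only shorthand for the problem; the precise semantics are spelled out
in the JIRA):

import org.apache.spark.sql.SparkSession

object CreateTableAmbiguitySketch {
  def main(args: Array[String]): Unit = {
    // Local master and Hive support only for illustration; assumes Hive
    // classes are on the classpath.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-31257 sketch")
      .enableHiveSupport()
      .getOrCreate()

    // With neither USING nor STORED AS, it is not obvious from the statement
    // alone whether a native data source table or a Hive SerDe table is created.
    spark.sql("CREATE TABLE t (id INT, name STRING)")

    // Spelling out the provider removes the ambiguity in either direction.
    spark.sql("CREATE TABLE t_native (id INT, name STRING) USING parquet")
    spark.sql("CREATE TABLE t_hive (id INT, name STRING) STORED AS parquet")
  }
}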

On Sun, Mar 29, 2020 at 4:53 AM Reynold Xin  wrote:

> Let's start cutting RC next week.
>
>
> On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen  wrote:
>
>> I'm also curious - there are no open blockers for 3.0, but I know a few
>> open items to revert changes are still floating around. What is the status
>> there? From my field of view I'm not aware of other blocking issues.
>>
>> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Now the end of March is just around the corner. I'm not qualified to say
>>> (and honestly don't know) where we are, but if we were intended to be in
>>> blocker mode, it doesn't seem to be working; lots of development is still
>>> happening, and priority/urgency doesn't seem to be applied to the order of
>>> reviewing.
>>>
>>> How about listing (or linking to an epic, or labelling) the JIRA issues/PRs
>>> which are blockers (either by priority or technically) for the Spark 3.0
>>> release, and making it clear that we should try to review these blockers
>>> first? A GitHub PR label may help here to filter out other PRs and
>>> concentrate on these.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Wed, Mar 25, 2020 at 1:52 PM Xiao Li  wrote:
>>>
 Let us try to finish the remaining major blockers in the next few days.
 For example, https://issues.apache.org/jira/browse/SPARK-31085

 +1 to cut the RC even if we still have the blockers that will fail the
 RCs.

 Cheers,

 Xiao


 On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun 
 wrote:

> +1
>
> Thanks,
> Dongjoon.
>
> On Tue, Mar 24, 2020 at 14:49 Reynold Xin  wrote:
>
>> I actually think we should start cutting RCs. We can cut RCs even
>> with blockers.
>>
>>
>> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> First of all, always "Community Over Code"!
>>> I wish you the best health and happiness.
>>>
>>> As we know, we are still in the QA period and haven't reached the RC
>>> stage. It seems that we need to bring the website up to date once more.
>>>
>>> https://spark.apache.org/versioning-policy.html
>>>
>>> If possible, it would be really great if we could get the `3.0.0` release
>>> manager's official `branch-3.0` assessment, because we have only one week
>>> before the end of March.
>>>
>>> Could you, the 3.0.0 release manager, share your thoughts and update
>>> the website, please?
>>>
>>> Bests
>>> Dongjoon.
>>>
>>
>>

 --
 

>>>
>


Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Reynold Xin
Let's start cutting RC next week.

On Sat, Mar 28, 2020 at 11:51 AM, Sean Owen < sro...@gmail.com > wrote:

> 
> I'm also curious - there are no open blockers for 3.0, but I know a few
> open items to revert changes are still floating around. What is the status
> there? From my field of view I'm not aware of other blocking issues.
> 
> On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com > wrote:
> 
> 
>> Now the end of March is just around the corner. I'm not qualified to say
>> (and honestly don't know) where we are, but if we were intended to be in
>> blocker mode, it doesn't seem to be working; lots of development is still
>> happening, and priority/urgency doesn't seem to be applied to the order of
>> reviewing.
>> 
>> 
>> How about listing (or linking to an epic, or labelling) the JIRA issues/PRs
>> which are blockers (either by priority or technically) for the Spark 3.0
>> release, and making it clear that we should try to review these blockers
>> first? A GitHub PR label may help here to filter out other PRs and
>> concentrate on these.
>> 
>> 
>> 
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>> 
>> 
>> 
>> On Wed, Mar 25, 2020 at 1:52 PM Xiao Li < lix...@databricks.com > wrote:
>> 
>> 
>>> Let us try to finish the remaining major blockers in the next few days.
>>> For example, https://issues.apache.org/jira/browse/SPARK-31085
>>> 
>>> 
>>> +1 to cut the RC even if we still have the blockers that will fail the
>>> RCs. 
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Xiao
>>> 
>>> 
>>> 
>>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun < dongjoon.h...@gmail.com >
>>> wrote:
>>> 
>>> 
 +1
 
 
 Thanks,
 Dongjoon.
 
 On Tue, Mar 24, 2020 at 14:49 Reynold Xin < r...@databricks.com > wrote:
 
 
> I actually think we should start cutting RCs. We can cut RCs even with
> blockers.
> 
> 
> 
> On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun < dongjoon.h...@gmail.com >
> wrote:
> 
>> Hi, All.
>> 
>> First of all, always "Community Over Code"!
>> I wish you the best health and happiness.
>> 
>> As we know, we are still in the QA period and haven't reached the RC stage.
>> It seems that we need to bring the website up to date once more.
>>
>>    https://spark.apache.org/versioning-policy.html
>> 
>> If possible, it would be really great if we could get the `3.0.0` release
>> manager's official `branch-3.0` assessment, because we have only one week
>> before the end of March.
>>
>> Could you, the 3.0.0 release manager, share your thoughts and update the
>> website, please?
>> 
>> 
>> Bests
>> Dongjoon.
>> 
> 
> 
> 
> 
 
 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> ( https://databricks.com/sparkaisummit/north-america )
>>> 
>> 
>> 
> 
>



Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Sean Owen
I'm also curious - there are no open blockers for 3.0, but I know a few open
items to revert changes are still floating around. What is the status there?
From my field of view I'm not aware of other blocking issues.

On Fri, Mar 27, 2020 at 10:56 PM Jungtaek Lim 
wrote:

> Now the end of March is just around the corner. I'm not qualified to say
> (and honestly don't know) where we are, but if we were intended to be in
> blocker mode, it doesn't seem to be working; lots of development is still
> happening, and priority/urgency doesn't seem to be applied to the order of
> reviewing.
>
> How about listing (or linking to an epic, or labelling) the JIRA issues/PRs
> which are blockers (either by priority or technically) for the Spark 3.0
> release, and making it clear that we should try to review these blockers
> first? A GitHub PR label may help here to filter out other PRs and
> concentrate on these.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Wed, Mar 25, 2020 at 1:52 PM Xiao Li  wrote:
>
>> Let us try to finish the remaining major blockers in the next few days.
>> For example, https://issues.apache.org/jira/browse/SPARK-31085
>>
>> +1 to cut the RC even if we still have the blockers that will fail the
>> RCs.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> On Tue, Mar 24, 2020 at 6:56 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On Tue, Mar 24, 2020 at 14:49 Reynold Xin  wrote:
>>>
 I actually think we should start cutting RCs. We can cut RCs even with
 blockers.


 On Tue, Mar 24, 2020 at 12:51 PM, Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> First of all, always "Community Over Code"!
> I wish you the best health and happiness.
>
> As we know, we are still in the QA period and haven't reached the RC
> stage. It seems that we need to bring the website up to date once more.
>
> https://spark.apache.org/versioning-policy.html
>
> If possible, it would be really great if we could get the `3.0.0` release
> manager's official `branch-3.0` assessment, because we have only one week
> before the end of March.
>
> Could you, the 3.0.0 release manager, share your thoughts and update
> the website, please?
>
> Bests
> Dongjoon.
>


>>
>> --
>> 
>>
>


Beautiful Spark Code

2020-03-28 Thread Zahid Rahman
You will be pleased to learn that Mr. Mathew Powers has seen to my needs
and answered all my questions.

He has seen to all my needs.

Mr Powers  has shut me up !!!

Mr Powers has made Google search, Stack Overflow and u...@spark.apache.org
redundant.

That is all you guys and girl had to do: point me to his book.

https://leanpub.com/beautiful-spark

https://mungingdata.com/writing-beautiful-apache-spark2-code-with-scala/




On Sat, 28 Mar 2020, 16:49 Zahid Rahman,  wrote:

> Thanks for the tip!
>
> But if the first thing you come across is somebody using the trim function
> to strip away the spaces in /etc/hostnames, like so, from:
>
> 127.0.0.1 hostname local
>
> To
>
> 127.0.0.1hostnamelocal
>
> Then there is a log error message showing the outcome of unnecessarily
> using the trim function.
>
> Especially when one of Spark's core functionalities is to read lines from
> files with fields separated by a space or a comma.
>
> Also, have you seen the log4j.properties setting set to ERROR, and in one
> case FATAL, suppressing discrepancies?
>
> Please may I draw your attention, and the attention of everyone in the
> community, to this page, which recommends turning on compiler WARNINGS
> before releasing software, among other software best practices.
>
> “The Power of 10 — NASA’s Rules for Coding” by Riccardo Giorato
> https://link.medium.com/PUz88PIql3
>
> What impression would you have?
>
>
>
> On Sat, 28 Mar 2020, 15:50 Jeff Evans, 
> wrote:
>
>> Dude, you really need to chill. Have you ever worked with a large open
>> source project before? It seems not. Even so, insinuating there are tons of
>> bugs that were left uncovered until you came along (despite the fact that
>> the project is used by millions across many different organizations) is
>> ludicrous. Learn a little bit of humility.
>>
>> If you're new to something, assume you have made a mistake rather than
>> that there is a bug. Lurk a bit more, or even do a simple Google search,
>> and you will realize Sean is a very senior committer (i.e. expert) in
>> Spark, and has been for many years. He, and everyone else participating in
>> these lists, is doing it voluntarily on their own time. They're not being
>> paid to handhold you and quickly answer to your every whim.
>>
>> On Sat, Mar 28, 2020, 10:46 AM Zahid Rahman  wrote:
>>
>>> So the schema is limited to holding only the DEFINITION of the schema.
>>> For example, as you say, the columns, i.e. first column User: Int, second
>>> column Password: String.
>>>
>>> Not the location of the source, i.e. a CSV file with or without a header,
>>> or SQL DB tables.
>>>
>>> I am pleased that for once I am wrong about it being another bug, and that
>>> it was a design decision adding flexibility.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, 28 Mar 2020, 15:24 Russell Spitzer, 
>>> wrote:
>>>
 This is probably more of a question for the user support list, but I
 believe I understand the issue.

 Schema inside of spark refers to the structure of the output rows, for
 example the schema for a particular dataframe could be
 (User: Int, Password: String) - Two Columns the first is User of type
 int and the second is Password of Type String.

 When you pass the schema from one reader to another, you are only
 copying this structure, not all of the other options associated with the
 dataframe.
 This is usually useful when you are reading from sources with different
 options but data that needs to be read into the same structure.

 The other properties such as "format" and "options" exist independently
 of Schema. This is helpful if I was reading from both MySQL and
 a comma separated file for example. While the Schema is the same, the
 options like ("inferSchema") do not apply to both MySql and CSV and
 format actually picks whether to use "JDBC" or "CSV" so copying that
 wouldn't be helpful either.

 I hope this clears things up,
 Russ

 On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman 
 wrote:

> Hi,
> version: spark-3.0.0-preview2-bin-hadoop2.7
>
> As you can see from the code :
>
> STEP 1: I create an object of type static data frame which holds all the
> information about the data source (CSV files).
>
> STEP 2: Then I create a variable called staticSchema, assigning it the
> schema information from the original static data frame.
>
> STEP 3: Then I create another variable called streamingDataFrame of type
> spark.readStream, and into the .schema function's parameters I pass the
> object staticSchema, which is meant to hold the information about the CSV
> files, including the .load(path) function etc.
>
> So then, when I am creating val streamingDataFrame and passing it
> .schema(staticSchema), the variable streamingDataFrame should have all the
> information. I should only have to call .option("maxFilesPerTrigger",1) and not
> .format ("csv")
> .option("header","true").load("/data/retail-dat

Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Zahid Rahman
Thanks for the tip!

But if the first thing you come across is somebody using the trim function to
strip away the spaces in /etc/hostnames, like so, from:

127.0.0.1 hostname local

To

127.0.0.1hostnamelocal

Then there is a log error message showing the outcome of unnecessarily
using the trim function.

Especially when one of Spark's core functionalities is to read lines from
files with fields separated by a space or a comma.

Also, have you seen the log4j.properties setting set to ERROR, and in one case
FATAL, suppressing discrepancies?

Please may I draw your attention, and the attention of everyone in the
community, to this page, which recommends turning on compiler WARNINGS before
releasing software, among other software best practices.

“The Power of 10 — NASA’s Rules for Coding” by Riccardo Giorato
https://link.medium.com/PUz88PIql3

What impression would you have?



On Sat, 28 Mar 2020, 15:50 Jeff Evans, 
wrote:

> Dude, you really need to chill. Have you ever worked with a large open
> source project before? It seems not. Even so, insinuating there are tons of
> bugs that were left uncovered until you came along (despite the fact that
> the project is used by millions across many different organizations) is
> ludicrous. Learn a little bit of humility.
>
> If you're new to something, assume you have made a mistake rather than
> that there is a bug. Lurk a bit more, or even do a simple Google search,
> and you will realize Sean is a very senior committer (i.e. expert) in
> Spark, and has been for many years. He, and everyone else participating in
> these lists, is doing it voluntarily on their own time. They're not being
> paid to handhold you and quickly answer to your every whim.
>
> On Sat, Mar 28, 2020, 10:46 AM Zahid Rahman  wrote:
>
>> So the schema is limited to holding only the DEFINITION of the schema.
>> For example, as you say, the columns, i.e. first column User: Int, second
>> column Password: String.
>>
>> Not the location of the source, i.e. a CSV file with or without a header,
>> or SQL DB tables.
>>
>> I am pleased that for once I am wrong about it being another bug, and that
>> it was a design decision adding flexibility.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sat, 28 Mar 2020, 15:24 Russell Spitzer, 
>> wrote:
>>
>>> This is probably more of a question for the user support list, but I
>>> believe I understand the issue.
>>>
>>> Schema inside of spark refers to the structure of the output rows, for
>>> example the schema for a particular dataframe could be
>>> (User: Int, Password: String) - Two Columns the first is User of type
>>> int and the second is Password of Type String.
>>>
>>> When you pass the schema from one reader to another, you are only
>>> copying this structure, not all of the other options associated with the
>>> dataframe.
>>> This is usually useful when you are reading from sources with different
>>> options but data that needs to be read into the same structure.
>>>
>>> The other properties such as "format" and "options" exist independently
>>> of Schema. This is helpful if I was reading from both MySQL and
>>> a comma separated file for example. While the Schema is the same, the
>>> options like ("inferSchema") do not apply to both MySql and CSV and
>>> format actually picks whether to use "JDBC" or "CSV" so copying that
>>> wouldn't be helpful either.
>>>
>>> I hope this clears things up,
>>> Russ
>>>
>>> On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman 
>>> wrote:
>>>
 Hi,
 version: spark-3.0.0-preview2-bin-hadoop2.7

 As you can see from the code :

 STEP 1: I create an object of type static data frame which holds all the
 information about the data source (CSV files).

 STEP 2: Then I create a variable called staticSchema, assigning it the
 schema information from the original static data frame.

 STEP 3: Then I create another variable called streamingDataFrame of type
 spark.readStream, and into the .schema function's parameters I pass the
 object staticSchema, which is meant to hold the information about the CSV
 files, including the .load(path) function etc.

 So then, when I am creating val streamingDataFrame and passing it
 .schema(staticSchema), the variable streamingDataFrame should have all the
 information. I should only have to call .option("maxFilesPerTrigger",1) and not
 .format("csv") .option("header","true").load("/data/retail-data/by-day/*.csv").
 Otherwise, what is the point of passing .schema(staticSchema) to
 streamingDataFrame?

 You can replicate it using the complete code below.

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.{window,column,desc,col}

 object RetailData {

   def main(args: Array[String]): Unit = {

 // create spark session
 val spark = 
 SparkSession.builder().master("spark://192.168.0.38:7077").appName("Retail 
 Data").getOrCreate();
 // set spark runtime  configuration
 spark.conf.set("spark.sql.s

Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Zahid Rahman
So the schema is limited to holding only the DEFINITION of the schema. For
example, as you say, the columns, i.e. first column User: Int, second column
Password: String.

Not the location of the source, i.e. a CSV file with or without a header, or
SQL DB tables.

I am pleased that for once I am wrong about it being another bug, and that it
was a design decision adding flexibility.









On Sat, 28 Mar 2020, 15:24 Russell Spitzer, 
wrote:

> This is probably more of a question for the user support list, but I
> believe I understand the issue.
>
> Schema inside of spark refers to the structure of the output rows, for
> example the schema for a particular dataframe could be
> (User: Int, Password: String) - Two Columns the first is User of type int
> and the second is Password of Type String.
>
> When you pass the schema from one reader to another, you are only
> copying this structure, not all of the other options associated with the
> dataframe.
> This is usually useful when you are reading from sources with different
> options but data that needs to be read into the same structure.
>
> The other properties such as "format" and "options" exist independently of
> Schema. This is helpful if I was reading from both MySQL and
> a comma separated file for example. While the Schema is the same, the
> options like ("inferSchema") do not apply to both MySql and CSV and
> format actually picks whether to use "JDBC" or "CSV" so copying that
> wouldn't be helpful either.
>
> I hope this clears things up,
> Russ
>
> On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman  wrote:
>
>> Hi,
>> version: spark-3.0.0-preview2-bin-hadoop2.7
>>
>> As you can see from the code :
>>
>> STEP 1: I create an object of type static data frame which holds all the
>> information about the data source (CSV files).
>>
>> STEP 2: Then I create a variable called staticSchema, assigning it the
>> schema information from the original static data frame.
>>
>> STEP 3: Then I create another variable called streamingDataFrame of type
>> spark.readStream, and into the .schema function's parameters I pass the
>> object staticSchema, which is meant to hold the information about the CSV
>> files, including the .load(path) function etc.
>>
>> So then, when I am creating val streamingDataFrame and passing it
>> .schema(staticSchema), the variable streamingDataFrame should have all the
>> information. I should only have to call .option("maxFilesPerTrigger",1) and not
>> .format("csv") .option("header","true").load("/data/retail-data/by-day/*.csv").
>> Otherwise, what is the point of passing .schema(staticSchema) to
>> streamingDataFrame?
>>
>> You can replicate it using the complete code below.
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.{window,column,desc,col}
>>
>> object RetailData {
>>
>>   def main(args: Array[String]): Unit = {
>>
>> // create spark session
>> val spark = 
>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Retail 
>> Data").getOrCreate();
>> // set spark runtime  configuration
>> spark.conf.set("spark.sql.shuffle.partitions","5")
>> 
>> spark.conf.set("spark.sql.streaming.forceDeleteTempCheckpointLocation","True")
>>
>> // create a static frame
>>   val staticDataFrame = spark.read.format("csv")
>> .option ("header","true")
>> .option("inferschema","true")
>> .load("/data/retail-data/by-day/*.csv")
>>
>>
>> staticDataFrame.createOrReplaceTempView("retail_data")
>> val staticSchema = staticDataFrame.schema
>>
>> staticDataFrame
>>   .selectExpr(
>> "CustomerId","UnitPrice * Quantity as total_cost", "InvoiceDate")
>>   .groupBy(col("CustomerId"),
>> window(col("InvoiceDate"),
>> "1 day"))
>>   .sum("total_cost")
>>   .sort(desc("sum(total_cost)"))
>>   .show(2)
>>
>> val streamingDataFrame = spark.readStream
>>   .schema(staticSchema)
>>   .format("csv")
>>   .option("maxFilesPerTrigger", 1)
>>   .option("header","true")
>>   .load("/data/retail-data/by-day/*.csv")
>>
>>   println(streamingDataFrame.isStreaming)
>>
>> // lazy operation so we will need to call a streaming action to start 
>> the action
>> val purchaseByCustomerPerHour = streamingDataFrame
>> .selectExpr(
>>   "CustomerId",
>>   "(UnitPrice * Quantity) as total_cost",
>>   "InvoiceDate")
>> .groupBy(
>>   col("CustomerId"), window(col("InvoiceDate"), "1 day"))
>> .sum("total_cost")
>>
>> // stream action to write to console
>> purchaseByCustomerPerHour.writeStream
>>   .format("console")
>>   .queryName("customer_purchases")
>>   .outputMode("complete")
>>   .start()
>>
>>   } // main
>>
>> } // object
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>


Apache Spark Serverless pool and preemptive execution support

2020-03-28 Thread Mania Abdi
Hi everyone,

I came across Databricks serverless analytics on Spark, and serverless pools.
I was wondering if Apache Spark also supports these pools, and if the Apache
Spark resource manager supports preemptive execution and auto-scaling?

Regards
Mania


Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Zahid Rahman
Very kind of you.

On Sat, 28 Mar 2020, 15:24 Russell Spitzer, 
wrote:

> This is probably more of a question for the user support list, but I
> believe I understand the issue.
>
> Schema inside of spark refers to the structure of the output rows, for
> example the schema for a particular dataframe could be
> (User: Int, Password: String) - Two Columns the first is User of type int
> and the second is Password of Type String.
>
> When you pass the schema from one reader to another, you are only
> copying this structure, not all of the other options associated with the
> dataframe.
> This is usually useful when you are reading from sources with different
> options but data that needs to be read into the same structure.
>
> The other properties such as "format" and "options" exist independently of
> Schema. This is helpful if I was reading from both MySQL and
> a comma separated file for example. While the Schema is the same, the
> options like ("inferSchema") do not apply to both MySql and CSV and
> format actually picks whether to use "JDBC" or "CSV" so copying that
> wouldn't be helpful either.
>
> I hope this clears things up,
> Russ
>
> On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman  wrote:
>
>> Hi,
>> version: spark-3.0.0-preview2-bin-hadoop2.7
>>
>> As you can see from the code :
>>
>> STEP 1: I create an object of type static data frame which holds all the
>> information about the data source (CSV files).
>>
>> STEP 2: Then I create a variable called staticSchema, assigning it the
>> schema information from the original static data frame.
>>
>> STEP 3: Then I create another variable called streamingDataFrame of type
>> spark.readStream, and into the .schema function's parameters I pass the
>> object staticSchema, which is meant to hold the information about the CSV
>> files, including the .load(path) function etc.
>>
>> So then, when I am creating val streamingDataFrame and passing it
>> .schema(staticSchema), the variable streamingDataFrame should have all the
>> information. I should only have to call .option("maxFilesPerTrigger",1) and not
>> .format("csv") .option("header","true").load("/data/retail-data/by-day/*.csv").
>> Otherwise, what is the point of passing .schema(staticSchema) to
>> streamingDataFrame?
>>
>> You can replicate it using the complete code below.
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.{window,column,desc,col}
>>
>> object RetailData {
>>
>>   def main(args: Array[String]): Unit = {
>>
>> // create spark session
>> val spark = 
>> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Retail 
>> Data").getOrCreate();
>> // set spark runtime  configuration
>> spark.conf.set("spark.sql.shuffle.partitions","5")
>> 
>> spark.conf.set("spark.sql.streaming.forceDeleteTempCheckpointLocation","True")
>>
>> // create a static frame
>>   val staticDataFrame = spark.read.format("csv")
>> .option ("header","true")
>> .option("inferschema","true")
>> .load("/data/retail-data/by-day/*.csv")
>>
>>
>> staticDataFrame.createOrReplaceTempView("retail_data")
>> val staticSchema = staticDataFrame.schema
>>
>> staticDataFrame
>>   .selectExpr(
>> "CustomerId","UnitPrice * Quantity as total_cost", "InvoiceDate")
>>   .groupBy(col("CustomerId"),
>> window(col("InvoiceDate"),
>> "1 day"))
>>   .sum("total_cost")
>>   .sort(desc("sum(total_cost)"))
>>   .show(2)
>>
>> val streamingDataFrame = spark.readStream
>>   .schema(staticSchema)
>>   .format("csv")
>>   .option("maxFilesPerTrigger", 1)
>>   .option("header","true")
>>   .load("/data/retail-data/by-day/*.csv")
>>
>>   println(streamingDataFrame.isStreaming)
>>
>> // lazy operation so we will need to call a streaming action to start 
>> the action
>> val purchaseByCustomerPerHour = streamingDataFrame
>> .selectExpr(
>>   "CustomerId",
>>   "(UnitPrice * Quantity) as total_cost",
>>   "InvoiceDate")
>> .groupBy(
>>   col("CustomerId"), window(col("InvoiceDate"), "1 day"))
>> .sum("total_cost")
>>
>> // stream action to write to console
>> purchaseByCustomerPerHour.writeStream
>>   .format("console")
>>   .queryName("customer_purchases")
>>   .outputMode("complete")
>>   .start()
>>
>>   } // main
>>
>> } // object
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>


Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Russell Spitzer
This is probably more of a question for the user support list, but I
believe I understand the issue.

Schema inside of spark refers to the structure of the output rows, for
example the schema for a particular dataframe could be
(User: Int, Password: String) - Two Columns the first is User of type int
and the second is Password of Type String.

When you pass the schema from one reader to another, you are only
copying this structure, not all of the other options associated with the
dataframe.
This is usually useful when you are reading from sources with different
options but data that needs to be read into the same structure.

The other properties such as "format" and "options" exist independently of
Schema. This is helpful if I was reading from both MySQL and
a comma separated file for example. While the Schema is the same, the
options like ("inferSchema") do not apply to both MySql and CSV and
format actually picks whether to use "JDBC" or "CSV" so copying that
wouldn't be helpful either.
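
To make that concrete, here is a minimal sketch in the spirit of the code
quoted below (the paths come from that code; everything else is illustrative
rather than a definitive recipe): the schema captured from the batch reader
carries only column names and types, so the streaming reader still has to
restate its own format, options and load path.

import org.apache.spark.sql.SparkSession

object SchemaReuseSketch {
  def main(args: Array[String]): Unit = {
    // Local master just for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Schema Reuse")
      .getOrCreate()

    // Batch reader: infer the schema once from the static CSV data.
    val staticDataFrame = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/retail-data/by-day/*.csv")

    // Only the structure (column names and types) is captured here:
    // no format, no options, no path.
    val staticSchema = staticDataFrame.schema

    // Streaming reader: the schema is reused, but format, options and the
    // load path are independent of it and must be given again.
    val streamingDataFrame = spark.readStream
      .schema(staticSchema)
      .format("csv")
      .option("header", "true")
      .option("maxFilesPerTrigger", 1)
      .load("/data/retail-data/by-day/*.csv")

    println(streamingDataFrame.isStreaming) // prints true
  }
}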

I hope this clears things up,
Russ

On Sat, Mar 28, 2020, 12:33 AM Zahid Rahman  wrote:

> Hi,
> version: spark-3.0.0-preview2-bin-hadoop2.7
>
> As you can see from the code :
>
> STEP 1: I create an object of type static data frame which holds all the
> information about the data source (CSV files).
>
> STEP 2: Then I create a variable called staticSchema, assigning it the
> schema information from the original static data frame.
>
> STEP 3: Then I create another variable called streamingDataFrame of type
> spark.readStream, and into the .schema function's parameters I pass the
> object staticSchema, which is meant to hold the information about the CSV
> files, including the .load(path) function etc.
>
> So then, when I am creating val streamingDataFrame and passing it
> .schema(staticSchema), the variable streamingDataFrame should have all the
> information. I should only have to call .option("maxFilesPerTrigger",1) and not
> .format("csv") .option("header","true").load("/data/retail-data/by-day/*.csv").
> Otherwise, what is the point of passing .schema(staticSchema) to
> streamingDataFrame?
>
> You can replicate it using the complete code below.
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.{window,column,desc,col}
>
> object RetailData {
>
>   def main(args: Array[String]): Unit = {
>
> // create spark session
> val spark = 
> SparkSession.builder().master("spark://192.168.0.38:7077").appName("Retail 
> Data").getOrCreate();
> // set spark runtime  configuration
> spark.conf.set("spark.sql.shuffle.partitions","5")
> 
> spark.conf.set("spark.sql.streaming.forceDeleteTempCheckpointLocation","True")
>
> // create a static frame
>   val staticDataFrame = spark.read.format("csv")
> .option ("header","true")
> .option("inferschema","true")
> .load("/data/retail-data/by-day/*.csv")
>
>
> staticDataFrame.createOrReplaceTempView("retail_data")
> val staticSchema = staticDataFrame.schema
>
> staticDataFrame
>   .selectExpr(
> "CustomerId","UnitPrice * Quantity as total_cost", "InvoiceDate")
>   .groupBy(col("CustomerId"),
> window(col("InvoiceDate"),
> "1 day"))
>   .sum("total_cost")
>   .sort(desc("sum(total_cost)"))
>   .show(2)
>
> val streamingDataFrame = spark.readStream
>   .schema(staticSchema)
>   .format("csv")
>   .option("maxFilesPerTrigger", 1)
>   .option("header","true")
>   .load("/data/retail-data/by-day/*.csv")
>
>   println(streamingDataFrame.isStreaming)
>
> // lazy operation so we will need to call a streaming action to start the 
> action
> val purchaseByCustomerPerHour = streamingDataFrame
> .selectExpr(
>   "CustomerId",
>   "(UnitPrice * Quantity) as total_cost",
>   "InvoiceDate")
> .groupBy(
>   col("CustomerId"), window(col("InvoiceDate"), "1 day"))
> .sum("total_cost")
>
> // stream action to write to console
> purchaseByCustomerPerHour.writeStream
>   .format("console")
>   .queryName("customer_purchases")
>   .outputMode("complete")
>   .start()
>
>   } // main
>
> } // object
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>