Re: [DISCUSS] SPIP: Standardize SQL logical plans

2018-07-11 Thread Wenchen Fan
Hi Ryan,

Great job on this! Shall we call a vote for the plan standardization SPIP?
I think this is a good idea and we should do it.

Notes:
We definitely need new user-facing APIs to produce these new logical plans,
like DeleteData (a purely illustrative sketch follows after these notes),
but we need a design doc for those new APIs once the SPIP passes.
We definitely need the data source to provide the ability to
create/drop/alter/lookup tables, but that belongs to the other SPIP and
should be voted on separately.
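
For illustration only, here is a hypothetical Scala sketch (not the API the
SPIP proposes) of what a standardized delete node could look like as a
Catalyst logical plan:

  import org.apache.spark.sql.catalyst.expressions.Expression
  import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}

  // Hypothetical: a single standard plan node that every data source sees,
  // instead of each source inventing its own representation of a delete.
  case class DeleteData(
      table: LogicalPlan,     // the resolved relation to delete from
      condition: Expression)  // rows matching this predicate are removed
    extends Command {
    override def children: Seq[LogicalPlan] = Seq(table)
  }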

Thanks,
Wenchen

On Fri, Apr 20, 2018 at 5:01 AM Ryan Blue  wrote:

> Hi everyone,
>
> A few weeks ago, I wrote up a proposal to standardize SQL logical plans
> and a supporting design doc for data source catalog APIs. From the
> comments on those docs, it looks like we mostly have agreement around
> standardizing plans and around the data source catalog API.
>
> We still need to work out details, like the transactional API extension,
> but I'd like to get started implementing those proposals so we have
> something working for the 2.4.0 release. I'm starting this thread because
> I think we're about ready to vote on the proposal, and I'd like to get any
> remaining discussion going or get anyone that missed this to read through
> the docs.
>
> Thanks!
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Creating JDBC source table schema (DDL) dynamically

2018-07-11 Thread Kadam, Gangadhar (GE Aviation, Non-GE)
Hi All,

I am trying to build a Spark application which will read data from
PostgreSQL (source) in one environment and write it to PostgreSQL/Aurora
(target) in a different environment (e.g., PROD to QA, or QA to PROD)
using Spark JDBC.

When loading the DataFrame into the target DB, I would like to enforce the
same schema as the source table by defining

val targetTableSchema: String =
  """
|  operating_unit_nm character varying(20),
|  organization_id integer,
|  organization_cd character varying(30),
|  requesting_organization_id integer,
|  requesting_organization_cd character varying(50),
|  owning_organization_id integer,
|  owning_organization_cd character varying(50)
""".stripMargin


.option("createTableColumnTypes", targetTableSchema )

I would like to know if there is a way to create this targetTableSchema
(the source table DDL) directly from the source table, or from a CSV file;
I don't want Spark to apply its default type mappings. Given a table name,
how do I generate the DDL dynamically and pass it to the targetTableSchema
variable as a string?

Currently I am updating targetTableSchema manually and am looking for
pointers to automate it; something like the sketch at the end of this mail
is what I have in mind.


Below is my code

// Imports this snippet needs
import java.io.{File, FileInputStream}
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

// Define the parameters
val sourceDb: String = args(0)
val targetDb: String = args(1)
val sourceTable: String = args(2)
val targetTable: String = args(3)
val sourceEnv: String = args(4)
val targetEnv: String = args(5)

println("Arguments Provided: " + sourceDb, targetDb,sourceTable, targetTable, 
sourceEnv, targetEnv)

// Define the spark session
val spark: SparkSession = SparkSession
  .builder()
  .appName("Ca-Data-Transporter")
  .master("local")
  .config("driver", "org.postgresql.Driver")
  .getOrCreate()

// define the input directory
val inputDir: String =
  "/Users/gangadharkadam/projects/ca-spark-apps/src/main/resources/"

// Define the source DB properties
val sourceParmFile: String = if (sourceDb == "RDS") {
"rds-db-parms-" + sourceEnv + ".txt"
  }
  else if (sourceDb == "AURORA") {
"aws-db-parms-" + sourceEnv + ".txt"
  }
  else if (sourceDb == "GP") {
"gp-db-parms-" + sourceEnv + ".txt"
  }
  else "NA"

println(sourceParmFile)

val sourceDbParms: Properties = new Properties()
sourceDbParms.load(new FileInputStream(new File(inputDir + sourceParmFile)))
val sourceDbJdbcUrl: String = sourceDbParms.getProperty("jdbcUrl")

println(s"$sourceDb")
println(s"$sourceDbJdbcUrl")

// Define the target DB properties
val targetParmFile: String = if (targetDb == "RDS") {
s"rds-db-parms-" + targetEnv + ".txt"
  }
  else if (targetDb == "AURORA") {
s"aws-db-parms-" + targetEnv + ".txt"
  }
  else if (targetDb == "GP") {
s"gp-db-parms-" + targetEnv + ".txt"
  } else "aws-db-parms-$targetEnv.txt"

println(targetParmFile)

val targetDbParms: Properties = new Properties()
targetDbParms.load(new FileInputStream(new File(inputDir + targetParmFile)))
val targetDbJdbcUrl: String = targetDbParms.getProperty("jdbcUrl")

println(s"$targetDb")
println(s"$targetDbJdbcUrl")

// Read the source table as dataFrame
val sourceDF: DataFrame = spark
  .read
  // name all three arguments: a positional argument after named ones
  // does not compile in Scala
  .jdbc(url = sourceDbJdbcUrl, table = sourceTable, properties = sourceDbParms)
  //.filter("site_code is not null")

sourceDF.printSchema()
sourceDF.show()

val sourceDF1 = sourceDF.repartition(
  sourceDF("organization_id")
  //sourceDF("plan_id")
)


val targetTableSchema: String =
  """
|  operating_unit_nm character varying(20),
|  organization_id integer,
|  organization_cd character varying(30),
|  requesting_organization_id integer,
|  requesting_organization_cd character varying(50),
|  owning_organization_id integer,
|  owning_organization_cd character varying(50)
  """.stripMargin


// write the dataFrame
sourceDF1
  .write
  .option("createTableColumnTypes", targetTableSchema)
  .option("truncate", "true")
  .mode("overwrite")
  .jdbc(targetDbJdbcUrl, targetTable, targetDbParms)
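
For the automation piece, below is the kind of helper I am imagining: an
untested sketch that assumes a PostgreSQL source, reuses spark,
sourceDbJdbcUrl, and sourceDbParms from above, and treats the
information_schema query and variable names as illustrative only.

// Untested sketch: derive the createTableColumnTypes DDL string from the
// source table's definition in information_schema.columns (PostgreSQL).
val columnsQuery: String =
  s"""(SELECT column_name, data_type, character_maximum_length
     |    FROM information_schema.columns
     |    WHERE table_name = '$sourceTable'
     |    ORDER BY ordinal_position) AS source_columns""".stripMargin

val derivedTargetTableSchema: String = spark.read
  .jdbc(url = sourceDbJdbcUrl, table = columnsQuery, properties = sourceDbParms)
  .collect()
  .map { row =>
    val name  = row.getString(0)
    val dtype = row.getString(1)
    // character_maximum_length is NULL for types without a length
    if (row.isNullAt(2)) s"$name $dtype" else s"$name $dtype(${row.get(2)})"
  }
  .mkString(",\n")

println(derivedTargetTableSchema)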


Thanks!
Gangadhar Kadam
Sr. Data Engineer
M + 1 (401) 588 2269

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Saisai Shao
Hi Sean,

The docs for RC1 are not usable because of a Sphinx issue; they should be
rebuilt with python3 to avoid it. Also, there is one more blocking issue in
SQL, so I will wait for that before cutting a new RC.

On Thu, Jul 12, 2018 at 9:05 AM, Sean Owen  wrote:

> I guess my question is just whether the Python docs are usable or not in
> this RC. They looked reasonable to me but I don't know enough to know what
> the issue was. If the result is usable, then there's no problem here, even
> if something could be fixed/improved later.
>
> On Sun, Jul 8, 2018 at 7:25 PM Saisai Shao  wrote:
>
>> Hi Sean,
>>
>> SPARK-24530 is not included in this RC1 release. Actually I'm not so
>> familiar with this issue, so I'm still using python2 to generate the docs.
>>
>> The JIRA mentions that python3 with Sphinx can work around this issue.
>> @Hyukjin Kwon  would you please help to clarify?
>>
>> Thanks
>> Saisai
>>
>>
>> On Mon, Jul 9, 2018 at 1:59 AM, Xiao Li  wrote:
>>
>>> Three business days might be too short. Let's keep the vote open until
>>> the end of this Friday (July 13th)?
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> 2018-07-08 10:15 GMT-07:00 Sean Owen :
>>>
 Just checking that the doc issue in
 https://issues.apache.org/jira/browse/SPARK-24530 is worked around in
 this release?

 This was pointed out as an example of a broken doc:

 https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

 Here it is in 2.3.2 RC1:

 https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

 It wasn't immediately obvious to me whether this addressed the issue
 that was identified or not.


 Otherwise nothing is open for 2.3.2, sigs and license look good, tests
 pass as last time, etc.

 +1

 On Sun, Jul 8, 2018 at 3:30 AM Saisai Shao 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.3.2.
>
> The vote is open until July 11th PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.3.2-rc1
> (commit 4df06b45160241dbb331153efbb25703f913c192):
> https://github.com/apache/spark/tree/v2.3.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1277/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/
>
> The list of bug fixes going into 2.3.2 can be found at the following
> URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12343289
>
> PS. This is my first time doing a release, so please help check that
> everything has landed correctly. Thanks ^-^
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in Java/Scala you
> can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
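>
> For sbt users that could look something like this (a sketch; it assumes
> the staging artifacts are published as version 2.3.2, and the URL is the
> staging repository listed above):
>
>   // build.sbt: test your project against the 2.3.2 RC1 staging artifacts
>   resolvers += "Spark 2.3.2 RC1 staging" at
>     "https://repository.apache.org/content/repositories/orgapachespark-1277/"
>   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.2"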
>
> ===
> What should happen to JIRA tickets still targeting 2.3.2?
> ===
>
> The current list of open tickets targeted at 2.3.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.3.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted, please ping me or a committer to
> help target the issue.

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Sean Owen
I guess my question is just whether the Python docs are usable or not in
this RC. They looked reasonable to me but I don't know enough to know what
the issue was. If the result is usable, then there's no problem here, even
if something could be fixed/improved later.

On Sun, Jul 8, 2018 at 7:25 PM Saisai Shao  wrote:

> Hi Sean,
>
> SPARK-24530 is not included in this RC1 release. Actually I'm not so
> familiar with this issue, so I'm still using python2 to generate the docs.
>
> The JIRA mentions that python3 with Sphinx can work around this issue.
> @Hyukjin Kwon  would you please help to clarify?
>
> Thanks
> Saisai
>
>
> On Mon, Jul 9, 2018 at 1:59 AM, Xiao Li  wrote:
>
>> Three business days might be too short. Let's keep the vote open until
>> the end of this Friday (July 13th)?
>>
>> Cheers,
>>
>> Xiao
>>
>> 2018-07-08 10:15 GMT-07:00 Sean Owen :
>>
>>> Just checking that the doc issue in
>>> https://issues.apache.org/jira/browse/SPARK-24530 is worked around in
>>> this release?
>>>
>>> This was pointed out as an example of a broken doc:
>>>
>>> https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>
>>> Here it is in 2.3.2 RC1:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/_site/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
>>>
>>> It wasn't immediately obvious to me whether this addressed the issue
>>> that was identified or not.
>>>
>>>
>>> Otherwise nothing is open for 2.3.2, sigs and license look good, tests
>>> pass as last time, etc.
>>>
>>> +1
>>>
>>> On Sun, Jul 8, 2018 at 3:30 AM Saisai Shao 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.2.

 The vote is open until July 11th PST and passes if a majority +1 PMC
 votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.3.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.3.2-rc1
 (commit 4df06b45160241dbb331153efbb25703f913c192):
 https://github.com/apache/spark/tree/v2.3.2-rc1

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1277/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/

 The list of bug fixes going into 2.3.2 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12343289

 PS. This is my first time doing a release, so please help check that
 everything has landed correctly. Thanks ^-^

 FAQ

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks; in Java/Scala you
 can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.2?
 ===

 The current list of open tickets targeted at 2.3.2 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 2.3.2

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted, please ping me or a committer to
 help target the issue.

>>>
>>


Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Yanbo Liang
+1

On Tue, Jul 10, 2018 at 10:15 PM Saisai Shao  wrote:

> https://issues.apache.org/jira/browse/SPARK-24530 has just been merged; I
> will cancel this vote and prepare a new RC2 cut with the docs fixed.
>
> Thanks
> Saisai
>
> On Wed, Jul 11, 2018 at 12:25 PM, Wenchen Fan  wrote:
>
>> +1
>>
>> On Wed, Jul 11, 2018 at 1:31 AM John Zhuge  wrote:
>>
>>> +1
>>>
>>> On Sun, Jul 8, 2018 at 1:30 AM Saisai Shao 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.2.

 The vote is open until July 11th PST and passes if a majority +1 PMC
 votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 2.3.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v2.3.2-rc1
 (commit 4df06b45160241dbb331153efbb25703f913c192):
 https://github.com/apache/spark/tree/v2.3.2-rc1

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1277/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc1-docs/

 The list of bug fixes going into 2.3.2 can be found at the following
 URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12343289

 PS. This is my first time doing a release, so please help check that
 everything has landed correctly. Thanks ^-^

 FAQ

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks; in Java/Scala you
 can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.2?
 ===

 The current list of open tickets targeted at 2.3.2 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 2.3.2

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a regression
 that has not been correctly targeted, please ping me or a committer to
 help target the issue.

>>>
>>>
>>> --
>>> John
>>>
>>


Re: [build system] ubuntu workers temporarily offline

2018-07-11 Thread shane knapp
ok, things seem much happier now.

On Wed, Jul 11, 2018 at 8:57 PM, shane knapp  wrote:

> i'm seeing some strange docker/minikube errors, so i'm currently rebooting
> the boxes.  when they're back up, i will retrigger any killed builds and
> send an all-clear.
>
> On Wed, Jul 11, 2018 at 7:40 PM, shane knapp  wrote:
>
>> done, and the workers are back online.
>>
>> $ pssh -h ubuntu_workers.txt -i "minikube version"
>> [1] 12:37:23 [SUCCESS] amp-jenkins-staging-worker-01.amp
>> minikube version: v0.28.0
>> [2] 12:37:24 [SUCCESS] amp-jenkins-staging-worker-02.amp
>> minikube version: v0.28.0
>>
>>
>>
>> On Wed, Jul 11, 2018 at 7:34 PM, shane knapp  wrote:
>>
>>> i'll be taking amp-jenkins-staging-worker-0{1,2} offline to upgrade
>>> minikube to v0.28.0.
>>>
>>> this is currently blocking:  https://github.com/apache/spark/pull/21583
>>>
>>> this should be a relatively short downtime, and i'll reply back here
>>> when it's done.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] ubuntu workers temporarily offline

2018-07-11 Thread shane knapp
i'm seeing some strange docker/minikube errors, so i'm currently rebooting
the boxes.  when they're back up, i will retrigger any killed builds and
send an all-clear.

On Wed, Jul 11, 2018 at 7:40 PM, shane knapp  wrote:

> done, and the workers are back online.
>
> $ pssh -h ubuntu_workers.txt -i "minikube version"
> [1] 12:37:23 [SUCCESS] amp-jenkins-staging-worker-01.amp
> minikube version: v0.28.0
> [2] 12:37:24 [SUCCESS] amp-jenkins-staging-worker-02.amp
> minikube version: v0.28.0
>
>
>
> On Wed, Jul 11, 2018 at 7:34 PM, shane knapp  wrote:
>
>> i'll be taking amp-jenkins-staging-worker-0{1,2} offline to upgrade
>> minikube to v0.28.0.
>>
>> this is currently blocking:  https://github.com/apache/spark/pull/21583
>>
>> this should be a relatively short downtime, and i'll reply back here when
>> it's done.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


CVE-2018-1334 Apache Spark local privilege escalation vulnerability

2018-07-11 Thread Sean Owen
Severity: High

Vendor: The Apache Software Foundation

Versions affected:
Spark versions through 2.1.2
Spark 2.2.0 to 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when
using PySpark or SparkR, it's possible for a different local user to
connect to the Spark application and impersonate the user running the Spark
application.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer
Otherwise, affected users should avoid using PySpark and SparkR in
multi-user environments.

Credit:
Nehmé Tohmé, Cloudera, Inc.

References:
https://spark.apache.org/security.html


CVE-2018-8024 Apache Spark XSS vulnerability in UI

2018-07-11 Thread Sean Owen
Severity: Medium

Vendor: The Apache Software Foundation

Versions Affected:
Spark versions through 2.1.2
Spark 2.2.0 through 2.2.1
Spark 2.3.0

Description:
In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's
possible for a malicious user to construct a URL pointing to a Spark
cluster UI's job and stage info pages. If a user can be tricked into
accessing the URL, it can be used to execute script and expose information
from that user's view of the Spark UI. While some browsers, like recent
versions of Chrome and Safari, are able to block this type of attack,
current versions of Firefox (and possibly others) do not.

Mitigation:
1.x, 2.0.x, and 2.1.x users should upgrade to 2.1.3 or newer
2.2.x users should upgrade to 2.2.2 or newer
2.3.x users should upgrade to 2.3.1 or newer

Credit:
Spencer Gietzen, Rhino Security Labs

References:
https://spark.apache.org/security.html


Re: [build system] ubuntu workers temporarily offline

2018-07-11 Thread shane knapp
done, and the workers are back online.

$ pssh -h ubuntu_workers.txt -i "minikube version"
[1] 12:37:23 [SUCCESS] amp-jenkins-staging-worker-01.amp
minikube version: v0.28.0
[2] 12:37:24 [SUCCESS] amp-jenkins-staging-worker-02.amp
minikube version: v0.28.0



On Wed, Jul 11, 2018 at 7:34 PM, shane knapp  wrote:

> i'll be taking amp-jenkins-staging-worker-0{1,2} offline to upgrade
> minikube to v0.28.0.
>
> this is currently blocking:  https://github.com/apache/spark/pull/21583
>
> this should be a relatively short downtime, and i'll reply back here when
> it's done.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] ubuntu workers temporarily offline

2018-07-11 Thread shane knapp
i'll be taking amp-jenkins-staging-worker-0{1,2} offline to upgrade
minikube to v0.28.0.

this is currently blocking:  https://github.com/apache/spark/pull/21583

this should be a relatively short downtime, and i'll reply back here when
it's done.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu