[FYI] Spark 2.2 on spark2:2.6-maint

2017-08-23 Thread Dong Joon Hyun
Hi, All.

It seems that Bikas is too busy to announce this.


  1.  Spark2:2.6-maint became Spark 2.2 this morning. Thank you all.
      According to Weiqing, RE will update the version as soon as possible. So far,
      it's 2.2.0 instead of 2.2.0-2.6.3.0-XX.

  2.  The existing Jenkins job works seamlessly (Build #40)
      ==> 13,646 tests (0 failures, 693 skipped, 3 hr 13 min.)

  3.  HDI Spark 2.2 Preview will be shipped next week.
      ==> The branch is still locked.

Bests,
Dongjoon.


Re: Increase Timeout or optimize Spark UT?

2017-08-20 Thread Dong Joon Hyun
+1 for any efforts to recover Jenkins!

Thank you for the direction.

Bests,
Dongjoon.

From: Reynold Xin 
Date: Sunday, August 20, 2017 at 5:53 PM
To: Dong Joon Hyun 
Cc: "dev@spark.apache.org" 
Subject: Re: Increase Timeout or optimize Spark UT?

It seems like it's time to look into how to cut down some of the test runtimes. 
Test runtimes will slowly go up given the way development happens. 3 hr is 
already a very long time for tests to run.


On Sun, Aug 20, 2017 at 5:45 PM, Dong Joon Hyun 
mailto:dh...@hortonworks.com>> wrote:
Hi, All.

Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has been 
hitting the build timeout.

Please see the build time trend.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend

All of the recent 22 builds failed due to the timeout, directly or indirectly. The last
success (SBT with Hadoop-2.7) was on 15th August.

We could do one of the following.


  1.  Increase Build Timeout (3 hr 30 min)
  2.  Optimize UTs (Scala/Java/Python/UT)

But Option 1 would be the immediate solution for now. Could you update the
Jenkins setup?

Bests,
Dongjoon.



Increase Timeout or optimize Spark UT?

2017-08-20 Thread Dong Joon Hyun
Hi, All.

Recently, Apache Spark master branch test (SBT with hadoop-2.7 / 2.6) has been 
hitting the build timeout.

Please see the build time trend.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/buildTimeTrend

All of the recent 22 builds failed due to the timeout, directly or indirectly. The last
success (SBT with Hadoop-2.7) was on 15th August.

We could do one of the following.


  1.  Increase Build Timeout (3 hr 30 min)
  2.  Optimize UTs (Scala/Java/Python/UT)

But Option 1 would be the immediate solution for now. Could you update the
Jenkins setup?

Bests,
Dongjoon.


Re: spark pypy support?

2017-08-14 Thread Dong Joon Hyun
Hi, Tom.

What version of PyPy do you use?

In the Jenkins environment, the `pypy` tests always pass, just like Python 2.7 and Python 3.4.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3340/consoleFull


Running PySpark tests

Running PySpark tests. Output is in 
/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.4', 
'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Starting test(python2.7): pyspark.mllib.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(pypy): pyspark.streaming.tests
Finished test(pypy): pyspark.tests (181s)
…

Tests passed in 1130 seconds


Bests,
Dongjoon.


From: Tom Graves 
Date: Monday, August 14, 2017 at 1:55 PM
To: "dev@spark.apache.org" 
Subject: spark pypy support?

Anyone know if PyPy works with Spark? I saw a JIRA saying it was supported back in
Spark 1.2, but I'm getting an error when trying it and I'm not sure if it's something with
my PyPy version or just something Spark doesn't support.


AttributeError: 'builtin-code' object has no attribute 'co_filename'
Traceback (most recent call last):
  File "/app_main.py", line 75, in run_toplevel
  File "/homes/tgraves/mbe.py", line 40, in 
count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py", line 834, 
in reduce
vals = self.mapPartitions(func).collect()
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py", line 808, 
in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py", line 
2440, in _jrdd
self._jrdd_deserializer, profiler)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py", line 
2373, in _wrap_function
pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/rdd.py", line 
2359, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/serializers.py", 
line 460, in dumps
return cloudpickle.dumps(obj, 2)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py", 
line 703, in dumps
cp.dump(obj)
  File "/home/gs/spark/latest/python/lib/pyspark.zip/pyspark/cloudpickle.py", 
line 160, in dump

Thanks,
Tom


Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Dong Joon Hyun
Thank you, Andrew and Reynold.

Yes, it will reduce the old Hive dependency eventually, at least for the ORC code.

And Spark without `-Phive` will be able to use ORC just like Parquet.

This is one milestone for `Feature parity for ORC with Parquet (SPARK-20901)`.

Bests,
Dongjoon

From: Reynold Xin 
Date: Thursday, August 10, 2017 at 3:23 PM
To: Andrew Ash 
Cc: Dong Joon Hyun , "dev@spark.apache.org" 
, Apache Spark PMC 
Subject: Re: Use Apache ORC in Apache Spark 2.3

Do you not use the catalog?


On Thu, Aug 10, 2017 at 3:22 PM, Andrew Ash 
mailto:and...@andrewash.com>> wrote:
I would support moving ORC from sql/hive -> sql/core because it brings me one 
step closer to eliminating Hive from my Spark distribution by removing -Phive 
at build time.

On Thu, Aug 10, 2017 at 9:48 AM, Dong Joon Hyun 
mailto:dh...@hortonworks.com>> wrote:
Thank you again for coming and reviewing this PR.

So far, we have discussed the following.

1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
   - The `sql/core` module gives more benefit than `sql/hive`.
   - The Apache ORC library (`no-hive` version) is a general and reasonably small
     library designed for non-Hive apps.

2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
   - The previous #17980, #17924, and #17943 are complete examples containing this PR.
   - This PR focuses on the dependency only.

3. `Why don't we then create a separate orc module? Just copy a few of the 
files over?` (@rxin)
   - The Apache ORC library is in the same position as most of the other data sources (CSV, JDBC,
     JSON, Parquet, text) which live inside `sql/core`.
   - It's better to use it as a library instead of copying the ORC files, because the
     Apache ORC shaded jar has many files. We had better depend on the Apache ORC
     community's effort until an unavoidable reason for copying occurs.

4. `I do worry in the future whether ORC would bring in a lot more jars` (@rxin)
   - The ORC core library's dependency tree is aggressively kept as small as 
possible. I've gone through and excluded unnecessary jars from our 
dependencies. I also kick back pull requests that add unnecessary new 
dependencies. (@omalley)

5. `In the long term, Spark should move to using only the vectorized reader in
ORC's core` (@omalley)
   - Of course.

I’ve been waiting for new comments and discussion since last week.
Apparently, there have been no further comments this week except the last one (5)
from Owen.

Please give your opinion if you think we need some change on the current PR 
(as-is).
FYI, there is one LGTM on the PR (as-is) and no -1 so far.

Thank you again for supporting new ORC improvement in Apache Spark.

Bests,
Dongjoon.


From: Dong Joon Hyun mailto:dh...@hortonworks.com>>
Date: Friday, August 4, 2017 at 8:05 AM
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
mailto:dev@spark.apache.org>>
Cc: Apache Spark PMC mailto:priv...@spark.apache.org>>
Subject: Use Apache ORC in Apache Spark 2.3

Hi, All.

Apache Spark always has been a fast and general engine, and
supports Apache ORC inside `sql/hive` module with Hive dependency since Spark 
1.4.X (SPARK-2883).
However, there are many open issues about `Feature parity for ORC with Parquet 
(SPARK-20901)` as of today.

With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get the 
following benefits.

- Usability:
* Users can use `ORC` data sources without the hive module (-Phive), just like the
`Parquet` format (see the short sketch after this list).

- Stability & Maintanability:
* ORC 1.4 already has many fixes.
* In the future, Spark can upgrade ORC library independently from Hive
   (similar to Parquet library, too)
* Eventually, reduce the dependecy on old Hive 1.2.1.

- Speed:
* Last but not least, Spark can use both Spark `ColumnarBatch` and ORC 
`RowBatch` together
  which means full vectorization support.
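
For readers skimming the thread, here is a minimal sketch of the DataFrame-level usage the
usability point above targets (the path and toy schema are made up; in released Spark at this
time, reading ORC this way still requires a Hive-enabled build, which is exactly what this
proposal wants to remove):

// Minimal sketch: ORC through the DataFrame API, with the same call shape as Parquet.
// The path and columns are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-parity-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.write.mode("overwrite").orc("/tmp/orc-parity-sketch")   // analogous to df.write.parquet(...)

val readBack = spark.read.orc("/tmp/orc-parity-sketch")
readBack.show()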

First of all, I'd love to improve Apache Spark in the following steps in the 
time frame of Spark 2.3.

- SPARK-21422: Depend on Apache ORC 1.4.0
- SPARK-20682: Add a new faster ORC data source based on Apache ORC
- SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
- SPARK-16060: Vectorized Orc Reader

I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
but the PRs seem to need more attention from the PMC since this is an important
change.
Since the discussion on the Apache Spark 2.3 cadence already started this week,
I thought it was the best time to ask you about this.

Could any of you help me move the ORC improvement forward in the Apache Spark
community?

Please visit the minimal PR and JIRA issue as a starter.


  *   https://github.com/apache/spark/pull/18640
  *   https://issues.apache.org/jira/browse/SPARK-21422

Thank you in advance.

Bests,
Dongjoon Hyun.




Re: Use Apache ORC in Apache Spark 2.3

2017-08-10 Thread Dong Joon Hyun
Thank you again for coming and reviewing this PR.

So far, we have discussed the following.

1. `Why are we adding this to core? Why not just the hive module?` (@rxin)
   - The `sql/core` module gives more benefit than `sql/hive`.
   - The Apache ORC library (`no-hive` version) is a general and reasonably small
     library designed for non-Hive apps.

2. `Can we add smaller amount of new code to use this, too?` (@kiszk)
   - The previous #17980, #17924, and #17943 are complete examples containing this PR.
   - This PR focuses on the dependency only.

3. `Why don't we then create a separate orc module? Just copy a few of the 
files over?` (@rxin)
   - The Apache ORC library is in the same position as most of the other data sources (CSV, JDBC,
     JSON, Parquet, text) which live inside `sql/core`.
   - It's better to use it as a library instead of copying the ORC files, because the
     Apache ORC shaded jar has many files. We had better depend on the Apache ORC
     community's effort until an unavoidable reason for copying occurs.

4. `I do worry in the future whether ORC would bring in a lot more jars` (@rxin)
   - The ORC core library's dependency tree is aggressively kept as small as 
possible. I've gone through and excluded unnecessary jars from our 
dependencies. I also kick back pull requests that add unnecessary new 
dependencies. (@omalley)

5. `In the long term, Spark should move to using only the vectorized reader in
ORC's core` (@omalley)
   - Of course.

I’ve been waiting for new comments and discussion since last week.
Apparently, there have been no further comments this week except the last one (5)
from Owen.

Please give your opinion if you think we need some change on the current PR 
(as-is).
FYI, there is one LGTM on the PR (as-is) and no -1 so far.

Thank you again for supporting new ORC improvement in Apache Spark.

Bests,
Dongjoon.


From: Dong Joon Hyun 
Date: Friday, August 4, 2017 at 8:05 AM
To: "dev@spark.apache.org" 
Cc: Apache Spark PMC 
Subject: Use Apache ORC in Apache Spark 2.3

Hi, All.

Apache Spark always has been a fast and general engine, and
supports Apache ORC inside `sql/hive` module with Hive dependency since Spark 
1.4.X (SPARK-2883).
However, there are many open issues about `Feature parity for ORC with Parquet 
(SPARK-20901)` as of today.

With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get the 
following benefits.

- Usability:
* Users can use `ORC` data sources without hive module (-Phive) like 
`Parquet` format.

- Stability & Maintanability:
* ORC 1.4 already has many fixes.
* In the future, Spark can upgrade ORC library independently from Hive
   (similar to Parquet library, too)
* Eventually, reduce the dependecy on old Hive 1.2.1.

- Speed:
* Last but not least, Spark can use both Spark `ColumnarBatch` and ORC 
`RowBatch` together
  which means full vectorization support.

First of all, I'd love to improve Apache Spark in the following steps in the 
time frame of Spark 2.3.

- SPARK-21422: Depend on Apache ORC 1.4.0
- SPARK-20682: Add a new faster ORC data source based on Apache ORC
- SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
- SPARK-16060: Vectorized Orc Reader

I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
but the PRs seem to need more attention from the PMC since this is an important
change.
Since the discussion on the Apache Spark 2.3 cadence already started this week,
I thought it was the best time to ask you about this.

Could any of you help me move the ORC improvement forward in the Apache Spark
community?

Please visit the minimal PR and JIRA issue as a starter.


  *   https://github.com/apache/spark/pull/18640
  *   https://issues.apache.org/jira/browse/SPARK-21422

Thank you in advance.

Bests,
Dongjoon Hyun.


Re: Use Apache ORC in Apache Spark 2.3

2017-08-04 Thread Dong Joon Hyun
Thank you so much, Owen!

Bests,
Dongjoon.


From: Owen O'Malley 
Date: Friday, August 4, 2017 at 9:59 AM
To: Dong Joon Hyun 
Cc: "dev@spark.apache.org" , Apache Spark PMC 

Subject: Re: Use Apache ORC in Apache Spark 2.3

The ORC community is really eager to get this work integrated in to Spark so 
that Spark users can have fast access to their ORC data. Let us know if we can 
help the integration.

Thanks,
   Owen

On Fri, Aug 4, 2017 at 8:05 AM, Dong Joon Hyun 
mailto:dh...@hortonworks.com>> wrote:
Hi, All.

Apache Spark always has been a fast and general engine, and
supports Apache ORC inside `sql/hive` module with Hive dependency since Spark 
1.4.X (SPARK-2883).
However, there are many open issues about `Feature parity for ORC with Parquet 
(SPARK-20901)` as of today.

With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get the 
following benefits.

- Usability:
* Users can use `ORC` data sources without hive module (-Phive) like 
`Parquet` format.

- Stability & Maintanability:
* ORC 1.4 already has many fixes.
* In the future, Spark can upgrade ORC library independently from Hive
   (similar to Parquet library, too)
* Eventually, reduce the dependecy on old Hive 1.2.1.

- Speed:
* Last but not least, Spark can use both Spark `ColumnarBatch` and ORC 
`RowBatch` together
  which means full vectorization support.

First of all, I'd love to improve Apache Spark in the following steps in the 
time frame of Spark 2.3.

- SPARK-21422: Depend on Apache ORC 1.4.0
- SPARK-20682: Add a new faster ORC data source based on Apache ORC
- SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
- SPARK-16060: Vectorized Orc Reader

I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
but the PRs seem to need more attention from the PMC since this is an important
change.
Since the discussion on the Apache Spark 2.3 cadence already started this week,
I thought it was the best time to ask you about this.

Could any of you help me move the ORC improvement forward in the Apache Spark
community?

Please visit the minimal PR and JIRA issue as a starter.


  *   https://github.com/apache/spark/pull/18640
  *   https://issues.apache.org/jira/browse/SPARK-21422

Thank you in advance.

Bests,
Dongjoon Hyun.



Use Apache ORC in Apache Spark 2.3

2017-08-04 Thread Dong Joon Hyun
Hi, All.

Apache Spark always has been a fast and general engine, and
supports Apache ORC inside `sql/hive` module with Hive dependency since Spark 
1.4.X (SPARK-2883).
However, there are many open issues about `Feature parity for ORC with Parquet 
(SPARK-20901)` as of today.

With new Apache ORC 1.4 (released 8th May), Apache Spark is able to get the 
following benefits.

- Usability:
* Users can use `ORC` data sources without hive module (-Phive) like 
`Parquet` format.

- Stability & Maintanability:
* ORC 1.4 already has many fixes.
* In the future, Spark can upgrade ORC library independently from Hive
   (similar to Parquet library, too)
* Eventually, reduce the dependecy on old Hive 1.2.1.

- Speed:
* Last but not least, Spark can use both Spark `ColumnarBatch` and ORC 
`RowBatch` together
  which means full vectorization support.

First of all, I'd love to improve Apache Spark in the following steps in the 
time frame of Spark 2.3.

- SPARK-21422: Depend on Apache ORC 1.4.0
- SPARK-20682: Add a new faster ORC data source based on Apache ORC
- SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
- SPARK-16060: Vectorized Orc Reader

I’ve made the above PRs since 9th May, the day after the Apache ORC 1.4 release,
but the PRs seem to need more attention from the PMC since this is an important
change.
Since the discussion on the Apache Spark 2.3 cadence already started this week,
I thought it was the best time to ask you about this.

Could any of you help me move the ORC improvement forward in the Apache Spark
community?

Please visit the minimal PR and JIRA issue as a starter.


  *   https://github.com/apache/spark/pull/18640
  *   https://issues.apache.org/jira/browse/SPARK-21422

Thank you in advance.

Bests,
Dongjoon Hyun.


Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

2017-08-01 Thread Dong Joon Hyun
+1 (non-binding)

Dongjoon.

From: Ryan Blue 
Reply-To: "rb...@netflix.com" 
Date: Tuesday, August 1, 2017 at 9:06 AM
To: Tom Graves 
Cc: Marcelo Vanzin , "dev@spark.apache.org" 

Subject: Re: [VOTE] [SPIP] SPARK-18085: Better History Server scalability

+1 (non-binding)

On Tue, Aug 1, 2017 at 6:48 AM, Tom Graves 
mailto:tgraves...@yahoo.com.invalid>> wrote:
+1.


Tom



On Monday, July 31, 2017, 12:28:02 PM CDT, Marcelo Vanzin 
mailto:van...@cloudera.com>> wrote:


Hey all,

Following the SPIP process, I'm putting this SPIP up for a vote. It's
been open for comments as an SPIP for about 3 weeks now, and had been
open without the SPIP label for about 9 months before that. There has
been no new feedback since it was tagged as an SPIP, so I'm assuming
all the people who looked at it are OK with the current proposal.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following
technical reasons.

Thanks!

--
Marcelo





--
Ryan Blue
Software Engineer
Netflix


Re: Tests failing with run-tests.py SyntaxError

2017-07-28 Thread Dong Joon Hyun
I saw that error in the latest branch-2.1 build failure, too.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.7/579/console

But, the code was written in Jan 2016. Didn’t we run it on Python 2.6 without 
any problem?

ee74498de37 (Josh Rosen   2016-01-26 14:20:11 -0800 124) {m: 
set(m.dependencies).intersection(modules_to_test) for m in modules_to_test}, 
sort=True)

Bests,
Dongjoon.

From: Hyukjin Kwon 
Date: Friday, July 28, 2017 at 7:06 AM
To: Sean Owen 
Cc: dev 
Subject: Re: Tests failing with run-tests.py SyntaxError

Yes, that's my guess just given information here without a close look.

On 28 Jul 2017 11:03 pm, "Sean Owen" 
mailto:so...@cloudera.com>> wrote:
I see, does that suggest that a machine has 2.6, when it should use 2.7?

On Fri, Jul 28, 2017 at 2:58 PM Hyukjin Kwon 
mailto:gurwls...@gmail.com>> wrote:
That is apparently due to a dict comprehension, which, IIRC, is not allowed in
Python 2.6.x. I checked the release note before to be sure:
https://issues.apache.org/jira/browse/SPARK-20149

On 28 Jul 2017 9:56 pm, "Sean Owen" 
mailto:so...@cloudera.com>> wrote:

  File "./dev/run-tests.py", line 124

{m: set(m.dependencies).intersection(modules_to_test) for m in 
modules_to_test}, sort=True)

^

SyntaxError: invalid syntax

It seems like tests are failing intermittently with this type of error, which 
leads me to guess there's some difference in the Python interpreter on one or 
more machines but not all (?)

Does the error suggest anything to anyone who knows Python better than I?

The line has been around for a year so I don't think it's the script per se.



Re: Faster Spark on ORC with Apache ORC

2017-07-11 Thread Dong Joon Hyun
Hi, All.

Since Apache Spark 2.2 vote passed successfully last week,
I think it’s a good time for me to ask your opinions again about the following 
PR.

https://github.com/apache/spark/pull/17980  (+3,887, −86)

It’s for the following issues.


  *   SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
  *   SPARK-20682: Support a new faster ORC data source based on Apache ORC

Basically, the approach is trying to use the latest Apache ORC 1.4.0 officially.
You can switch between the legacy ORC data source and the new ORC data source.

Could you help me to progress this in order to improve Apache Spark 2.3?

Bests,
Dongjoon.

From: Dong Joon Hyun 
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org" 
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark always has been a fast and general engine, and
since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with Hive 
dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and 
get some benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together which 
means full vectorization support.

- Stability: Apache ORC 1.4.0 already has many fixes and we can depend on 
ORC community effort in the future.

- Usability: Users can use `ORC` data sources without hive module (-Phive)

- Maintainability: Reduce the Hive dependency and eventually remove some 
old legacy code from `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into `sql/core` 
module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.


Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-05 Thread Dong Joon Hyun
+1 (non binding)

Bests,
Dongjoon.


From:  on behalf of Holden Karau 
Date: Wednesday, July 5, 2017 at 10:14 PM
To: Felix Cheung 
Cc: Denny Lee , Liang-Chi Hsieh , 
"dev@spark.apache.org" 
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC6)

+1 PySpark package pip installs into a virtualenv.

On Wed, Jul 5, 2017 at 9:56 PM, Felix Cheung 
mailto:felixcheun...@hotmail.com>> wrote:
+1 (non binding)
Tested R, R package on Ubuntu and Windows, CRAN checks, manual tests with 
steaming & udf.


_
From: Denny Lee mailto:denny.g@gmail.com>>
Sent: Monday, July 3, 2017 9:30 PM
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC6)
To: Liang-Chi Hsieh mailto:vii...@gmail.com>>, 
mailto:dev@spark.apache.org>>


+1 (non-binding)

On Mon, Jul 3, 2017 at 6:45 PM Liang-Chi Hsieh 
mailto:vii...@gmail.com>> wrote:
+1


Sameer Agarwal wrote
> +1
>
> On Mon, Jul 3, 2017 at 6:08 AM, Wenchen Fan <cloud0fan@…> wrote:
>
> +1
>
> On 3 Jul 2017, at 8:22 PM, Nick Pentreath <nick.pentreath@…> wrote:
>
> +1 (binding)
>
> On Mon, 3 Jul 2017 at 11:53 Yanbo Liang <ybliang8@…> wrote:
>
> +1
>
> On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier <hvanhovell@…> wrote:
>
> +1
>
> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida <ricardo.almeida@…> wrote:
>
> +1 (non-binding)
>
> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn
> -Phive -Phive-thriftserver -Pscala-2.11 on
>
>- macOS 10.12.5 Java 8 (build 1.8.0_131)
>- Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
>
>
>
>
>
> On 1 Jul 2017 02:45, "Michael Armbrust" <michael@…> wrote:
>
> Please vote on releasing the following candidate as Apache Spark
> version 2.2.0. The vote is open until Friday, July 7th, 2017 at 18:00
> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.2.0-rc6
> (a2c7b2133cfee7fa9abfaa2bfbfb637155466783)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found
> at:
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1245/
>
> The documentation corresponding to this release can be found at:
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc6-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking
> an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be
> worked on immediately. Everything else please retarget to 2.3.0 or
> 2.2.1.
>
> *But my bug isn't fixed!??!*
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.1.1.
>
>
>


> --
> Sameer Agarwal
> Software Engineer | Databricks Inc.
> http://cs.berkeley.edu/~sameerag





-
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/





--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Dong Joon Hyun
Hi, Nick.

Could you give us more information on your environment like R/JDK/OS?

Bests,
Dongjoon.

From: Nick Pentreath 
Date: Friday, June 9, 2017 at 1:12 AM
To: dev 
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)

All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R 
it seems).

However, I'm seeing the following test failure on R consistently: 
https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72


On Thu, 8 Jun 2017 at 08:48 Denny Lee 
mailto:denny.g@gmail.com>> wrote:
+1 non-binding

Tested on macOS Sierra, Ubuntu 16.04
test suite includes various test cases including Spark SQL, ML, GraphFrames, 
Structured Streaming


On Wed, Jun 7, 2017 at 9:40 PM vaquar khan 
mailto:vaquar.k...@gmail.com>> wrote:
+1 non-binding

Regards,
vaquar khan

On Jun 7, 2017 4:32 PM, "Ricardo Almeida" 
mailto:ricardo.alme...@actnowib.com>> wrote:
+1 (non-binding)

Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn -Phive 
-Phive-thriftserver -Pscala-2.11 on

  *   Ubuntu 17.04, Java 8 (OpenJDK 1.8.0_111)
  *   macOS 10.12.5 Java 8 (build 1.8.0_131)

On 5 June 2017 at 21:14, Michael Armbrust 
mailto:mich...@databricks.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.2.0-rc4 
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1.




Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Dong Joon Hyun
+1 (non-binding)

I built and tested on CentOS 7.3.1611 / OpenJDK 1.8.131 / R 3.3.3
with “-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr”.
Java/Scala/R tests passed as expected.

There are two minor things.


  1.  For the deprecation documentation issue
(https://github.com/apache/spark/pull/18207),
I hope it goes into the `Release Notes` instead of blocking the current vote.

Something like `http://spark.apache.org/releases/spark-release-2-1-0.html`.


  2.  3rd-party test suites may fail due to the following difference.
Previously, until Spark 2.1.1, the count was ‘1’.
It is https://issues.apache.org/jira/browse/SPARK-20954 .

scala> sql("create table t(a int)")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("desc table t").show
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|# col_name|data_type|comment|
|         a|      int|   null|
+----------+---------+-------+

scala> sql("desc table t").count
res2: Long = 2

Bests,
Dongjoon.




From: Michael Armbrust 
Date: Monday, June 5, 2017 at 12:14 PM
To: "dev@spark.apache.org" 
Subject: [VOTE] Apache Spark 2.2.0 (RC4)

Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.2.0-rc4 
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1.


Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-05 Thread Dong Joon Hyun
Hi, Michael.

Can we be more clear on deprecation messages in 2.2.0-RC4 documentation?

> Spark runs on Java 8+, Python 2.6+/3.4+ and R 3.1+.
-> Python 2.7+ ?
https://issues.apache.org/jira/browse/SPARK-12661  (Status: `Open`, Target 
Version: `2.2.0`, Label: `ReleaseNotes`)

> Note that support for Python 2.6 is deprecated as of Spark 2.0.0, and support 
> for Scala 2.10 and versions of Hadoop before 2.6 are deprecated as of Spark 
> 2.1.0, and may be removed in Spark 2.2.0.
-> Support for versions of Hadoop before 2.6.5 is removed as of 2.2.0.
-> Support for Scala 2.10 may be removed in Spark 2.3.0.

Since this is a doc only issue, can we revise this without affecting the RC4 
vote?

I created a PR for this, https://github.com/apache/spark/pull/18207.

Bests,
Dongjoon.


From: Michael Armbrust 
Date: Monday, June 5, 2017 at 12:51 PM
To: Sean Owen 
Cc: "dev@spark.apache.org" 
Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4)

I commented on that JIRA, I don't think that should block the release.  We can 
support both options long term if this vote passes.  Looks like the remaining 
JIRAs are doc/website updates that can happen after the vote or QA that should 
be done on this RC.  I think we are ready to start testing this release 
seriously!

On Mon, Jun 5, 2017 at 12:40 PM, Sean Owen 
mailto:so...@cloudera.com>> wrote:
Xiao opened a blocker on 2.2.0 this morning:

SPARK-20980 Rename the option `wholeFile` to `multiLine` for JSON and CSV

I don't see that this should block?

We still have 7 Critical issues:

SPARK-20520 R streaming tests failed on Windows
SPARK-20512 SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
SPARK-20499 Spark MLlib, GraphX 2.2 QA umbrella
SPARK-20508 Spark R 2.2 QA umbrella
SPARK-20513 Update SparkR website for 2.2
SPARK-20510 SparkR 2.2 QA: Update user guide for new features & APIs
SPARK-20507 Update MLlib, GraphX websites for 2.2

I'm going to assume that the R test issue isn't actually that big a deal, and 
that the 2.2 items are done. Anything that really is for 2.2 needs to block the 
release; Joseph what's the status on those?

On Mon, Jun 5, 2017 at 8:15 PM Michael Armbrust 
mailto:mich...@databricks.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.2.0. 
The vote is open until Thurs, June 8th, 2017 at 12:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.2.0-rc4 
(377cfa8ac7ff7a8a6a6d273182e18ea7dc25ce7e)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1241/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.1.



Re: Spark Issues on ORC

2017-06-02 Thread Dong Joon Hyun
Thank you for confirming, Steve.

I removes the dependency of SPARK-20799 on SPARK-20901.

Bests,
Dongjoon.

From: Steve Loughran 
Date: Friday, June 2, 2017 at 4:42 AM
To: Dong Joon Hyun 
Cc: Apache Spark Dev 
Subject: Re: Spark Issues on ORC


On 26 May 2017, at 19:02, Dong Joon Hyun 
mailto:dh...@hortonworks.com>> wrote:

Hi, All.

Today, while looking over the JIRA issues for Spark 2.2.0 in Apache Spark,
I noticed that there are many unresolved community requests and related efforts
around `Feature parity for ORC with Parquet`.
Some examples I found are the following. I created SPARK-20901 to organize
these, although I may not be the best person to do this.
Please let me know if this is not the proper way in the Apache Spark community.
I think we can leverage or transfer the Parquet improvements in Spark.


SPARK-20799   Unable to infer schema for ORC on reading ORC from S3


Fixed that one for you by changing title: SPARK-20799 Unable to infer schema 
for ORC/Parquet on S3N when secrets are in the URL

I'd recommend closing that as a WONTFIX, as it's related to some security work
in HADOOP-3733 where Path.toString/toURI now strip out the AWS credentials, and
since things get passed around as Path.toString(), it loses them. The
current model meant that everything which logged a path would be logging AWS
secrets, and the logs & exceptions weren't being treated as the sensitive
documents they became the moment that happened.

It could count as a regression, but as it never worked if there was a "/" in
the secret, it's always been a bit patchy.

If this is really needed then it could be pushed back into Hadoop 2.8.2 but 
disabled by default unless you set some option like 
"fs.s3a.insecure.secrets.in.URL".

Maybe also (somehow) changing to only support AWS session token triples (id,
session-secret, session-token), so that the damage caused by secrets in logs,
bug reports &c is less destructive.




Spark Issues on ORC

2017-05-26 Thread Dong Joon Hyun
Hi, All.

Today, while looking over the JIRA issues for Spark 2.2.0 in Apache Spark,
I noticed that there are many unresolved community requests and related efforts
around `Feature parity for ORC with Parquet`.
Some examples I found are the following. I created SPARK-20901 to organize
these, although I may not be the best person to do this.
Please let me know if this is not the proper way in the Apache Spark community.
I think we can leverage or transfer the Parquet improvements in Spark.

SPARK-11412   Support merge schema for ORC
SPARK-12417   Orc bloom filter options are not propagated during file write in 
spark
SPARK-14286   Empty ORC table join throws exception
SPARK-14387   Enable Hive-1.x ORC compatibility with 
spark.sql.hive.convertMetastoreOrc
SPARK-15347   Problem select empty ORC table
SPARK-15474   ORC data source fails to write and read back empty dataframe
SPARK-15682   Hive ORC partition write looks for root hdfs folder for existence
SPARK-15731   orc writer directory permissions
SPARK-15757   Error occurs when using Spark SQL "select" statement on ORC file …
SPARK-16060   Vectorized Orc reader
SPARK-16628   OrcConversions should not convert an ORC table represented by 
MetastoreRelation to HadoopFsRelation if …
SPARK-17047   Spark 2 cannot create ORC table when CLUSTERED
SPARK-18355   Spark SQL fails to read data from a ORC hive table that has a new 
column added to it
SPARK-18540   Wholestage code-gen for ORC Hive tables
SPARK-19109   ORC metadata section can sometimes exceed protobuf message size 
limit
SPARK-19122   Unnecessary shuffle+sort added if join predicates ordering differ 
from bucketing and sorting order
SPARK-19430   Cannot read external tables with VARCHAR columns if they're 
backed by ORC files written by Hive 1.2.1
SPARK-19809   NullPointerException on empty ORC file
SPARK-20515   Issue with reading Hive ORC tables having char/varchar columns in 
Spark SQL
SPARK-20682   Implement new ORC data source based on Apache ORC
SPARK-20728   Make ORCFileFormat configurable between sql/hive and sql/core
SPARK-20799   Unable to infer schema for ORC on reading ORC from S3

Bests,
Dongjoon.


Re: [Spark SQL] ceil and floor functions on doubles

2017-05-19 Thread Dong Joon Hyun
Hi, Anton.

It’s the same result as Hive, isn’t it?

hive> select 9.223372036854786E20, ceil(9.223372036854786E20);
OK
_c0  _c1
9.223372036854786E20 9223372036854775807
Time taken: 2.041 seconds, Fetched: 1 row(s)

Bests,
Dongjoon.

From: Anton Okolnychyi 
Date: Friday, May 19, 2017 at 7:26 AM
To: "dev@spark.apache.org" 
Subject: [Spark SQL] ceil and floor functions on doubles

Hi all,

I am wondering why the results of ceil and floor functions on doubles are 
internally casted to longs. This causes loss of precision since doubles can 
hold bigger numbers.

Consider the following example:

// 9.223372036854786E20 is greater than Long.MaxValue
val df = sc.parallelize(Array(("col", 9.223372036854786E20))).toDF()
df.createOrReplaceTempView("tbl")
spark.sql("select _2 AS original_value, ceil(_2) as ceil_result from 
tbl").show()

+--------------------+-------------------+
|      original_value|        ceil_result|
+--------------------+-------------------+
|9.223372036854786E20|9223372036854775807|
+--------------------+-------------------+

So, the original double value is rounded to 9223372036854775807, which is 
Long.MaxValue.
I think that it would be better to return 9.223372036854786E20 as it was (and 
as it is actually returned by math.ceil before the cast to long). If it is a 
problem, then I can fix this.
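
For what it's worth, the clamping itself comes from the Double-to-Long narrowing conversion
rather than from math.ceil. A minimal plain-Scala sketch (no Spark involved):

// math.ceil keeps the Double; it is the subsequent cast to Long that clamps at Long.MaxValue.
val x = 9.223372036854786E20          // greater than Long.MaxValue
val ceiled = math.ceil(x)             // 9.223372036854786E20, still a Double
val asLong = ceiled.toLong            // 9223372036854775807, i.e. Long.MaxValue
println(s"$ceiled -> $asLong")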

Best regards,
Anton


Re: Faster Spark on ORC with Apache ORC

2017-05-14 Thread Dong Joon Hyun
Hi, All.

As a continuation of SPARK-20682 (Support a new faster ORC data source based on
Apache ORC), I would like to suggest making the default ORCFileFormat
configurable between sql/hive and sql/core for the following.

spark.read.orc(...)
spark.write.orc(...)

CREATE TABLE t
USING ORC
...
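
Here is a rough sketch of what such a switch could look like from user code; the option name
`spark.sql.orc.impl` and its values below are my assumption for illustration only, not
something the PR has settled on:

// Hypothetical sketch: toggling between the new sql/core ORC reader and the legacy sql/hive one.
// The config key "spark.sql.orc.impl" and the values "native"/"hive" are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-impl-switch").getOrCreate()

spark.conf.set("spark.sql.orc.impl", "native")   // assumed: Apache ORC 1.4 based reader in sql/core
val viaNative = spark.read.orc("/tmp/people.orc")

spark.conf.set("spark.sql.orc.impl", "hive")     // assumed: legacy Hive-based reader in sql/hive
val viaHive = spark.read.orc("/tmp/people.orc")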

It's filed as SPARK-20728 and I made a PR for that, too.

In the new PR,

- You can test not only the PR but also your apps more easily with that 
option.
- To help reviews, the PR includes the updated benchmarks for both 
ORCReadBenchmark and ParquetReadBenchmark.

Since the previous PR is ongoing, the new PR inevitably includes some of the previous
PR's changes.
I'll remove the duplication later either way.

Any opinions for Spark ORC improvement are welcome!

Thanks,
Dongjoon.



____
From: Dong Joon Hyun 
Sent: Friday, May 12, 2017 10:49 AM
To: dev@spark.apache.org
Subject: Re: Faster Spark on ORC with Apache ORC

Hi,

I have been wondering how much more Apache Spark 2.2.0 will be improved.

This is the prior record from the source code.


Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                   215 /  262         73.0          13.7       1.0X
SQL Parquet MR                          1946 / 2083          8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.

Apache Spark seems to have improved a lot again. But strangely, the MR version has
improved even more in general.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                   102 /  123        153.7           6.5       1.0X
SQL Parquet MR                           409 /  436         38.5          26.0       0.3X



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) looks like the 
following.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL ORC Vectorized                       147 /  153        107.3           9.3       1.0X
SQL ORC MR                               338 /  369         46.5          21.5       0.4X
HIVE ORC MR                              408 /  424         38.6          25.9       0.4X


Given that this is an initial PR without optimization, ORC vectorization seems
to be catching up well.


Bests,
Dongjoon.


From: Dongjoon Hyun mailto:dh...@hortonworks.com>>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org<mailto:dev@spark.apache.org>" 
mailto:dev@spark.apache.org>>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark always has been a fast and general engine, and
since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with Hive 
dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and 
get some benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together which 
means full vectorization support.

- Stability: Apache ORC 1.4.0 already has many fixes and we can depend on 
ORC community effort in the future.

- Usability: Users can use `ORC` data sources without hive module (-Phive)

- Maintainability: Reduce the Hive dependency and eventually remove some 
old legacy code from `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into `sql/core` 
module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.


Re: Faster Spark on ORC with Apache ORC

2017-05-12 Thread Dong Joon Hyun
Hi,

I have been wondering how much more Apache Spark 2.2.0 will be improved.

This is the prior record from the source code.


Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                   215 /  262         73.0          13.7       1.0X
SQL Parquet MR                          1946 / 2083          8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.

Apache Spark seems to have improved a lot again. But strangely, the MR version has
improved even more in general.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                   102 /  123        153.7           6.5       1.0X
SQL Parquet MR                           409 /  436         38.5          26.0       0.3X



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) looks like the 
following.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL ORC Vectorized                       147 /  153        107.3           9.3       1.0X
SQL ORC MR                               338 /  369         46.5          21.5       0.4X
HIVE ORC MR                              408 /  424         38.6          25.9       0.4X


Given that this is an initial PR without optimization, ORC vectorization seems
to be catching up well.


Bests,
Dongjoon.


From: Dongjoon Hyun mailto:dh...@hortonworks.com>>
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org" 
mailto:dev@spark.apache.org>>
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark always has been a fast and general engine, and
since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with Hive 
dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and 
get some benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together which 
means full vectorization support.

- Stability: Apache ORC 1.4.0 already has many fixes and we can depend on 
ORC community effort in the future.

- Usability: Users can use `ORC` data sources without hive module (-Phive)

- Maintainability: Reduce the Hive dependency and eventually remove some 
old legacy code from `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into `sql/core` 
module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.


Faster Spark on ORC with Apache ORC

2017-05-09 Thread Dong Joon Hyun
Hi, All.

Apache Spark always has been a fast and general engine, and
since SPARK-2883, Spark supports Apache ORC inside `sql/hive` module with Hive 
dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and 
get some benefits.

- Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together which 
means full vectorization support.

- Stability: Apache ORC 1.4.0 already has many fixes and we can depend on 
ORC community effort in the future.

- Usability: Users can use `ORC` data sources without hive module (-Phive)

- Maintainability: Reduce the Hive dependency and eventually remove some 
old legacy code from `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into `sql/core` 
module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)

Could you give some opinions on this approach?

Bests,
Dongjoon.


Re: [VOTE] Apache Spark 2.1.1 (RC4)

2017-04-27 Thread Dong Joon Hyun
+1

I’ve got the same result (Scala/R test) on JDK 1.8.0_131 at this time.

Bests,
Dongjoon.

From: Reynold Xin mailto:r...@databricks.com>>
Date: Thursday, April 27, 2017 at 1:06 PM
To: Michael Armbrust mailto:mich...@databricks.com>>, 
"dev@spark.apache.org" 
mailto:dev@spark.apache.org>>
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC4)

+1
On Thu, Apr 27, 2017 at 11:59 AM Michael Armbrust 
mailto:mich...@databricks.com>> wrote:
I'll also +1

On Thu, Apr 27, 2017 at 4:20 AM, Sean Owen 
mailto:so...@cloudera.com>> wrote:
+1 , same result as with the last RC. All checks out for me.

On Thu, Apr 27, 2017 at 1:29 AM Michael Armbrust 
mailto:mich...@databricks.com>> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.1. 
The vote is open until Sat, April 29th, 2018 at 18:00 PST and passes if a 
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.1
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is 
v2.1.1-rc4 
(267aca5bd5042303a718d10635bc0d1a1596853f)

List of JIRA tickets resolved can be found with this 
filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1232/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc4-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then reporting 
any regressions.

What should happen to JIRA tickets still targeting 2.1.1?

Committers should look at those and triage. Extremely important bug fixes, 
documentation, and API tweaks that impact compatibility should be worked on 
immediately. Everything else please retarget to 2.1.2 or 2.2.0.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless 
the bug in question is a regression from 2.1.0.

What happened to RC1?

There were issues with the release packaging and as a result was skipped.



Re: [VOTE] Apache Spark 2.1.1 (RC3)

2017-04-19 Thread Dong Joon Hyun
+1

I tested RC3 on CentOS 7.3.1611/OpenJDK 1.8.0_121/R 3.3.3
with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr`

At the end of the R tests, I saw `Had CRAN check errors; see logs.`,
but the tests passed and the log file looks good.

Bests,
Dongjoon.

From: Reynold Xin mailto:r...@databricks.com>>
Date: Wednesday, April 19, 2017 at 3:41 PM
To: Marcelo Vanzin mailto:van...@cloudera.com>>
Cc: Michael Armbrust mailto:mich...@databricks.com>>, 
"dev@spark.apache.org" 
mailto:dev@spark.apache.org>>
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC3)

+1

On Wed, Apr 19, 2017 at 3:31 PM, Marcelo Vanzin 
mailto:van...@cloudera.com>> wrote:
+1 (non-binding).

Ran the hadoop-2.6 binary against our internal tests and things look good.

On Tue, Apr 18, 2017 at 11:59 AM, Michael Armbrust
mailto:mich...@databricks.com>> wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.1. The vote is open until Fri, April 21st, 2018 at 13:00 PST and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3
> (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
>
> FAQ
>
> How can I help test this release?
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> What should happen to JIRA tickets still targeting 2.1.1?
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> But my bug isn't fixed!??!
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.0.
>
> What happened to RC1?
>
> There were issues with the release packaging and as a result was skipped.



--
Marcelo
