[Catalyst] Code Generation and the Constant Pool Limit

2017-05-12 Thread Aleksander Eskilson
Hi all,

I want to take a moment to highlight an issue and hopefully invite some
developers to review a pull request [1] for SPARK-18016 [2]. Code generated by
Catalyst currently places all split methods and variables into a single class.
When the data schema is sufficiently complex (wide or deeply nested), the
volume of generated constants declared either in methods or as global
variables exceeds a Java class's constant pool limit, causing an exception.
Without a fix, there is an effective limit on the complexity of data that can
be marshaled to a DataFrame/Dataset. A method for
addressing the issue is discussed in the pull request. The change is
non-trivial, so I'm hoping to get a few sets of eyes on it, especially ones
that might be more familiar with the preferred direction of the Catalyst
project.
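
For anyone who wants to see the failure mode without a complex production
schema, here is a rough, illustrative reproduction sketch. It is not taken from
the JIRA or the PR; the column count (4000 here) is a guess, and the number of
columns needed to trip the limit varies by Spark version and schema. On
affected versions it can fail with a Janino error along the lines of
"Constant pool ... has grown past JVM limit".

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ConstantPoolRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("constant-pool-repro")
      .getOrCreate()
    import spark.implicits._

    // A single-row DataFrame with a few thousand literal columns; 4000 is an
    // illustrative guess, not a known threshold.
    val cols = (0 until 4000).map(i => lit(i).as(s"c$i"))
    val wide = Seq(1).toDF("x").select(cols: _*)

    // Project over every column so the generated code has to reference all of
    // them in one generated class.
    wide.selectExpr((0 until 4000).map(i => s"c$i + 1"): _*).show(1)

    spark.stop()
  }
}
```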

--
Alek Eskilson

[1] - https://github.com/apache/spark/pull/16648
[2] - https://issues.apache.org/jira/browse/SPARK-18016


Re: Uploading PySpark 2.1.1 to PyPi

2017-05-12 Thread Sameer Agarwal
Holden,

Thanks again for pushing this forward! Out of curiosity, did we get an
approval from the PyPi folks?

Regards,
Sameer

On Mon, May 8, 2017 at 11:44 PM, Holden Karau  wrote:

> So I have a PR to add this to the release process documentation - I'm
> waiting on the necessary approvals from PyPi folks before I merge that
> in case anything changes as a result of the discussion (like uploading to
> the legacy host or something). As for conda-forge, it's not something we
> need to do, but I'll add a note about pinging them when we make a new
> release so their users can keep up to date easily. The parent JIRA for PyPi
> related tasks is SPARK-18267 :)
>
>
> On Mon, May 8, 2017 at 6:22 PM cloud0fan  wrote:
>
>> Hi Holden,
>>
>> Thanks for working on it! Do we have a JIRA ticket to track this? We should
>> make it part of the release process for all following Spark releases, and it
>> would be great to have a JIRA ticket recording the detailed steps, and
>> eventually to automate them.
>>
>> Thanks,
>> Wenchen
>>
>>
>>


-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag


Run an OS command or script supplied by the user at the start of each executor

2017-05-12 Thread Luca Canali
Hi,

I have recently experimented with a few ways to run OS commands from the
executors (in a YARN deployment) for a specific use case where we want to
interact with an external system in our environment. From that experience I
thought that the possibility to spawn a script at the start of each executor
could be quite handy in a few cases, and maybe more people are interested.
For example, I am thinking of cases such as interacting with external
systems/APIs, injecting custom configurations via scripts distributed to the
executors, or spawning custom monitoring tasks, etc.
These are probably all niche cases, but the feature seems quite easy to implement.
I just wanted to check with the list whether something like this has already
come up in the past, and/or whether there are thoughts about it or details that
I have overlooked.
My simple proof of concept for implementing a "startup command" on the 
executors can be found at:
https://github.com/LucaCanali/spark/commit/e294a1f0d55af115f45fa6d2d7dcf81f751955fa
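
For reference, the kind of workaround available today without such a hook looks
roughly like the sketch below. This is illustrative only and not the commit
above: the script name is made up, and the command runs once per task rather
than strictly once per executor, which is exactly why a proper startup hook
would be nicer.

```
import org.apache.spark.sql.SparkSession
import scala.sys.process._

object ExecutorCommandSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("executor-command-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical script name; with --files it can be shipped to the executors.
    val cmd = "./setup_external_system.sh"

    // Oversubscribe tasks relative to cores so every executor is likely to run
    // the command, though this is still "once per task", not per executor.
    val numTasks = sc.defaultParallelism * 4
    sc.parallelize(0 until numTasks, numTasks).foreachPartition { _ =>
      val exitCode = Seq("bash", "-c", cmd).!
      if (exitCode != 0) println(s"startup command exited with code $exitCode")
    }

    spark.stop()
  }
}
```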
I can put all this in a Jira in case people here think it makes sense.

Thanks,
Luca





Re: Faster Spark on ORC with Apache ORC

2017-05-12 Thread Dong Joon Hyun
Hi,

I have been wondering how much more Apache Spark 2.2.0 has improved.

This is the prior record from the source code.


Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                    215 /  262          73.0          13.7       1.0X
SQL Parquet MR                           1946 / 2083           8.1         123.7       0.1X


So, I got a similar (but slower) machine and ran ParquetReadBenchmark on it.

Apache Spark seems to have improved a lot since then. Strangely, though, the MR
version has improved even more overall.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Vectorized                     102 /  123         153.7           6.5       1.0X
SQL Parquet MR                             409 /  436          38.5          26.0       0.3X



For ORC, my PR ( https://github.com/apache/spark/pull/17924 ) looks like the 
following.


Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz

SQL Single Int Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL ORC Vectorized                         147 /  153         107.3           9.3       1.0X
SQL ORC MR                                 338 /  369          46.5          21.5       0.4X
HIVE ORC MR                                408 /  424          38.6          25.9       0.4X


Given that this is an initial PR without optimization, ORC vectorization seems
to be catching up quickly.
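
For anyone who wants to try a rough comparison locally, here is a simplified,
hedged sketch of a single-int-column scan. It is not the actual
ParquetReadBenchmark harness; the paths, row count, and iteration count are
illustrative, and the ORC read currently requires a Hive-enabled build.

```
import org.apache.spark.sql.SparkSession

object SingleIntColumnScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("scan-sketch").getOrCreate()

    // Illustrative paths and row count.
    val rows = 15L * 1000 * 1000
    spark.range(rows).write.mode("overwrite").parquet("/tmp/scan_parquet")
    spark.range(rows).write.mode("overwrite").orc("/tmp/scan_orc")

    def time(label: String)(f: => Unit): Unit = {
      val start = System.nanoTime()
      f
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    }

    // Warm up and repeat a few times; a real benchmark runs many more iterations.
    for (_ <- 1 to 3) {
      time("parquet scan") { spark.read.parquet("/tmp/scan_parquet").selectExpr("sum(id)").collect() }
      time("orc scan") { spark.read.orc("/tmp/scan_orc").selectExpr("sum(id)").collect() }
    }

    spark.stop()
  }
}
```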


Bests,
Dongjoon.


From: Dongjoon Hyun
Date: Tuesday, May 9, 2017 at 6:15 PM
To: "dev@spark.apache.org"
Subject: Faster Spark on ORC with Apache ORC

Hi, All.

Apache Spark has always been a fast and general engine, and since SPARK-2883
it has supported Apache ORC inside the `sql/hive` module with a Hive
dependency.

With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC faster and
gain several benefits:

- Speed: Use Spark `ColumnarBatch` and ORC `RowBatch` together, which means
full vectorization support.

- Stability: Apache ORC 1.4.0 already has many fixes, and we can depend on the
ORC community's effort in the future.

- Usability: Users can use the `ORC` data source without the hive module
(-Phive).

- Maintainability: Reduce the Hive dependency and eventually remove some old
legacy code from the `sql/hive` module.

As a first step, I made a PR adding a new ORC data source into the `sql/core`
module.

https://github.com/apache/spark/pull/17924  (+ 3,691 lines, -0)
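
To make the intended user experience concrete, here is a minimal usage sketch
(the path is illustrative). These DataFrame calls already exist today; the
point of the PR is that they would also work in a build without -Phive.

```
import org.apache.spark.sql.SparkSession

object OrcReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("orc-sketch").getOrCreate()

    // Write and read back a simple table with the ORC file format.
    val df = spark.range(0, 1000000).toDF("id")
    df.write.mode("overwrite").orc("/tmp/orc_sketch")

    val back = spark.read.orc("/tmp/orc_sketch")
    back.selectExpr("sum(id)").show()

    spark.stop()
  }
}
```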

Could you give some opinions on this approach?

Bests,
Dongjoon.


Re: RandomForest caching

2017-05-12 Thread madhu phatak
Hi,
I opened a jira.

https://issues.apache.org/jira/browse/SPARK-20723

Can someone have a look?
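
For illustration, a user-settable storage level could follow the pattern ALS
already uses for its intermediateStorageLevel param. Below is a rough sketch of
what that might look like; the names and wiring are hypothetical and not from
any existing PR. The current behavior is in the quoted message below.

```
// Hypothetical sketch: a shared param trait that would let users choose the
// StorageLevel used for intermediate data such as RandomForest's bagged input.
import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.storage.StorageLevel

trait HasIntermediateStorageLevel extends Params {
  // String-valued so it is easy to set from Python and to serialize.
  final val intermediateStorageLevel: Param[String] = new Param[String](this,
    "intermediateStorageLevel",
    "StorageLevel for intermediate datasets (e.g. the bagged input in RandomForest)",
    (s: String) => scala.util.Try(StorageLevel.fromString(s)).isSuccess)

  setDefault(intermediateStorageLevel -> "MEMORY_AND_DISK")

  final def getIntermediateStorageLevel: String = $(intermediateStorageLevel)
}

// Inside RandomForest, the persist call would then become something like:
//   .persist(StorageLevel.fromString(getIntermediateStorageLevel))
```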

On Fri, Apr 28, 2017 at 1:34 PM, madhu phatak  wrote:

> Hi,
>
> I am testing RandomForestClassification with 50 GB of data which is cached
> in memory. I have 64 GB of RAM, of which 28 GB is used for caching the
> original dataset.
>
> When I run random forest, it caches around 300 GB of intermediate data,
> which evicts the original dataset from the cache. This caching is triggered
> by the below code in RandomForest.scala
>
> ```
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate,
>     numTrees, withReplacement, seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ```
>
> As I don't have control over the storage level, I cannot make sure the
> original dataset stays in memory for other interactive tasks while random
> forest is running.
>
> Is it a good idea to make this storage level a user parameter? If so, I can
> open a JIRA issue and submit a PR for it.
>
> --
> Regards,
> Madhukara Phatak
> http://datamantra.io/
>



-- 
Regards,
Madhukara Phatak
http://datamantra.io/