Re: Tungsten off heap memory access for C++ libraries

2015-08-30 Thread Paul Weiss
Reynold,

That is great to hear.  I'm definitely interested in how item 2 is being
implemented and how it will be exposed to C++.  One important aspect of
leveraging the off-heap memory is how the data is organized, as well as
being able to easily access it from the C++ side.  For example, how would
you store a multi-dimensional array of doubles, and how would you specify
that?  Perhaps Avro or Protobuf could be used for storing complex nested
structures, although making that zero-copy could be a challenge.
Regardless of how the internals lay the data out in memory, the important
requirements are:

a) ensuring zero copy
b) providing a friendly API on the C++ side so folks don't have to deal
with raw bytes, serialization, and JNI
c) the ability to specify a complex (multi-type and nested) structure via a
schema for memory storage (compile-time generation would be sufficient, but
run-time dynamic schemas would be extremely flexible)

Perhaps a simple way to accomplish this would be to enhance DataFrames with
a C++ API that can access the off-heap memory in a clean way from Spark (in
process and with zero copy), along the lines of the sketch below.
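For concreteness, here is a rough JVM-side sketch of what an in-process,
per-partition handoff could look like (the names are made up; a real
binding would be a JNI native method whose C++ side obtains the raw
pointer with GetDirectBufferAddress and reads the doubles in place):

    import java.nio.ByteBuffer
    import org.apache.spark.{SparkConf, SparkContext}

    object OffHeapHandoffSketch {

      // Placeholder for the hypothetical JNI binding. The real C++ side
      // would read the direct buffer's memory without copying it.
      def nativeSum(buf: ByteBuffer, count: Int): Double = {
        var i = 0
        var sum = 0.0
        while (i < count) { sum += buf.getDouble(i * 8); i += 1 }
        sum
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("offheap-sketch").setMaster("local[*]"))
        val doubles = sc.parallelize(1 to 1000000).map(_.toDouble)

        // Per-partition, in-process handoff through a direct (off-heap)
        // buffer. With Tungsten, ideally the C++ side would be handed
        // Spark's own off-heap pages instead of re-filling a buffer here.
        val total = doubles.mapPartitions { iter =>
          val values = iter.toArray
          val buf = ByteBuffer.allocateDirect(values.length * 8)
          values.foreach(v => buf.putDouble(v))
          Iterator.single(nativeSum(buf, values.length))
        }.sum()

        println(s"sum computed over off-heap buffers: $total")
        sc.stop()
      }
    }

The interesting part is still requirement c): the C++ side needs a schema
to interpret those bytes without serialization or copying.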

Also, is this work being done on a branch I could look into further and try
out?

thanks,
-paul



On Sat, Aug 29, 2015 at 9:40 PM, Reynold Xin  wrote:

> Supporting non-JVM code without memory copying and serialization is
> actually one of the motivations behind Tungsten. We didn't talk much about
> it since it is not end-user-facing and it is still too early. There are a
> few challenges still:
>
> 1. Spark cannot run entirely in off-heap mode (by entirely here I'm
> referring to all the data-plane memory, not control-plane such as RPCs
> since those don't matter much). There is nothing fundamental. It just takes
> a while to make sure all code paths allocate/free memory using the proper
> allocators.
>
> 2. The memory layout of data is still in flux, since we are only 4 months
> into Tungsten. It will change pretty frequently for the foreseeable
> future, and as a result, the C++ side of things will have to change as well.
>
>
>
> On Sat, Aug 29, 2015 at 12:29 PM, Timothy Chen  wrote:
>
>> I would also like to see data shared off-heap with a 3rd-party C++
>> library via JNI. I think the complications would be how to manage this
>> memory and how to make sure the 3rd-party libraries also adhere to the
>> access contracts.
>>
>> Tim
>>
>> On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss 
>> wrote:
>> > Hi,
>> >
>> > Would the benefits of project Tungsten be available for direct access
>> > to the off-heap memory by non-JVM programs?  Spark using DataFrames
>> > with the Tungsten improvements will definitely help analytics within
>> > the JVM world, but accessing outside 3rd-party C++ libraries is a
>> > challenge, especially when trying to do it with zero copy.
>> >
>> > Ideally the off-heap memory would be accessible to a non-JVM program
>> > invoked in process using JNI, once per partition.  The alternatives
>> > involve the additional cost of starting another process (if using
>> > pipes) as well as copying all the data.
>> >
>> > Beyond read-only non-JVM access in process, would there be a way to
>> > share the DataFrame that is in memory out of process and across Spark
>> > contexts?  That way an expensive, complicated initial build-up of a
>> > DataFrame would not have to be replicated, and the startup cost would
>> > not have to be paid again on failure.
>> >
>> > thanks,
>> >
>> > -paul
>> >
>>
>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-30 Thread Sandy Ryza
+1 (non-binding)
built from source and ran some jobs against YARN

-Sandy

On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan  wrote:

>
> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>
> Regards,
> Vaquar khan
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>    Center And Scale OK
> 2.5. RDD operations OK
>    State of the Union Texts - MapReduce, Filter, sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>    Model evaluation/optimization (rank, numIter, lambda) with
>    itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. joins,sql,set operations,udf OK
>
> Cheers
> 
>
> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc2:
>>
>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc2) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> ===
>> What justifies a -1 vote for this release?
>> ===
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already
>> present in 1.4, minor regressions, or bugs related to new features will not
>> block this release.
>>
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.0?
>> ===
>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>> branch-1.5, since the documentation will be packaged separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.6+.
>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
>> version.
>>
>>
>> ==
>> Major changes to help you focus your testing
>> ==
>>
>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>> contributors. I've curated a list of important changes for 1.5. For the
>> complete list, please refer to Apache JIRA changelog.
>>
>> RDD/DataFrame/SQL APIs
>>
>> - New UDAF interface
>> - DataFrame hints for broadcast join
>> - expr function for turning a SQL expression into a DataFrame column
>> - Improved support for NaN values
>> - StructType now supports ordering
>> - TimestampType precision is reduced to 1us
>> - 100 new built-in expressions, including ...
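As a quick illustration of two of the DataFrame/SQL items above (the
broadcast join hint and the expr function), a minimal sketch with made-up
data:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{broadcast, expr}

    object DataFrame15Sketch {
      case class Order(orderId: Int, countryCode: String,
                       unitPrice: Double, qty: Int, discount: Double)
      case class Country(countryCode: String, name: String)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("df-1.5-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val orders = Seq(Order(1, "US", 10.0, 3, 0.1),
                         Order(2, "DE", 5.0, 2, 0.0)).toDF()
        val countries = Seq(Country("US", "United States"),
                            Country("DE", "Germany")).toDF()

        // Broadcast hint: tells the planner that `countries` is small
        // enough to ship to every executor for the join.
        val joined = orders.join(broadcast(countries), "countryCode")

        // expr() turns a SQL expression string into a Column.
        joined.withColumn("total", expr("unitPrice * qty * (1 - discount)"))
              .show()

        sc.stop()
      }
    }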

[ANNOUNCE] New testing capabilities for pull requests

2015-08-30 Thread Patrick Wendell
Hi All,

For pull requests that modify the build, you can now test different
build permutations as part of the pull request builder. To trigger
these, you add a special phrase to the title of the pull request.
Current options are:

[test-maven] - run tests using Maven instead of sbt
[test-hadoop1.0] - test using older Hadoop versions (can use 1.0, 2.0,
2.2, and 2.3).
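For example, a (hypothetical) pull request titled
"[SPARK-XXXXX] [BUILD] [test-maven] Fix the assembly build" would be
tested with Maven instead of sbt.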

The relevant source code is here:
https://github.com/apache/spark/blob/master/dev/run-tests-jenkins#L193

This is useful because it allows up-front testing of build changes to
avoid breaks once a patch has already been merged.

I've documented this on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

- Patrick




Re: IOError on createDataFrame

2015-08-30 Thread Akhil Das
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?
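For example (paths made up), you could export
SPARK_LOCAL_DIRS=/mnt/bigdisk/spark-tmp in conf/spark-env.sh on each
worker, or set the equivalent property when building the context in a
standalone application:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only; note that SPARK_LOCAL_DIRS (standalone/Mesos) or
    // LOCAL_DIRS (YARN), when set by the cluster manager, override
    // spark.local.dir.
    val conf = new SparkConf()
      .setAppName("bigger-scratch-space")
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
    val sc = new SparkContext(conf)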

Thanks
Best Regards

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti 
wrote:

> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe I encountered the error
>
> TypeError: cannot create an RDD from type: 
>
> However, looking into the code shows this is raised from a generic "except
> Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A
> debugging session reveals the true error is that SPARK_LOCAL_DIRS ran out
> of space:
>
> -> rdd = self._sc.parallelize(data)
> (Pdb)
> *IOError: (28, 'No space left on device')*
>
> In this case, creating an RDD from a large matrix (~50 million rows) is
> required for us. I'm a bit concerned about Spark's process here:
>
>    a. turning the dataframe into records (data.to_records)
>    b. writing it to tmp
>    c. reading it back again in Scala.
>
> Is there a better way? The intention would be to operate on slices of this
> large dataframe using numpy operations via Spark's transformations and
> actions.
>
> Thanks,
> FDS
>
> 1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>