Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are. On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Partitioners also work in local mode, the only question is how to see which data fell into each partition
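A minimal sketch of one way to see which records fell into which partition, written in PySpark-style Python. It assumes an existing RDD is passed in; `tag_with_partition` and `describe_partitions` are illustrative names, not part of Spark. `getNumPartitions()` and `mapPartitionsWithIndex()` are used here as the Python-side counterparts of the `rdd.partitions()` call mentioned above, and `collect()` is only sensible for small test data.

```python
def tag_with_partition(index, iterator):
    """Label every element with the index of the partition that holds it.

    Shaped for RDD.mapPartitionsWithIndex, which calls the function with
    (partition index, iterator over that partition's elements).
    """
    for element in iterator:
        yield (index, element)


def describe_partitions(rdd):
    """Return (number of partitions, [(partition index, element), ...]).

    `rdd` is assumed to be a small PySpark RDD; everything is collected
    to the driver, so do not use this on large data.
    """
    tagged = rdd.mapPartitionsWithIndex(tag_with_partition).collect()
    return rdd.getNumPartitions(), tagged
```

This works the same way in local mode, so a custom Partitioner can be checked on a laptop before running on a cluster.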

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-09-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14120657#comment-14120657 ] Matei Zaharia commented on SPARK-3215: -- Thanks Marcelo! Just a few notes on the API

[jira] [Updated] (SPARK-3052) Misleading and spurious FileSystem closed errors whenever a job fails while reading from Hadoop

2014-09-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3052: - Assignee: Sandy Ryza Misleading and spurious FileSystem closed errors whenever a job fails while

[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118913#comment-14118913 ] Matei Zaharia commented on SPARK-3098: -- Yup, let's maybe document this for now. I'll

[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118918#comment-14118918 ] Matei Zaharia commented on SPARK-3098: -- Created SPARK-3356 to track this. In some

[jira] [Created] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2014-09-02 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3356: Summary: Document when RDD elements' ordering within partitions is nondeterministic Key: SPARK-3356 URL: https://issues.apache.org/jira/browse/SPARK-3356 Project

[jira] [Resolved] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3098. -- Resolution: Won't Fix In some cases, operation zipWithIndex get a wrong results

[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-09-01 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117622#comment-14117622 ] Matei Zaharia commented on SPARK-3098: -- It's true that the ordering of values after

Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Matei Zaharia
Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: What do people think of running the Big

[jira] [Updated] (SPARK-3010) fix redundant conditional

2014-08-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3010: - Assignee: wangfei fix redundant conditional - Key

[jira] [Resolved] (SPARK-3010) fix redundant conditional

2014-08-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3010. -- Resolution: Fixed Fix Version/s: (was: 1.1.0) 1.2.0

[jira] [Updated] (SPARK-3010) fix redundant conditional

2014-08-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3010: - Priority: Trivial (was: Major) fix redundant conditional

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-08-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116923#comment-14116923 ] Matei Zaharia commented on SPARK-: -- The slowdown might be partly due to adding

[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-08-31 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116942#comment-14116942 ] Matei Zaharia commented on SPARK-: -- I see, that makes sense. Large number

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
. does it apply to both sides of the join, or only one (while the other side is streaming)? On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
, Steve Lewis (lordjoe2...@gmail.com) wrote: Is there a sample of how to do this? I see 1.1 is out but cannot find samples of mapPartitions. A Java sample would be very useful. On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all

[jira] [Resolved] (SPARK-2889) Spark creates Hadoop Configuration objects inconsistently

2014-08-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2889. -- Resolution: Fixed Fix Version/s: 1.2.0 Spark creates Hadoop Configuration objects

[jira] [Updated] (SPARK-3318) The documentation for addFiles is wrong

2014-08-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3318: - Assignee: Holden Karau The documentation for addFiles is wrong

[jira] [Resolved] (SPARK-3318) The documentation for addFiles is wrong

2014-08-30 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3318. -- Resolution: Fixed Fix Version/s: 1.2.0 The documentation for addFiles is wrong

Re: Mapping Hadoop Reduce to Spark

2014-08-30 Thread Matei Zaharia
In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value pairs. Unfortunately sortByKey does not let you control the Partitioner, but it's fairly easy to write your own version that does if this is important. In
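The sortByKey-then-mapPartitions pattern described above could look roughly like the following PySpark-style sketch. The function names are hypothetical, and summing the values is only a stand-in for whatever the Hadoop reducer did. It relies on the assumption that sortByKey places all pairs with the same key in the same partition and in key order, so consecutive pairs can be grouped while streaming through the iterator.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted_partition(pairs):
    """Consume (key, value) pairs that are already sorted by key.

    Yields one (key, total) per run of equal keys, much like a Hadoop
    reducer seeing each key with its values. Summation is a placeholder
    for the real reduce logic.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def hadoop_style_reduce(pair_rdd):
    """Sort a pair RDD by key, then reduce each partition in one pass."""
    return pair_rdd.sortByKey().mapPartitions(reduce_sorted_partition)
```

Because the partition function is a generator, it never holds more than one key's running total in memory, which is the main reason to prefer it over grouping all values first.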

[jira] [Updated] (SPARK-3257) Enable :cp to add JARs in spark-shell (Scala 2.11)

2014-08-29 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3257: - Assignee: Heather Miller Enable :cp to add JARs in spark-shell (Scala 2.11

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Matei Zaharia
Personally I'd actually consider putting CDH4 back if there are still users on it. It's always better to be inclusive, and the convenience of a one-click download is high. Do we have a sense on what % of CDH users still use CDH4? Matei On August 28, 2014 at 11:31:13 PM, Sean Owen

Re: Possible to make one executor be able to work on multiple tasks simultaneously?

2014-08-29 Thread Matei Zaharia
Yes, executors run one task per core of your machine by default. You can also manually launch them with more worker threads than you have cores. What cluster manager are you on? Matei On August 29, 2014 at 11:24:33 AM, Victor Tso-Guillen (v...@paxata.com) wrote: I'm thinking of local mode
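For the local-mode case raised in the question, a small sketch of how the thread count could be chosen explicitly. It assumes the standard `local[N]` master URL (run locally with N worker threads); `local_master` and its parameters are made-up names for illustration only.

```python
import os


def local_master(threads=None, oversubscribe=1):
    """Build a local-mode master URL such as "local[8]".

    threads:       explicit number of worker threads, if given.
    oversubscribe: otherwise, multiplier applied to the detected core
                   count (2 means two task threads per core).
    """
    cores = os.cpu_count() or 1
    count = threads if threads is not None else cores * oversubscribe
    if count < 1:
        raise ValueError("need at least one worker thread")
    return "local[%d]" % count
```

The resulting string would be passed as the master when creating the SparkContext; oversubscribing mainly helps when tasks spend time blocked on I/O rather than on CPU.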

[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14114247#comment-14114247 ] Matei Zaharia commented on SPARK-3277: -- Thanks Mridul -- I think Andrew and Patrick

[jira] [Resolved] (SPARK-3239) Choose disks for spilling randomly

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3239. -- Resolution: Fixed Fix Version/s: 1.1.0 Choose disks for spilling randomly

[jira] [Updated] (SPARK-3239) Choose disks for spilling randomly in PySpark

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3239: - Summary: Choose disks for spilling randomly in PySpark (was: Choose disks for spilling randomly

[jira] [Updated] (SPARK-3256) Enable :cp to add JARs in spark-shell (Scala 2.10)

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3256: - Summary: Enable :cp to add JARs in spark-shell (Scala 2.10) (was: Enable :cp to add JARs

[jira] [Created] (SPARK-3257) Enable :cp to add JARs in spark-shell (Scala 2.11)

2014-08-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3257: Summary: Enable :cp to add JARs in spark-shell (Scala 2.11) Key: SPARK-3257 URL: https://issues.apache.org/jira/browse/SPARK-3257 Project: Spark Issue Type

[jira] [Updated] (SPARK-3256) Enable :cp to add JARs in spark-shell (Scala 2.10)

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3256: - Fix Version/s: (was: 1.2.0) Enable :cp to add JARs in spark-shell (Scala 2.10

[jira] [Updated] (SPARK-3256) Enable :cp to add JARs in spark-shell (Scala 2.10)

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3256: - Assignee: Chip Senkbeil Enable :cp to add JARs in spark-shell (Scala 2.10

[jira] [Resolved] (SPARK-3256) Enable :cp to add JARs in spark-shell (Scala 2.10)

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3256. -- Resolution: Fixed Fix Version/s: 1.2.0 Enable :cp to add JARs in spark-shell (Scala

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112791#comment-14112791 ] Matei Zaharia commented on SPARK-3215: -- Hey Marcelo, while this could be useful

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112943#comment-14112943 ] Matei Zaharia commented on SPARK-3215: -- I think we should try this externally first

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113055#comment-14113055 ] Matei Zaharia commented on SPARK-3215: -- As I mentioned above, there's more to it than

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113121#comment-14113121 ] Matei Zaharia commented on SPARK-3215: -- The problem is just how different future

[jira] [Created] (SPARK-3271) Delete unused methods in Utils

2014-08-27 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3271: Summary: Delete unused methods in Utils Key: SPARK-3271 URL: https://issues.apache.org/jira/browse/SPARK-3271 Project: Spark Issue Type: Improvement

[jira] [Resolved] (SPARK-3271) Delete unused methods in Utils

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3271. -- Resolution: Fixed Delete unused methods in Utils

[jira] [Resolved] (SPARK-3265) Allow using custom ipython executable with pyspark

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3265. -- Resolution: Fixed Fix Version/s: (was: 1.0.2) 1.2.0

[jira] [Updated] (SPARK-3265) Allow using custom ipython executable with pyspark

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3265: - Affects Version/s: 1.1.0 Allow using custom ipython executable with pyspark

[jira] [Updated] (SPARK-3271) Delete unused methods in Utils

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3271: - Assignee: wangfei Delete unused methods in Utils

[jira] [Commented] (SPARK-3215) Add remote interface for SparkContext

2014-08-27 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14113294#comment-14113294 ] Matei Zaharia commented on SPARK-3215: -- Okay, so my suggestion is do it separately

Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in pig test suite. That means UDF, Joins other

Re: SchemaRDD

2014-08-27 Thread Matei Zaharia
I think this will increasingly be its role, though it doesn't make sense to move it to core because it is clearly just a client of the core APIs. What usage do you have in mind in particular? It would be nice to know how the non-SQL APIs for this could be better. Matei On August 27, 2014 at

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Matei Zaharia
You can use spark-shell -i file.scala to run that. However, that keeps the interpreter open at the end, so you need to make your file end with System.exit(0) (or even more robustly, do stuff in a try {} and add that in finally {}). In general it would be better to compile apps and run them

[jira] [Resolved] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3073. -- Resolution: Fixed Fix Version/s: 1.2.0 improve large sort (external sort) for PySpark

[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111631#comment-14111631 ] Matei Zaharia commented on SPARK-2926: -- I see, thanks for posting the benchmarks

[jira] [Updated] (SPARK-3225) Typo in script

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3225: - Priority: Trivial (was: Minor) Typo in script -- Key: SPARK

[jira] [Updated] (SPARK-3225) Typo in script

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3225: - Assignee: WangTaoTheTonic Typo in script -- Key: SPARK-3225

[jira] [Resolved] (SPARK-3225) Typo in script

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3225. -- Resolution: Fixed Fix Version/s: 1.2.0 Typo in script

[jira] [Updated] (SPARK-3240) Document workaround for MESOS-1688

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3240: - Assignee: Martin Weindel Document workaround for MESOS-1688

[jira] [Created] (SPARK-3240) Document workaround for MESOS-1688

2014-08-26 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3240: Summary: Document workaround for MESOS-1688 Key: SPARK-3240 URL: https://issues.apache.org/jira/browse/SPARK-3240 Project: Spark Issue Type: Documentation

[jira] [Resolved] (SPARK-3240) Document workaround for MESOS-1688

2014-08-26 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3240. -- Resolution: Fixed Document workaround for MESOS-1688

Re: spark-ec2 1.0.2 creates EC2 cluster at wrong version

2014-08-26 Thread Matei Zaharia
This shouldn't be a chicken-and-egg problem, since the script fetches the AMI from a known URL. Seems like an issue in publishing this release. On August 26, 2014 at 1:24:45 PM, Shivaram Venkataraman (shiva...@eecs.berkeley.edu) wrote: This is a chicken and egg problem in some sense. We can't

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Matei Zaharia
It should be fixed now. Maybe you have a cached version of the page in your browser. Open DevTools (cmd-shift-I), press the gear icon, and check "Disable cache (while DevTools is open)", then refresh the page to reload it without the cache. Matei On August 26, 2014 at 7:31:18 AM, Nicholas Chammas

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem though is that the standalone mode grabs the executors' version of Spark code from what's installed on the cluster, while your driver might be built against

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Matei Zaharia
You can use sc.wholeTextFiles to read each file as a complete String, though it requires each file to be small enough for one task to process. On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote: i've seen this done using mapPartitions() where each partition represents a
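A hedged sketch of the `wholeTextFiles` approach for JSON objects that span several lines. It assumes each file holds exactly one JSON document and is small enough for a single task; `parse_whole_file` and `load_json_files` are illustrative names, and `sc` is an existing SparkContext supplied by the caller.

```python
import json


def parse_whole_file(path_and_content):
    """Turn one (path, file content) pair into (path, parsed JSON).

    wholeTextFiles yields such pairs, so newlines inside the JSON
    object are not a problem: the parser sees the complete text.
    """
    path, content = path_and_content
    return (path, json.loads(content))


def load_json_files(sc, directory):
    """Read every file under `directory` whole and parse it as JSON."""
    return sc.wholeTextFiles(directory).map(parse_whole_file)
```

Keeping the path next to the parsed object makes it easier to trace a malformed file when `json.loads` raises.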

Re: CUDA in spark, especially in MLlib?

2014-08-26 Thread Matei Zaharia
You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
connect to an existing 1.0.0 cluster and see what happens... Thanks, Matei :) On Tue, Aug 26, 2014 at 6:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem

[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results

2014-08-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110183#comment-14110183 ] Matei Zaharia commented on SPARK-3098: -- Sorry, I don't understand -- what exactly

[jira] [Updated] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2976: - Summary: Replace tabs with spaces (was: Too many ugly tabs instead of white spaces) Replace

[jira] [Updated] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2976: - Assignee: Kousuke Saruta Replace tabs with spaces

[jira] [Resolved] (SPARK-2976) Replace tabs with spaces

2014-08-25 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2976. -- Resolution: Fixed Fix Version/s: 1.2.0 Replace tabs with spaces

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
. This is on nodes with ~15G of memory, on which we have successfully run 8G jobs. On Mon, Aug 25, 2014 at 2:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW it seems to me that even without that patch, you should be getting tasks launched as long as you leave at least 32 MB of memory

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
for this one. Matei On August 25, 2014 at 1:07:15 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: This is kind of weird then, seems perhaps unrelated to this issue (or at least to the way I understood it). Is the problem maybe that Mesos saw 0 MB being freed and didn't re-offer the machine *even

Re: saveAsTextFile to s3 on spark does not work, just hangs

2014-08-25 Thread Matei Zaharia
Was the original issue with Spark 1.1 (i.e. master branch) or an earlier release? One possibility is that your S3 bucket is in a remote Amazon region, which would make it very slow. In my experience though saveAsTextFile has worked even for pretty large datasets in that situation, so maybe

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
Chen (tnac...@gmail.com) wrote: Hi Matei, I'm going to investigate from both the Mesos and Spark sides and will hopefully have a good long-term solution. In the meantime, having a workaround to start with is going to unblock folks. Tim On Mon, Aug 25, 2014 at 1:08 PM, Matei Zaharia matei.zaha

Re: saveAsTextFile to s3 on spark does not work, just hangs

2014-08-25 Thread Matei Zaharia
the synthetic operation and see if I get the same results or not. Amnon On Mon, Aug 25, 2014 at 11:26 PM, Matei Zaharia [via Apache Spark Developers List] ml-node+s1001551n8000...@n3.nabble.com wrote: Was the original issue with Spark 1.1 (i.e. master branch) or an earlier release? One

Re: Handling stale PRs

2014-08-25 Thread Matei Zaharia
Hey Nicholas, In general we've been looking at these periodically (at least I have) and asking people to close out-of-date ones, but it's true that the list has gotten fairly large. We should probably have an expiry time of a few months and close them automatically. I agree that it's daunting

Re: spark and matlab

2014-08-25 Thread Matei Zaharia
Have you tried the pipe() operator? It should work if you can launch your script from the command line. Just watch out for any environment variables needed (you can pass them to pipe() as an optional argument if there are some). On August 25, 2014 at 12:41:29 AM, Jaonary Rabarisoa
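A minimal sketch of the `pipe()` suggestion in PySpark-style Python. The wrapper name is hypothetical, as are the script path and the variable in the usage comment; the only Spark call used is `RDD.pipe(command, env=...)`, which sends each element to the command's stdin as one line and returns the command's output lines as a new RDD.

```python
def pipe_through_script(rdd, command, env=None):
    """Run `command` once per partition, feeding elements on stdin.

    env: optional dict of environment variables the external program
         needs; copied so the caller's dict is never modified.
    """
    return rdd.pipe(command, env=dict(env or {}))


# Hypothetical usage, assuming a wrapper script that starts the tool:
#   results = pipe_through_script(
#       lines_rdd, "./run_script.sh", {"TOOL_HOME": "/opt/tool"})
```

The script has to exist at the same path on every worker, and it should read until end of input rather than expecting a fixed number of lines.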

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Matei Zaharia
It seems to be because you went there with https:// instead of http://. That said, we'll fix it so that it works on both protocols. Matei On August 25, 2014 at 1:56:16 PM, Nick Chammas (nicholas.cham...@gmail.com) wrote: https://spark.apache.org/screencasts/1-first-steps-with-spark.html The

Re: Mesos/Spark Deadlock

2014-08-25 Thread Matei Zaharia
Chen tnac...@gmail.com wrote: +1 to have the work around in. I'll be investigating from the Mesos side too. Tim On Sun, Aug 24, 2014 at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, Mesos in coarse-grained mode probably wouldn't work here. It's too bad

Re: Mesos/Spark Deadlock

2014-08-24 Thread Matei Zaharia
, coarse-grained mode would be a challenge as we have to constantly remind people to kill their shells as soon as their queries finish. Am I correct in viewing Mesos in coarse-grained mode as being similar to Spark Standalone's CPU allocation behavior? On Sat, Aug 23, 2014 at 7:16 PM, Matei

Re: Mesos/Spark Deadlock

2014-08-23 Thread Matei Zaharia
Hey Gary, just as a workaround, note that you can use Mesos in coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold onto CPUs for the duration of the job. Matei On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com) wrote: I just wanted to bring up a
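As a sketch of the workaround, the property can be set on the application's configuration before the context is created. The helper name is made up; it assumes a SparkConf-like object with the usual `set(key, value)` method, and the app name in the usage comment is a placeholder.

```python
def enable_coarse_grained_mesos(conf):
    """Set spark.mesos.coarse=true on a SparkConf-like object.

    In coarse-grained mode the application holds on to its CPUs for
    the duration of the job instead of acquiring them per task.
    """
    conf.set("spark.mesos.coarse", "true")
    return conf


# Hypothetical usage with PySpark:
#   from pyspark import SparkConf, SparkContext
#   conf = enable_coarse_grained_mesos(SparkConf().setAppName("demo"))
#   sc = SparkContext(conf=conf)
```

The trade-off raised later in the thread still applies: a long-lived shell keeps its cores until it exits.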

Re: why classTag not typeTag?

2014-08-22 Thread Matei Zaharia
TypeTags are unfortunately not thread-safe in Scala 2.10. They were still somewhat experimental at the time so we decided not to use them. If you want though, you can probably design other APIs that pass a TypeTag around (e.g. make a method that takes an RDD[T] but also requires an implicit

Re: Installation On Windows machine

2014-08-22 Thread Matei Zaharia
You should be able to just download / unzip a Spark release and run it on a Windows machine with the provided .cmd scripts, such as bin\spark-shell.cmd. The scripts to launch a standalone cluster (e.g. start-all.sh) won't work on Windows, but you can launch a standalone cluster manually using

[jira] [Created] (SPARK-3091) Add support for caching metadata on Parquet files

2014-08-17 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3091: Summary: Add support for caching metadata on Parquet files Key: SPARK-3091 URL: https://issues.apache.org/jira/browse/SPARK-3091 Project: Spark Issue Type

[jira] [Updated] (SPARK-3085) Use compact data structures in SQL joins

2014-08-17 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3085: - Target Version/s: 1.1.0 Use compact data structures in SQL joins

[jira] [Created] (SPARK-3084) Collect broadcasted tables in parallel in joins

2014-08-16 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3084: Summary: Collect broadcasted tables in parallel in joins Key: SPARK-3084 URL: https://issues.apache.org/jira/browse/SPARK-3084 Project: Spark Issue Type

[jira] [Updated] (SPARK-3085) Use compact data structures in SQL joins

2014-08-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3085: - Description: We can reuse the CompactBuffer from Spark Core. (was: We can reuse

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia
Thanks for sharing this, Brandon! Looks like a great architecture for people to build on. Matei On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote: Hi Spark community, At Adobe Research, we're happy to open source a prototype technology called Spindle we've been

[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-08-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2736: - Priority: Major (was: Minor) Create PySpark RDD from Apache Avro File

[jira] [Commented] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-08-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14097701#comment-14097701 ] Matei Zaharia commented on SPARK-2736: -- I bumped this up to Major because the PR also

[jira] [Resolved] (SPARK-2736) Create PySpark RDD from Apache Avro File

2014-08-14 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2736. -- Resolution: Fixed Fix Version/s: 1.1.0 Create PySpark RDD from Apache Avro File

[jira] [Resolved] (SPARK-2983) improve performance of sortByKey()

2014-08-13 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-2983. -- Resolution: Fixed Fix Version/s: 1.1.0 improve performance of sortByKey

Re: Lost executors

2014-08-13 Thread Matei Zaharia
What is your Spark executor memory set to? (You can see it in Spark's web UI at http://driver:4040 under the executors tab). One thing to be aware of is that the JVM never really releases memory back to the OS, so it will keep filling up to the maximum heap size you set. Maybe 4 executors with

[jira] [Commented] (SPARK-2967) Several SQL unit test failed when sort-based shuffle is enabled

2014-08-12 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14093845#comment-14093845 ] Matei Zaharia commented on SPARK-2967: -- Good catch, this is a difference in behavior

Re: DistCP - Spark-based

2014-08-12 Thread Matei Zaharia
Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com)

[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark

2014-08-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092418#comment-14092418 ] Matei Zaharia commented on SPARK-2962: -- I thought this was fixed in https

[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091685#comment-14091685 ] Matei Zaharia commented on SPARK-2926: -- Hey Saisai, a couple of questions about

Re: No space left on device

2014-08-09 Thread Matei Zaharia
Your map-only job should not be shuffling, but if you want to see what's running, look at the web UI at http://driver:4040. In fact the job should not even write stuff to disk except inasmuch as the Hadoop S3 library might build up blocks locally before sending them on. My guess is that it's

Welcoming two new committers

2014-08-08 Thread Matei Zaharia
Hi everyone, The PMC recently voted to add two new committers and PMC members: Joey Gonzalez and Andrew Or. Both have been huge contributors in the past year -- Joey on much of GraphX as well as quite a bit of the initial work in MLlib, and Andrew on Spark Core. Join me in welcoming them as

Re: Unit tests in 5 minutes

2014-08-08 Thread Matei Zaharia
Just as a note, when you're developing stuff, you can use test-only in sbt, or the equivalent feature in Maven, to run just some of the tests. This is what I do; I don't wait for Jenkins to run things. 90% of the time if it passes the tests that I know could break stuff, it will pass all of

[jira] [Updated] (SPARK-2887) RDD.countApproxDistinct() is wrong when RDD has more one partition

2014-08-06 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2887: - Assignee: Davies Liu RDD.countApproxDistinct() is wrong when RDD has more one partition

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Emails sent from Nabble have it, while others don't. Unfortunately I haven't received a reply from ASF infra on this yet. Matei On August 5, 2014 at 2:04:10 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote: Looks like this feature has been turned off. Are these changes intentional? Or

Re: Include permalinks in mail footer

2014-08-05 Thread Matei Zaharia
Oh actually sorry, it looks like infra has looked at it but they can't add permalinks. They can only add "here's how to unsubscribe" footers. My bad, I just didn't catch the email update from them. Matei On August 5, 2014 at 2:39:45 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Emails sent
