Anyone else having trouble with replicated off heap RDD persistence?

2016-08-16 Thread Michael Allman
Hello, A coworker was having a problem with a big Spark job failing after several hours when one of the executors would segfault. That problem aside, I speculated that her job would be more robust against these kinds of executor crashes if she used replicated RDD storage. She's using off heap s
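
A minimal sketch of what replicated off-heap persistence could look like, assuming a Spark 2.x deployment with off-heap memory enabled; the storage level, configuration values, and RDD below are illustrative, not taken from the job described in the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object OffHeapReplicatedSketch {
      def main(args: Array[String]): Unit = {
        // Off-heap memory must be explicitly enabled and sized.
        val spark = SparkSession.builder()
          .appName("off-heap-replicated-sketch")
          .config("spark.memory.offHeap.enabled", "true")
          .config("spark.memory.offHeap.size", "2g")
          .getOrCreate()
        val sc = spark.sparkContext

        // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
        // Like the built-in OFF_HEAP level, but with 2 replicas so a copy of
        // each block survives the loss of a single executor.
        val offHeapReplicated = StorageLevel(true, true, true, false, 2)

        val data = sc.parallelize(1 to 1000000)
        data.persist(offHeapReplicated)
        println(data.count())

        spark.stop()
      }
    }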

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
any merit would be to run with > Scala 2.11.2. I'll copy this to JIRA for continuation. > > On Fri, Apr 17, 2015 at 10:31 PM, Michael Allman wrote: >> H... I don't follow. The 2.11.x series is supposed to be binary >> compatible against user code. Anyway, I was bu

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
a > 2.11.{x < 6} would have similar failures. It's not not-ready; it's > just not the Scala 2.11.6 REPL. Still, sure I'd favor breaking the > unofficial support to at least make the latest Scala 2.11 the unbroken > one. > > On Fri, Apr 17, 2015 at 7:58 AM, Micha

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211. Also, I'm experiencing som

independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Allman
Hello, We're running a spark sql thriftserver that several users connect to with beeline. One limitation we've run into is that the current working database (set with "use <database>") is shared across all connections. So changing the database on one connection changes the database for all connections. T
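
A sketch of the limitation described here, as seen from two JDBC clients connected to the same thriftserver; the host, database, and table names are hypothetical:

    import java.sql.DriverManager

    object SharedCurrentDatabaseSketch {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val url = "jdbc:hive2://thriftserver-host:10000"

        val connA = DriverManager.getConnection(url, "userA", "")
        val connB = DriverManager.getConnection(url, "userB", "")

        // Connection A picks a working database...
        connA.createStatement().execute("USE warehouse_a")
        // ...but B's USE changes it for A as well, because the current
        // database is shared across all connections in this Spark 1.1 setup.
        connB.createStatement().execute("USE warehouse_b")

        // Unqualified table names from A now resolve against warehouse_b.
        val rs = connA.createStatement().executeQuery("SELECT count(*) FROM events")
        while (rs.next()) println(rs.getLong(1))

        connA.close()
        connB.close()
      }
    }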

Re: [SQL] Set Parquet block size?

2014-10-09 Thread Michael Allman
Hi Pierre, I'm setting the parquet (and hdfs) block size as follows: val ONE_GB = 1024 * 1024 * 1024 sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB) sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB) Here, sc is a reference to the spark context. I've tested this and it
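
Filling out the snippet above into a self-contained sketch: the settings are applied to the hadoop configuration before the Parquet write, which picks them up at write time. Spark 1.1-era API, with a hypothetical schema and output path:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, payload: String)

    object ParquetBlockSizeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-block-size-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD

        // 1 GB blocks for both HDFS and Parquet row groups, as in the message above.
        val ONE_GB = 1024 * 1024 * 1024
        sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
        sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)

        // The Parquet write reads the block sizes from the hadoop configuration.
        val records = sc.parallelize(1 to 1000000).map(i => Record(i, s"payload-$i"))
        records.saveAsParquetFile("/tmp/records.parquet")
      }
    }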

Re: Support for Parquet V2 in ParquetTableSupport?

2014-10-08 Thread Michael Allman
oping to do some upgrades of our parquet support in the near future. > > On Tue, Oct 7, 2014 at 10:33 PM, Michael Allman wrote: > Hello, > > I was interested in testing Parquet V2 with Spark SQL, but noticed after some > investigation that the parquet writer that Spark SQL use

Re: Interactive interface tool for spark

2014-10-08 Thread Michael Allman
Ummm... what's helium? Link, plz? On Oct 8, 2014, at 9:13 AM, Stephen Boesch wrote: > @kevin, Michael, > Second that: interested in seeing the zeppelin. pls use helium though .. > > 2014-10-08 7:57 GMT-07:00 Michael Allman : > Hi Andy, > > This sounds awes

Re: Interactive interface tool for spark

2014-10-08 Thread Michael Allman
Hi Andy, This sounds awesome. Please keep us posted. Meanwhile, can you share a link to your project? I wasn't able to find it. Cheers, Michael On Oct 8, 2014, at 3:38 AM, andy petrella wrote: > Heya > > You can check Zeppelin or my fork of the Scala notebook. > I'm going this weekend to

Re: window every n elements instead of time based

2014-10-07 Thread Michael Allman
uld be that it breaks the concept of window operations which are in > Spark. > > Thanks, > Jayant > > > > > On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman <[hidden email]> wrote: > Hi Andrew, > > The use case I have in mind is batch data serialization

Support for Parquet V2 in ParquetTableSupport?

2014-10-07 Thread Michael Allman
Hello, I was interested in testing Parquet V2 with Spark SQL, but noticed after some investigation that the parquet writer that Spark SQL uses is fixed at V1 here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350. An
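
For anyone comparing against stock parquet-mr: the property below is how the writer version is normally selected there (sc being the SparkContext, as in the block-size message elsewhere in this list). Per this thread, the Spark SQL writer linked above pins V1 and so presumably ignores it; that is an inference from the thread, not something verified here.

    // parquet-mr's own knob for the file format version ("v1" or "v2").
    // Per this thread, Spark SQL's ParquetTableSupport hard-codes the V1 writer,
    // so this setting presumably has no effect on that code path.
    sc.hadoopConfiguration.set("parquet.writer.version", "v2")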

Re: window every n elements instead of time based

2014-10-07 Thread Michael Allman
elp people interested in count-based > windowing to understand the state of the feature in Spark Streaming. > > Thanks! > Andrew > > On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman wrote: > Hi, > > I also have a use for count-based windowing. I'd like to proce

Re: window every n elements instead of time based

2014-10-03 Thread Michael Allman
Hi, I also have a use for count-based windowing. I'd like to process data batches by size as opposed to time. Is this feature on the development roadmap? Is there a JIRA ticket for it? Thank you, Michael -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/win
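
For contrast with the count-based windowing being requested, this is the time-based window API Spark Streaming already offers; a minimal sketch with a hypothetical socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TimeWindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("time-window-sketch")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Existing API: windows are defined by duration, not by element count.
        val lines = ssc.socketTextStream("localhost", 9999)
        val windowed = lines.window(Seconds(30), Seconds(10))
        windowed.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }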

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Michael Allman
Spark and Oryx implementations? Would be good to be clear on them, and also > see if there are further tricks/enhancements from the Oryx one that can be > ported (such as the lambda * numRatings adjustment). > > N > > > On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <[hi
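
For readers following along, the "lambda * numRatings adjustment" mentioned here appears to be the weighted-λ-regularization from Zhou et al.'s ALS-WR paper, in which each factor vector's penalty is scaled by the number of ratings for that user or item (an interpretation added for context, not something stated in this thread):

    \min_{X,Y} \sum_{(u,i) \in \mathcal{R}} \left( r_{ui} - x_u^{\top} y_i \right)^2
      + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i m_i \lVert y_i \rVert^2 \right)

where n_u is the number of ratings from user u and m_i the number of ratings on item i, so each row's effective regularizer is lambda multiplied by its rating count.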

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tp2567p2823.html Sent from the Apach

Re: possible bug in Spark's ALS implementation...

2014-03-18 Thread Michael Allman
Hi Xiangrui, I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can you explain? Also, thanks for addressing the issue with factor matrix persistence in PR 165. I was probably not going to get to that for a while. I will try to test your changes today for speed improvements

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Michael Allman
I've created https://spark-project.atlassian.net/browse/SPARK-1263 to address the issue of the factor matrix recomputation. I'm planning to submit a related pull request shortly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-imp

Re: possible bug in Spark's ALS implementation...

2014-03-17 Thread Michael Allman
You are correct, in the long run it doesn't matter which matrix you begin the iterative process with. I was thinking in terms of doing a side-by-side comparison to Oryx. I've posted a bug report as SPARK-1262. I described the problem I found and the mitigation strategy I've used. I think that this

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Michael Allman
I've been thoroughly investigating this issue over the past couple of days and have discovered quite a bit. For one thing, there is definitely (at least) one issue/bug in the Spark implementation that leads to incorrect results for models generated with rank > 1 or a large number of iterations. I w

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Michael Allman
Hi Sean, Digging deeper I've found another difference between Oryx's implementation and Spark's. Why do you adjust lambda here? https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java#L491 Cheers, Michael

Re: possible bug in Spark's ALS implementation...

2014-03-12 Thread Michael Allman
Thank you everyone for your feedback. It's been very helpful, and though I still haven't found the cause of the difference between Spark and Oryx, I feel I'm making progress. Xiangrui asked me to create a ticket for this issue. The reason I didn't do this originally is because it's not clear to me

possible bug in Spark's ALS implementation...

2014-03-11 Thread Michael Allman
Hi, I'm implementing a recommender based on the algorithm described in http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the basis for Spark's ALS implementation for data sets with implicit features. The data set I'm working with is proprietary and I cannot share it, howe
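
For orientation, a minimal sketch of calling the implicit-feedback ALS under discussion, assuming the MLlib API of that era; the input path and hyperparameter values are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object ImplicitAlsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("implicit-als-sketch"))

        // user,item,count triples; the "rating" is an implicit interaction count.
        val ratings = sc.textFile("/tmp/interactions.csv").map { line =>
          val Array(user, item, count) = line.split(',')
          Rating(user.toInt, item.toInt, count.toDouble)
        }

        // Arguments: ratings, rank, iterations, lambda, alpha
        // (alpha is the confidence scaling from the Hu/Koren/Volinsky paper).
        val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 40.0)

        println(model.predict(1, 1))
      }
    }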

is spark.cleaner.ttl safe?

2014-03-11 Thread Michael Allman
Hello, I've been trying to run an iterative spark job that spills 1+ GB to disk per iteration on a system with limited disk space. I believe there's enough space if spark would clean up unused data from previous iterations, but as it stands the number of iterations I can run is limited by ava
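
For reference, spark.cleaner.ttl is set like any other Spark property and is a duration in seconds; a minimal sketch (the one-hour value is illustrative), with the caveat that whether it can evict data a running job still needs is exactly the question this message asks:

    import org.apache.spark.{SparkConf, SparkContext}

    // Metadata and materialized data older than the TTL are periodically cleaned.
    val conf = new SparkConf()
      .setAppName("cleaner-ttl-sketch")
      .set("spark.cleaner.ttl", "3600") // seconds; illustrative value
    val sc = new SparkContext(conf)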