Hello,
A coworker was having a problem with a big Spark job failing after several
hours when one of the executors would segfault. That problem aside, I
speculated that her job would be more robust against these kinds of executor
crashes if she used replicated RDD storage. She's using off heap s
any merit would be to run with
> Scala 2.11.2. I'll copy this to JIRA for continuation.
>
> On Fri, Apr 17, 2015 at 10:31 PM, Michael Allman wrote:
>> H... I don't follow. The 2.11.x series is supposed to be binary
>> compatible against user code. Anyway, I was bu
> 2.11.{x < 6} would have similar failures. It's not not-ready; it's
> just not the Scala 2.11.6 REPL. Still, sure, I'd favor breaking the
> unofficial support to at least make the latest Scala 2.11 the unbroken
> one.
>
> On Fri, Apr 17, 2015 at 7:58 AM, Micha
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here:
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211
Also, I'm experiencing som
Hello,
We're running a spark sql thriftserver that several users connect to with
beeline. One limitation we've run into is that the current working database
(set with "use ") is shared across all connections. So changing the
database on one connection changes the database for all connections. T
Hi Pierre,
I'm setting parquet (and hdfs) block size like follows:
val ONE_GB = 1024 * 1024 * 1024
sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)
Here, sc is a reference to the spark context. I've tested this and it
oping to do some upgrades of our parquet support in the near future.
>
> On Tue, Oct 7, 2014 at 10:33 PM, Michael Allman wrote:
> Hello,
>
> I was interested in testing Parquet V2 with Spark SQL, but noticed after some
> investigation that the parquet writer that Spark SQL use
Ummm... what's helium? Link, plz?
On Oct 8, 2014, at 9:13 AM, Stephen Boesch wrote:
> @kevin, Michael,
> Second that: interested in seeing the zeppelin. pls use helium though ..
>
> 2014-10-08 7:57 GMT-07:00 Michael Allman :
> Hi Andy,
>
> This sounds awes
Hi Andy,
This sounds awesome. Please keep us posted. Meanwhile, can you share a link to
your project? I wasn't able to find it.
Cheers,
Michael
On Oct 8, 2014, at 3:38 AM, andy petrella wrote:
> Heya
>
> You can check Zeppelin or my fork of the Scala notebook.
> I'm going this weekend to
uld be that it breaks the concept of window operations which are in
> Spark.
>
> Thanks,
> Jayant
>
>
>
>
> On Tue, Oct 7, 2014 at 10:19 PM, Michael Allman <[hidden email]> wrote:
> Hi Andrew,
>
> The use case I have in mind is batch data serialization
Hello,
I was interested in testing Parquet V2 with Spark SQL, but noticed after some
investigation that the parquet writer that Spark SQL uses is fixed at V1 here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350.
An
elp people interested in count-based
> windowing to understand the state of the feature in Spark Streaming.
>
> Thanks!
> Andrew
>
> On Fri, Oct 3, 2014 at 4:09 PM, Michael Allman wrote:
> Hi,
>
> I also have a use for count-based windowing. I'd like to proce
Hi,
I also have a use for count-based windowing. I'd like to process data
batches by size as opposed to time. Is this feature on the development
roadmap? Is there a JIRA ticket for it?
Thank you,
Michael
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/win
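To make the request concrete, here are the semantics I'm after, sketched outside Spark in plain Python (illustrative only, not a Spark Streaming API): emit a batch whenever a fixed number of records has arrived, regardless of how long that takes.

```python
from itertools import islice

def count_windows(stream, size):
    """Yield successive fixed-size batches from an iterable,
    emitting by element count rather than by wall-clock time."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Example: process 7 records in batches of 3.
batches = list(count_windows(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```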
Spark and Oryx implementations? Would be good to be clear on them, and also
> see if there are further tricks/enhancements from the Oryx one that can be
> ported (such as the lambda * numRatings adjustment).
>
> N
>
>
> On Sat, Mar 15, 2014 at 2:52 AM, Michael Allman <[hi
I just ran a runtime performance comparison between 0.9.0-incubating and your
als branch. I saw a 1.5x improvement in performance.
Hi Xiangrui,
I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can
you explain?
Also, thanks for addressing the issue with factor matrix persistence in PR
165. I was probably not going to get to that for a while.
I will try to test your changes today for speed improvements
I've created https://spark-project.atlassian.net/browse/SPARK-1263 to address
the issue of the factor matrix recomputation. I'm planning to submit a
related pull request shortly.
You are correct, in the long run it doesn't matter which matrix you begin the
iterative process with. I was thinking in terms of doing a side-by-side
comparison to Oryx.
I've posted a bug report as SPARK-1262. I described the problem I found and
the mitigation strategy I've used. I think that this
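The "in the long run it doesn't matter which matrix you begin with" point is easy to check on a toy problem. A NumPy sketch (synthetic, exactly rank-k data; names and sizes are illustrative): alternating regularized least squares converges to the same fit whether the user factors or the item factors are solved first.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 2
R = rng.normal(size=(10, k)) @ rng.normal(size=(k, 8))  # rank-k "ratings"
lam = 1e-6

def update(R, F, lam):
    # Regularized least squares: rows of the unknown factor, given fixed F.
    A = F.T @ F + lam * np.eye(F.shape[1])
    return np.linalg.solve(A, (R @ F).T).T

def als_error(R, k, iters, users_first, seed):
    r = np.random.default_rng(seed)
    X = r.normal(size=(R.shape[0], k))   # user factors
    Y = r.normal(size=(R.shape[1], k))   # item factors
    for _ in range(iters):
        if users_first:
            X = update(R, Y, lam)
            Y = update(R.T, X, lam)
        else:
            Y = update(R.T, X, lam)
            X = update(R, Y, lam)
    return np.linalg.norm(R - X @ Y.T)

# Whichever factor matrix is solved first, the fit ends up the same place.
err_users_first = als_error(R, k, 50, True, 1)
err_items_first = als_error(R, k, 50, False, 2)
```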
I've been thoroughly investigating this issue over the past couple of days
and have discovered quite a bit. For one thing, there is definitely (at
least) one issue/bug in the Spark implementation that leads to incorrect
results for models generated with rank > 1 or a large number of iterations.
I w
Hi Sean,
Digging deeper I've found another difference between Oryx's implementation
and Spark's. Why do you adjust lambda here?
https://github.com/cloudera/oryx/blob/master/als-common/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java#L491
Cheers,
Michael
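For anyone following along: the adjustment at that line appears to scale the regularization term by the user's rating count, i.e. the "lambda * numRatings" idea mentioned earlier in this thread. A toy sketch of the difference it makes in a single user update (NumPy, synthetic data, names illustrative):

```python
# Plain lambda vs. lambda scaled by the user's rating count.
import numpy as np

rng = np.random.default_rng(42)
k = 3
Y = rng.normal(size=(20, k))   # item factors, held fixed during a user update

def user_update(r_u, rated, Y, lam, scale_by_count):
    Yu = Y[rated]                                 # rows for rated items only
    reg = lam * len(rated) if scale_by_count else lam
    A = Yu.T @ Yu + reg * np.eye(Y.shape[1])
    return np.linalg.solve(A, Yu.T @ r_u)

rated = np.array([0, 3, 7, 11, 15])
r_u = rng.normal(size=len(rated))
x_plain = user_update(r_u, rated, Y, 0.1, False)
x_scaled = user_update(r_u, rated, Y, 0.1, True)
# Scaling by the rating count regularizes active users more heavily,
# so x_scaled is pulled closer to zero than x_plain.
```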
Thank you everyone for your feedback. It's been very helpful, and though I
still haven't found the cause of the difference between Spark and Oryx, I
feel I'm making progress.
Xiangrui asked me to create a ticket for this issue. The reason I didn't do
this originally is because it's not clear to me
Hi,
I'm implementing a recommender based on the algorithm described in
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
basis for Spark's ALS implementation for data sets with implicit features.
The data set I'm working with is proprietary and I cannot share it,
howe
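For context, the user-factor update at the heart of the paper linked above (Hu, Koren & Volinsky): binary preferences p_ui = 1 if r_ui > 0, confidences c_ui = 1 + alpha * r_ui, and a per-user regularized solve. A minimal NumPy sketch for a single user (all data synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
k = 2
Y = rng.normal(size=(6, k))                      # item factor matrix
r_u = np.array([0.0, 3.0, 0.0, 1.0, 0.0, 5.0])  # one user's implicit counts
alpha, lam = 40.0, 0.1

p = (r_u > 0).astype(float)     # preference: did the user interact at all?
C = np.diag(1.0 + alpha * r_u)  # confidence grows with the observed count

# x_u = (Y^T C Y + lam I)^{-1} Y^T C p
x_u = np.linalg.solve(Y.T @ C @ Y + lam * np.eye(k), Y.T @ C @ p)
```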
Hello,
I've been trying to run an iterative spark job that spills 1+ GB to disk
per iteration on a system with limited disk space. I believe there's
enough space if Spark would clean up unused data from previous iterations,
but as it stands the number of iterations I can run is limited by
ava
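In the meantime, the pattern I'd try is to explicitly unpersist each iteration's inputs once the next iteration is materialized, and checkpoint periodically so old lineage (and its on-disk data) becomes droppable. A rough PySpark sketch (untested; the checkpoint directory and app name are hypothetical):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="iterative-job")
sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

data = sc.parallelize(range(1000)).persist(StorageLevel.MEMORY_AND_DISK)
for i in range(20):
    updated = data.map(lambda x: x + 1).persist(StorageLevel.MEMORY_AND_DISK)
    if i % 5 == 0:
        updated.checkpoint()  # truncate lineage so earlier data can be dropped
    updated.count()           # materialize before releasing the old copy
    data.unpersist()          # free the previous iteration's blocks
    data = updated
```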