Does anyone have a good solution for the Spark source compilation issue in IntelliJ?

2015-11-03 Thread canan chen
Hi folks,

I often run into Spark compilation issues in IntelliJ, and they waste a lot of
my time. I googled it and found that others hit similar problems, but there
doesn't seem to be a reliable solution yet, so I'm wondering whether anyone
here has one. The issue happens intermittently and I don't know what causes
it; it sometimes even happens when I import a fresh copy of the Spark source
code.

I have tried lots of things (sbt clean, deleting the IntelliJ project files
and reimporting, etc.), but none of them resolve the issue.
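
For reference, the clean-and-reimport steps I have been trying look roughly
like this (just a sketch; .idea and *.iml are the standard IntelliJ project
metadata, and your layout may differ):

  build/sbt clean                # drop stale compiled classes
  rm -rf .idea                   # remove the IntelliJ project directory
  find . -name "*.iml" -delete   # remove per-module IntelliJ files
  # then re-import the project in IntelliJ (as an sbt or Maven project)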

Here's some of the error output from IntelliJ:

Error:scala:
 while compiling:
/Users/hadoop/github/spark_2/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala
during phase: jvm
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
-classpath


Re: Master build fails ?

2015-11-03 Thread Jean-Baptiste Onofré

Hi Ted,

thanks for the update. The build with sbt is in progress on my box.

Regards
JB

On 11/03/2015 03:31 PM, Ted Yu wrote:

Interesting, Sbt builds were not all failing:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/

FYI

On Tue, Nov 3, 2015 at 5:58 AM, Jean-Baptiste Onofré wrote:

Hi Jacek,

it works fine with mvn: the problem is with sbt.

I suspect a different reactor order in sbt compared to mvn.

Regards
JB

On 11/03/2015 02:44 PM, Jacek Laskowski wrote:

Hi,

Just built the sources using the following command and it worked
fine.

➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
-Dhadoop.version=2.7.1 -Dscala-2.11 -Phive -Phive-thriftserver
-DskipTests clean install
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:15 min
[INFO] Finished at: 2015-11-03T14:40:40+01:00
[INFO] Final Memory: 438M/1972M
[INFO] ------------------------------------------------------------------------

➜  spark git:(master) ✗ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

I'm on Mac OS.

Pozdrawiam,
Jacek

--
Jacek Laskowski | http://blog.japila.pl |
http://blog.jaceklaskowski.pl
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Tue, Nov 3, 2015 at 1:37 PM, Jean-Baptiste Onofré wrote:

Thanks for the update, I used mvn to build but without hive
profile.

Let me try with mvn with the same options as you and sbt also.

I keep you posted.

Regards
JB

On 11/03/2015 12:55 PM, Jeff Zhang wrote:


I found it is due to SPARK-11073.

Here's the command I used to build:

build/sbt clean compile -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Psparkr

On Tue, Nov 3, 2015 at 7:52 PM, Jean-Baptiste Onofré wrote:

  Hi Jeff,

  it works for me (with skipping the tests).

  Let me try again, just to be sure.

  Regards
  JB


  On 11/03/2015 11:50 AM, Jeff Zhang wrote:

  Looks like it's due to a Guava version conflict: I see both Guava 14.0.1
  and 16.0.1 under lib_managed/bundles. Has anyone else hit this issue?

  [error] /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:26:
  object HashCodes is not a member of package com.google.common.hash
  [error] import com.google.common.hash.HashCodes
  [error]        ^
  [info] Resolving org.apache.commons#commons-math;2.2 ...
  [error] /Users/jzhang/github/spark_apache/core/src/main/scala/org/apache/spark/SecurityManager.scala:384:
  not found: value HashCodes
  [error] val cookie = HashCodes.fromBytes(secret).toString()
  [error]              ^
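
  (A quick way to confirm the conflict, assuming the lib_managed/bundles
  layout mentioned above; this is a sketch, not a guaranteed fix, since the
  root cause was traced to SPARK-11073 earlier in the thread:)

  ls lib_managed/bundles | grep -i guava   # both 14.0.1 and 16.0.1 show up here
  rm -rf lib_managed                       # force sbt to re-retrieve the managed jars
  build/sbt clean compile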




  --
  Best Regards

  Jeff Zhang


  --
  Jean-Baptiste Onofré
  jbono...@apache.org
  http://blog.nanthrax.net
  Talend - http://www.talend.com

  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org

Re: Implementation of RNN/LSTM in Spark

2015-11-03 Thread Disha Shrivastava
Hi Julio,

Can you please cite some references on the distributed implementation?

On Tue, Nov 3, 2015 at 8:52 PM, Julio Antonio Soto de Vicente <
ju...@esbet.es> wrote:

> Hi,
> It is my understanding that little research has been done so far on
> distributed computation (without access to shared memory) for RNNs. I also
> look forward to contributing in this respect.
>
> On 03/11/2015, at 16:00, Disha Shrivastava wrote:
>
> I would love to work on this and would welcome ideas on how it could be
> done, or suggestions of papers as a starting point. Also, I wanted to know
> whether Spark would be an ideal platform for a distributed implementation of
> RNN/LSTM.
>
> On Mon, Nov 2, 2015 at 10:52 AM, Sasaki Kai  wrote:
>
>> Hi, Disha
>>
>> There seems to be no JIRA on RNN/LSTM directly, but there are several
>> tickets about other types of networks related to deep learning.
>>
>> Stacked Auto Encoder
>> https://issues.apache.org/jira/browse/SPARK-2623
>> CNN
>> https://issues.apache.org/jira/browse/SPARK-9129
>> https://issues.apache.org/jira/browse/SPARK-9273
>>
>> Roadmap of MLlib deep learning
>> https://issues.apache.org/jira/browse/SPARK-5575
>>
>> I think it may be good to join the discussion on SPARK-5575.
>> Best
>>
>> Kai Sasaki
>>
>>
>> On Nov 2, 2015, at 1:59 PM, Disha Shrivastava 
>> wrote:
>>
>> Hi,
>>
>> I wanted to know if someone is working on implementing RNN/LSTM in Spark
>> or has already done so. I am also willing to contribute to it and would
>> appreciate some guidance on how to go about it.
>>
>> Thanks and Regards
>> Disha
>> Masters Student, IIT Delhi
>>
>>
>>
>


Re: Unchecked contribution (JIRA and PR)

2015-11-03 Thread Jerry Lam
Sergio, you are not alone for sure. Check the RowSimilarity implementation
[SPARK-4823]; it has been there for 6 months. Contributions that aren't
merged into the Spark version they were developed against will very likely
never be merged, because Spark changes quite significantly from version to
version when an algorithm depends heavily on internal APIs.

On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:

> Sergio,
>
> Usually it takes a lot of effort to get something merged into Spark
> itself, especially for relatively new algorithms that have not yet
> established themselves. I will leave it to the MLlib maintainers to comment
> on the specifics of the individual algorithms proposed here.
>
> Just another general comment: we have been working on making packages as
> easy to use as possible for Spark users. Right now, including a package only
> requires passing a simple flag to the spark-submit script.
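>
> For example, something along these lines (the package coordinates and
> application name here are just illustrative):
>
>   spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 \
>     --class com.example.MyApp myapp.jar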
>
>
> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez  wrote:
>
>> Hello all:
>>
>> I developed two packages for MLlib in March. These have also been uploaded
>> to the spark-packages repository. Associated with these packages, I created
>> two JIRA issues and the corresponding pull requests, which are listed
>> below:
>>
>> https://github.com/apache/spark/pull/5184
>> https://github.com/apache/spark/pull/5170
>>
>> https://issues.apache.org/jira/browse/SPARK-6531
>> https://issues.apache.org/jira/browse/SPARK-6509
>>
>> These remain unassigned in JIRA and unverified in GitHub.
>>
>> Could anyone explain why they are still in this state? Is this normal?
>>
>> Thanks!
>>
>> Sergio R.
>>
>> --
>>
>> Sergio Ramírez Gallego
>> Research group on Soft Computing and Intelligent Information Systems,
>> Dept. Computer Science and Artificial Intelligence,
>> University of Granada, Granada, Spain.
>> Email: srami...@decsai.ugr.es
>> Research Group URL: http://sci2s.ugr.es/
>>
>> -
>>
>> This email and any file attached to it (when applicable) contain(s)
>> confidential information that is exclusively addressed to its
>> recipient(s). If you are not the indicated recipient, you are informed
>> that reading, using, disseminating and/or copying it without
>> authorisation is forbidden in accordance with the legislation in effect.
>> If you have received this email by mistake, please immediately notify
>> the sender of the situation by resending it to their email address.
>> Avoid printing this message if it is not absolutely necessary.
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Justin Uang
Thanks for your response. I was worried about #3, versus being able to use
the objects directly. #2 seems to be the dealbreaker for my use case, right?
Even if I am using Tachyon for caching, if an executor is lost, then that
partition is lost for the purposes of Spark?

On Tue, Nov 3, 2015 at 5:53 PM Reynold Xin  wrote:

> I don't think there is any special handling w.r.t. Tachyon vs in-heap
> caching. As a matter of fact, I think the current offheap caching
> implementation is pretty bad, because:
>
> 1. There is no namespace sharing in offheap mode
> 2. Similar to 1, you cannot recover the offheap memory once Spark driver
> or executor crashes
> 3. It requires expensive serialization to go offheap
>
> It would've been simpler to just treat Tachyon as a normal file system,
> and use it that way to at least satisfy 1 and 2, and also substantially
> simplify the internals.
>
>
>
>
> On Tue, Nov 3, 2015 at 7:59 AM, Justin Uang  wrote:
>
>> Yup, but I'm wondering what happens when an executor does get removed
>> while we're using Tachyon. Will the cached data still be available, since
>> we're using off-heap storage and the data isn't stored in the executor?
>>
>> On Tue, Nov 3, 2015 at 4:57 PM Ryan Williams <
>> ryan.blake.willi...@gmail.com> wrote:
>>
>>> fwiw, I think that having cached RDD partitions prevents executors from
>>> being removed under dynamic allocation by default; see SPARK-8958. The
>>> "spark.dynamicAllocation.cachedExecutorIdleTimeout" config controls this.
>>>
>>> On Fri, Oct 30, 2015 at 12:14 PM Justin Uang 
>>> wrote:
>>>
 Hey guys,

 According to the docs for 1.5.1, when an executor is removed for
 dynamic allocation, the cached data is gone. If I use off-heap storage like
 tachyon, conceptually there isn't this issue anymore, but is the cached
 data still available in practice? This would be great because then we would
 be able to set spark.dynamicAllocation.cachedExecutorIdleTimeout to be
 quite small.

 ==
 In addition to writing shuffle files, executors also cache data either
 on disk or in memory. When an executor is removed, however, all cached data
 will no longer be accessible. There is currently not yet a solution for
 this in Spark 1.2. In future releases, the cached data may be preserved
 through an off-heap storage similar in spirit to how shuffle files are
 preserved through the external shuffle service.
 ==
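
 (For reference, the relevant settings look roughly like this; the values are
 illustrative, and dynamic allocation also needs the external shuffle service
 to be enabled:)

 spark-submit \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=60s \
   ...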

>>>
>


Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Justin Uang
Alright, we'll just stick with normal caching then.

Just for future reference, how much work would it be to retain the
partitions in Tachyon? This would be especially helpful in a multitenant
situation, where many users each have their own persistent Spark contexts,
but the notebooks can be idle for long periods of time while holding onto
cached RDDs.

On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin  wrote:

> It is lost unfortunately (although can be recomputed automatically).
>
>
> On Tue, Nov 3, 2015 at 1:13 PM, Justin Uang  wrote:
>
>> Thanks for your response. I was worried about #3, vs being able to use
>> the objects directly. #2 seems to be the dealbreaker for my use case right?
>> Even if it I am using tachyon for caching, if an executor is lost, then
>> that partition is lost for the purposes of spark?
>>
>> On Tue, Nov 3, 2015 at 5:53 PM Reynold Xin  wrote:
>>
>>> I don't think there is any special handling w.r.t. Tachyon vs in-heap
>>> caching. As a matter of fact, I think the current offheap caching
>>> implementation is pretty bad, because:
>>>
>>> 1. There is no namespace sharing in offheap mode
>>> 2. Similar to 1, you cannot recover the offheap memory once Spark driver
>>> or executor crashes
>>> 3. It requires expensive serialization to go offheap
>>>
>>> It would've been simpler to just treat Tachyon as a normal file system,
>>> and use it that way to at least satisfy 1 and 2, and also substantially
>>> simplify the internals.
>>>
>>>
>>>
>>>
>>> On Tue, Nov 3, 2015 at 7:59 AM, Justin Uang 
>>> wrote:
>>>
 Yup, but I'm wondering what happens when an executor does get removed,
 but when we're using tachyon. Will the cached data still be available,
 since we're using off-heap storage, so the data isn't stored in the
 executor?

 On Tue, Nov 3, 2015 at 4:57 PM Ryan Williams <
 ryan.blake.willi...@gmail.com> wrote:

> fwiw, I think that having cached RDD partitions prevents executors
> from being removed under dynamic allocation by default; see SPARK-8958
> . The
> "spark.dynamicAllocation.cachedExecutorIdleTimeout" config
> 
> controls this.
>
> On Fri, Oct 30, 2015 at 12:14 PM Justin Uang 
> wrote:
>
>> Hey guys,
>>
>> According to the docs for 1.5.1, when an executor is removed for
>> dynamic allocation, the cached data is gone. If I use off-heap storage 
>> like
>> tachyon, conceptually there isn't this issue anymore, but is the cached
>> data still available in practice? This would be great because then we 
>> would
>> be able to set spark.dynamicAllocation.cachedExecutorIdleTimeout to be
>> quite small.
>>
>> ==
>> In addition to writing shuffle files, executors also cache data
>> either on disk or in memory. When an executor is removed, however, all
>> cached data will no longer be accessible. There is currently not yet a
>> solution for this in Spark 1.2. In future releases, the cached data may 
>> be
>> preserved through an off-heap storage similar in spirit to how shuffle
>> files are preserved through the external shuffle service.
>> ==
>>
>
>>>
>


Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond
to this email explaining why you use it over the coarse grained mode?

Thanks.


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build interactive data analysis platform
because the interactive session could be long and might not use Spark for
the entire session. It is very wasteful of resources if we used the
coarse-grained mode because it keeps resource for the entire session.
Therefore, fine-grained mode was used.

Knowing that Spark now supports dynamic resource allocation with coarse
grained mode, we were thinking about using it. However, we decided to
switch to YARN because, in addition to dynamic allocation, it has better
support for security.

On Tue, Nov 3, 2015 at 7:22 PM, Soren Macbeth  wrote:

> we use fine-grained mode. coarse-grained mode keeps JVMs around which
> often leads to OOMs, which in turn kill the entire executor, causing entire
> stages to be retried. In fine-grained mode, only the task fails and
> subsequently gets retried without taking out an entire stage or worse.
>
> On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin  wrote:
>
>> If you are using Spark with Mesos fine grained mode, can you please
>> respond to this email explaining why you use it over the coarse grained
>> mode?
>>
>> Thanks.
>>
>>
>


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Soren Macbeth
We use fine-grained mode. Coarse-grained mode keeps JVMs around, which often
leads to OOMs, which in turn kill the entire executor, causing entire
stages to be retried. In fine-grained mode, only the task fails and
subsequently gets retried without taking out an entire stage or worse.

On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin  wrote:

> If you are using Spark with Mesos fine grained mode, can you please
> respond to this email explaining why you use it over the coarse grained
> mode?
>
> Thanks.
>
>


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
Soren,

If I understand how Mesos works correctly, even the fine grained mode keeps
the JVMs around?


On Tue, Nov 3, 2015 at 4:22 PM, Soren Macbeth  wrote:

> we use fine-grained mode. coarse-grained mode keeps JVMs around which
> often leads to OOMs, which in turn kill the entire executor, causing entire
> stages to be retried. In fine-grained mode, only the task fails and
> subsequently gets retried without taking out an entire stage or worse.
>
> On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin  wrote:
>
>> If you are using Spark with Mesos fine grained mode, can you please
>> respond to this email explaining why you use it over the coarse grained
>> mode?
>>
>> Thanks.
>>
>>
>


[VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-03 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.2
[ ] -1 Do not release this package because ...


The release fixes 59 known issues in Spark 1.5.1, listed here:
http://s.apache.org/spark-1.5.2

The tag to be voted on is v1.5.2-rc2:
https://github.com/apache/spark/releases/tag/v1.5.2-rc2

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
- as version 1.5.2-rc2:
https://repository.apache.org/content/repositories/orgapachespark-1153
- as version 1.5.2:
https://repository.apache.org/content/repositories/orgapachespark-1152

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and
reporting any regressions.
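
For example, one way to sanity-check a downloaded binary before running a
workload on it (the artifact file name below is illustrative):

  curl -O http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/spark-1.5.2-bin-hadoop2.6.tgz
  curl -O http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/spark-1.5.2-bin-hadoop2.6.tgz.asc
  curl https://people.apache.org/keys/committer/pwendell.asc | gpg --import
  gpg --verify spark-1.5.2-bin-hadoop2.6.tgz.asc spark-1.5.2-bin-hadoop2.6.tgz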


===
What justifies a -1 vote for this release?
===
A -1 vote should occur for regressions from Spark 1.5.1. Bugs already present
in 1.5.1 will not block this release.


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Timothy Chen
Fine-grained mode does reuse the same JVM, but perhaps with different
placement or different allocated cores compared to the same total memory
allocation.

Tim

Sent from my iPhone

> On Nov 3, 2015, at 6:00 PM, Reynold Xin  wrote:
> 
> Soren,
> 
> If I understand how Mesos works correctly, even the fine grained mode keeps 
> the JVMs around?
> 
> 
>> On Tue, Nov 3, 2015 at 4:22 PM, Soren Macbeth  wrote:
>> we use fine-grained mode. coarse-grained mode keeps JVMs around which often 
>> leads to OOMs, which in turn kill the entire executor, causing entire stages 
>> to be retried. In fine-grained mode, only the task fails and subsequently 
>> gets retried without taking out an entire stage or worse. 
>> 
>>> On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin  wrote:
>>> If you are using Spark with Mesos fine grained mode, can you please respond 
>>> to this email explaining why you use it over the coarse grained mode?
>>> 
>>> Thanks.
> 


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread MEETHU MATHEW
Hi,
We are using Mesos fine-grained mode because it lets multiple instances of
Spark share machines, with each application getting resources allocated
dynamically.

Thanks & Regards,
Meethu M


On Wednesday, 4 November 2015 at 5:24 AM, Reynold Xin wrote:

If you are using Spark with Mesos fine grained mode, can you please respond
to this email explaining why you use it over the coarse grained mode?

Thanks.

Re: Info about Dataset

2015-11-03 Thread Sandy Ryza
Hi Justin,

The Dataset API proposal is available here:
https://issues.apache.org/jira/browse/SPARK-.

-Sandy

On Tue, Nov 3, 2015 at 1:41 PM, Justin Uang  wrote:

> Hi,
>
> I was looking through some of the PRs slated for 1.6.0 and I noted
> something called a Dataset, which looks like a new concept based off of the
> scaladoc for the class. Can anyone point me to some references/design_docs
> regarding the choice to introduce the new concept? I presume it is probably
> something to do with performance optimizations?
>
> Thanks!
>
> Justin
>


Getting new metrics into /api/v1

2015-11-03 Thread Charles Yeh
Hello,

I'm trying to get maxCores and memoryPerExecutorMB into /api/v1 for this
ticket: https://issues.apache.org/jira/browse/SPARK-10565

I can't figure out which *getApplicationInfoList* is used by
*ApiRootResource.scala*. It's attached in SparkUI, but SparkUI's doesn't have
start / end times and /api/v1/applications does.

It looks like:

   - *MasterWebUI.scala* has these fields, since it has the applications
   themselves
   - *HistoryServer.scala* doesn't have these fields, since it infers them
   from logs
   - *SparkUI.scala* looks like a mock, since it doesn't have end time /
   user / attempt id either
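
(For comparison, the REST endpoint that does expose the start / end times can
be hit directly; the host and port below assume a local driver UI, or 18080
for a history server:)

   curl http://localhost:4040/api/v1/applications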