Re: SPARK-13843 and future of streaming backends

2016-03-19 Thread Hari Shreedharan
I have worked with various ASF projects for 4+ years now. Sure, ASF
projects can delete code as they see fit. But this is the first time I
have really seen code being "moved out" of a project without discussion. I
am sure you can do this without violating ASF policy, but the explanation
for that would be convoluted (someone decided to make a copy and then the
ASF project deleted it?).

Also, moving the code out would break compatibility. AFAIK, there is no way
to push org.apache.* artifacts directly to maven central. That happens via
mirroring from the ASF maven repos. Even if you could somehow directly
push the artifacts to mvn, you can push to org.apache.* groups only
if you are part of the project and acting as an agent of it (which
in this case would be Apache Spark). Once you move the code out, even a
committer/PMC member would not be representing the ASF when pushing the
code. I am not sure if there is a way to fix this issue.
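
(As a purely illustrative sketch of the compatibility point above - the coordinates
below are hypothetical, not actual published artifacts. Today users pull the connectors
via the ASF-released org.apache.spark groupId, e.g. in sbt:

    libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"

Anything re-published from an external, non-ASF repo would have to move to a different
groupId, something like:

    libraryDependencies += "com.example.connectors" %% "spark-streaming-flume" % "1.0.0"

which is exactly the break in Maven coordinates being described.)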


Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan 
wrote:

> I am not referring to code edits - but to migrating submodules and
> code currently in Apache Spark to 'outside' of it.
> If I understand correctly, assets from Apache Spark are being moved
> out of it into thirdparty external repositories - not owned by Apache.
>
> At a minimum, dev@ discussion (like this one) should be initiated.
> As PMC is responsible for the project assets (including code), signoff
> is required for it IMO.
>
> More experienced Apache members might opine better in case I got it
> wrong!
>
>
> Regards,
> Mridul
>
>
> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger 
> wrote:
> > Why would a PMC vote be necessary on every code deletion?
> >
> > There was a Jira and pull request discussion about the submodules that
> > have been removed so far.
> >
> > https://issues.apache.org/jira/browse/SPARK-13843
> >
> > There's another ongoing one about Kafka specifically
> >
> > https://issues.apache.org/jira/browse/SPARK-13877
> >
> >
> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan 
> wrote:
> >>
> >> I was not aware of a discussion in Dev list about this - agree with
> most of
> >> the observations.
> >> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>
> >> Regards
> >> Mridul
> >>
> >>
> >>
> >> On Thursday, March 17, 2016, Marcelo Vanzin 
> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> Recently a lot of the streaming backends were moved to a separate
> >>> project on github and removed from the main Spark repo.
> >>>
> >>> While I think the idea is great, I'm a little worried about the
> >>> execution. Some concerns were already raised on the bug mentioned
> >>> above, but I'd like to have a more explicit discussion about this so
> >>> things don't fall through the cracks.
> >>>
> >>> Mainly I have three concerns.
> >>>
> >>> i. Ownership
> >>>
> >>> That code used to be run by the ASF, but now it's hosted in a github
> >>> repo not owned by the ASF. That sounds a little sub-optimal, if not
> >>> problematic.
> >>>
> >>> ii. Governance
> >>>
> >>> Similar to the above; who has commit access to the above repos? Will
> >>> all the Spark committers, present and future, have commit access to
> >>> all of those repos? Are they still going to be considered part of
> >>> Spark and have release management done through the Spark community?
> >>>
> >>>
> >>> For both of the questions above, why are they not turned into
> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>> a mechanism to do that, without the need to keep the code in the main
> >>> Spark repo, right?
> >>>
> >>> iii. Usability
> >>>
> >>> This is another thing I don't see discussed. For Scala-based code
> >>> things don't change much, I guess, if the artifact names don't change
> >>> (another reason to keep things in the ASF?), but what about python?
> >>> How are pyspark users expected to get that code going forward, since
> >>> it's not in Spark's pyspark.zip anymore?
> >>>
> >>>
> >>> Is there an easy way of keeping these things within the ASF Spark
> >>> project? I think that would be better for everybody.
> >>>
> >>> --
> >>> Marcelo
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-09 Thread Hari Shreedharan
The credentials file approach (using keytab for spark apps) will only
update HDFS tokens. YARN's AMRM tokens should be taken care of by YARN
internally.

Steve - correct me if I am wrong here: If the AMRM tokens are disappearing
it might be a YARN bug (does the AMRM token have a 7 day limit as well? I
thought that was only for HDFS).
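
(For reference, a rough sketch of the keytab approach mentioned above, assuming Spark
1.4+ where SPARK-5342 added the --principal/--keytab options; the principal, paths and
class name below are placeholders. This only keeps HDFS delegation tokens fresh; it
does not touch the AMRM token.)

    ./bin/spark-submit --master yarn-cluster \
      --principal prabhu@EXAMPLE.COM \
      --keytab /path/to/prabhu.keytab \
      --class com.example.LongRunningApp app.jar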


Thanks,
Hari

On Tue, Feb 9, 2016 at 8:44 AM, Steve Loughran 
wrote:

>
> On 9 Feb 2016, at 11:26, Steve Loughran  wrote:
>
>
> On 9 Feb 2016, at 05:55, Prabhu Joseph  wrote:
>
> + Spark-Dev
>
> On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph  > wrote:
>
>> Hi All,
>>
>> A long running Spark job on YARN throws below exception after running
>> for few days.
>>
>> yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row.
>> org.apache.hadoop.yarn.exceptions.YarnException: *No AMRMToken found* for
>> user prabhu at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
>>
>> Do any of the below renew the AMRMToken and solve the issue?
>>
>> 1. yarn.resourcemanager.delegation.token.max-lifetime increase from 7 days
>>
>> 2. Configuring Proxy user:
>>
>>  <property>
>>    <name>hadoop.proxyuser.yarn.hosts</name>
>>    <value>*</value>
>>  </property>
>>  <property>
>>    <name>hadoop.proxyuser.yarn.groups</name>
>>    <value>*</value>
>>  </property>
>>
>
> wouldn't do that: security issues
>
>
>> 3. Can Spark-1.4.0 handle with fix
>> https://issues.apache.org/jira/browse/SPARK-5342
>>
>> spark.yarn.credentials.file
>>
>>
>>
> I'll say "maybe" there
>
>
> uprated to a no, having looked at the code more
>
>
> How to renew the AMRMToken for a long running job on YARN?
>>
>>
>>
>
> AMRM token renewal should be automatic in the AM; YARN sends a message to the
> AM (actually an allocate() response with no containers but a new token at
> the tail of the message).
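
(Roughly what happens under the covers in the AM / AMRMClient when the RM rolls the
AMRM master key - sketched below for illustration only, this is not the actual Hadoop
code; `amRmClient`, `progress` and `rmAddress` are assumed to exist in context:

    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
    import org.apache.hadoop.yarn.util.ConverterUtils

    // On each heartbeat, the allocate() response may carry a fresh AMRM token.
    val response: AllocateResponse = amRmClient.allocate(progress)
    if (response.getAMRMToken != null) {
      // Convert the YARN token record into a security token and add it to the
      // current UGI so subsequent RM calls authenticate with the new token.
      val newToken = ConverterUtils.convertFromYarn(response.getAMRMToken, rmAddress)
      UserGroupInformation.getCurrentUser.addToken(newToken)
    }

If that is not happening, it is consistent with the suggestion above that this would be
a YARN-side issue rather than a Spark one.)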
>
> I don't see any logging in the Hadoop code there (AMRMClientImpl); filed
> YARN-4682 to add a log statement.
>
> if someone other than me were to supply a patch to that JIRA to add a log
> statement *by the end of the day* I'll review it and get it into Hadoop 2.8
>
>
> like I said: I'll get this into hadoop-2.8 if someone is timely with the
> diff
>
>


Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-09 Thread Hari Shreedharan
+1, much better than having a new PR to fix something for scala-2.11
every time a patch breaks it.

Thanks,
Hari Shreedharan




> On Oct 9, 2015, at 11:47 AM, Michael Armbrust <mich...@databricks.com> wrote:
> 
> How about just fixing the warning? I get it; it doesn't stop this from
> happening again, but still seems less drastic than tossing out the
> whole mechanism.
> 
> +1
> 
> It also does not seem that expensive to test only compilation for Scala 2.11 
> on PR builds. 



Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Hari Shreedharan
+1. Build looks good, ran a couple apps on YARN


Thanks,
Hari

On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai yh...@databricks.com wrote:

 Sean,

 Can you add -Phive -Phive-thriftserver and try those Hive tests?

 Thanks,

 Yin
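
(For anyone following along, adding those profiles to Sean's invocation would look
roughly like the following; the exact profile set depends on the environment being
tested.)

    mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package
    mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver test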

 On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote:

 Everything checks out again, and the tests pass for me on Ubuntu +
 Java 7 with '-Pyarn -Phadoop-2.6', except that I always get
 SparkSubmitSuite errors like ...

 - success sanity check *** FAILED ***
   java.lang.RuntimeException: [download failed:
 org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed:
 commons-net#commons-net;3.1!commons-net.jar]
   at
 org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978)
   at
 org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62)
   ...

 I also can't get hive tests to pass. Is anyone else seeing anything
 like this? If not I'll assume this is something specific to the env --
 or that I don't have the build invocation just right. It's puzzling
 since it's so consistent, but I presume others' tests pass and Jenkins
 does.


 On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.4.0!
 
  The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
  22596c534a38cfdda91aef18aa9037ab101e4251
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  [published as version: 1.4.0]
  https://repository.apache.org/content/repositories/orgapachespark-/
  [published as version: 1.4.0-rc4]
  https://repository.apache.org/content/repositories/orgapachespark-1112/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
 
  Please vote on releasing this package as Apache Spark 1.4.0!
 
  The vote is open until Saturday, June 06, at 05:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.4.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == What has changed since RC3 ==
  In addition to many smaller fixes, three blocker issues were fixed:
  4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
  metadataHive get constructed too early
  6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
  78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.3 workload and running it on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.4 QA period,
  so -1 votes should only occur for significant regressions from 1.3.1.
  Bugs already present in 1.3.X, minor regressions, or bugs related
  to new features will not block this release.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Hari Shreedharan
Ah, ok. It was missing in the list of jiras. So +1.




Thanks, Hari

On Mon, Apr 6, 2015 at 11:36 AM, Patrick Wendell pwend...@gmail.com
wrote:

 I believe TD just forgot to set the fix version on the JIRA. There is
 a fix for this in 1.3:
 https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e
 - Patrick
 On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra m...@clearstorydata.com wrote:
 Is that correct, or is the JIRA just out of sync, since TD's PR was merged?
 https://github.com/apache/spark/pull/5008

 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:

 It does not look like https://issues.apache.org/jira/browse/SPARK-6222
 made it. It was targeted towards this release.




 Thanks, Hari

 On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon
 brennon.y...@capitalone.com wrote:

  +1 (non-binding)
  Tested GraphX, build infrastructure, and core test suite on OSX 10.9 w/
  Java
  1.7/1.8
  On 4/6/15, 5:21 AM, Sean Owen so...@cloudera.com wrote:
 SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just
 resolved it for 1.4 anyway. False alarm there.
 
 I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick
 it up if there's another RC, but by itself is not something that needs
 a new RC. (I will give the same treatment to branch 1.2 if needed in
 light of the 1.2.2 release.)
 
 I applied the simple change in SPARK-6205 in order to continue
 executing tests and all was well. I still see a few failures in Hive
 tests:
 
 - show_create_table_serde *** FAILED ***
 - show_tblproperties *** FAILED ***
 - udf_std *** FAILED ***
 - udf_stddev *** FAILED ***
 
 with ...
 
 mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0
 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive
 -Phive-0.13.1 -Dhadoop.version=2.6.0 test
 
 ... but these are not regressions from 1.3.0.
 
 +1 from me at this point on the current artifacts.
 
 On Sun, Apr 5, 2015 at 9:24 AM, Sean Owen so...@cloudera.com wrote:
  Signatures and hashes are good.
  LICENSE, NOTICE still check out.
  Compiles for a Hadoop 2.6 + YARN + Hive profile.
 
  I still see the UISeleniumSuite test failure observed in 1.3.0, which
  is minor and already fixed. I don't know why I didn't back-port it:
  https://issues.apache.org/jira/browse/SPARK-6205
 
  If we roll another, let's get this easy fix in, but it is only an
  issue with tests.
 
 
  On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
  all look legitimate (e.g. reopened or in progress)
 
 
  There is 1 open Blocker for 1.3.1 per Andrew:
  https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
  start even when spark was built in Windows
 
  I believe this can be resolved quickly but as a matter of hygiene
  should be fixed or demoted before release.
 
 
  FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
  examining before release to see how critical they are:
 
  SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
  application,,Open,4/3/15
  SPARK-6484,Ganglia metrics xml reporter doesn't escape
  correctly,Josh Rosen,Open,3/24/15
  SPARK-6270,Standalone Master hangs when streaming job
 completes,,Open,3/11/15
  SPARK-6209,ExecutorClassLoader can leak connections after failing to
  load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
  SPARK-5113,Audit and document use of hostnames and IP addresses in
  Spark,,Open,3/24/15
  SPARK-5098,Number of running tasks become negative after tasks
  lost,,Open,1/14/15
  SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick
  Wendell,Reopened,3/23/15
  SPARK-4922,Support dynamic allocation for coarse-grained
 Mesos,,Open,3/31/15
  SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
  instances,,Open,1/27/15
  SPARK-4879,Missing output partitions after job completes with
  speculative execution,Josh Rosen,Open,3/5/15
  SPARK-4751,Support dynamic allocation for standalone mode,Andrew
  Or,Open,12/22/14
  SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
  SPARK-4452,Shuffle data structures can starve others on the same
  thread for memory,Tianshuo Deng,Open,1/24/15
  SPARK-4352,Incorporate locality preferences in dynamic allocation
  requests,,Open,1/26/15
  SPARK-4227,Document external shuffle service,,Open,3/23/15
  SPARK-3650,Triangle Count handles reverse edges
 incorrectly,,Open,2/23/15
 
  On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com
 wrote:
  Please vote on releasing the following candidate as Apache Spark
 version 1.3.1!
 
  The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 

  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f3
 1b713ed90bcec63ebc4e530cbb69851
 
  The list of fixes present in this release can be found at:
  http://bit.ly/1C2nVPY
 
  The release files, including signatures, digests, etc. can be found
  at:
  http://people.apache.org/~pwendell/spark-1.3.1-rc1/
 
  Release

Re: PR Builder timing out due to ivy cache lock

2015-03-13 Thread Hari Shreedharan
Here you are:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull

On Fri, Mar 13, 2015 at 11:58 AM, shane knapp skn...@berkeley.edu wrote:

 link to a build, please?

 On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 Looks like something is causing the PR Builder to timeout since this
 morning with the ivy cache being locked.

 Any idea what is happening?





Re: Improving metadata in Spark JIRA

2015-02-06 Thread Hari Shreedharan
+1. Jira cleanup would be good. Please let me know if I can help in some way!




Thanks, Hari

On Fri, Feb 6, 2015 at 11:56 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:

 Do we need some new components to be added to the JIRA project?
 Like:
 - scheduler
 - YARN
 - spark-submit
 - …?
 Nick
 On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 +9000 on cleaning up JIRA.

 Thank you Sean for laying out some specific things to tackle. I will
 assist with this.

 Regarding email, I think Sandy is right. I only get JIRA email for issues
 I'm watching.

 Nick

 On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com
 wrote:

 JIRA updates don't go to this list, they go to iss...@spark.apache.org.
 I
 don't think many are signed up for that list, and those that are probably
 have a flood of emails anyway.

 So I'd definitely be in favor of any JIRA cleanup that you're up for.

 -Sandy

 On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote:

  I've wasted no time in wielding the commit bit to complete a number of
  small, uncontroversial changes. I wouldn't commit anything that didn't
  already appear to have review, consensus and little risk, but please
  let me know if anything looked a little too bold, so I can calibrate.
 
 
  Anyway, I'd like to continue some small house-cleaning by improving
  the state of JIRA's metadata, in order to let it give us a little
  clearer view on what's happening in the project:
 
  a. Add Component to every (open) issue that's missing one
  b. Review all Critical / Blocker issues to de-escalate ones that seem
  obviously neither
  c. Correct open issues that list a Fix version that has already been
  released
  d. Close all issues Resolved for a release that has already been
 released
 
  The problem with doing so is that it will create a tremendous amount
  of email to the list, like, several hundred. It's possible to make
  bulk changes and suppress e-mail though, which could be done for all
  but b.
 
  Better to suppress the emails when making such changes? or just not
  bother on some of these?
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: Welcoming three new committers

2015-02-03 Thread Hari Shreedharan
Congrats Cheng, Joseph and Owen! Well done!




Thanks, Hari

On Tue, Feb 3, 2015 at 2:55 PM, Ted Yu yuzhih...@gmail.com wrote:

 Congratulations, Cheng, Joseph and Sean.
 On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:
 Congratulations guys!

 On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia matei.zaha...@gmail.com
 wrote:

  Hi all,
 
  The PMC recently voted to add three new committers: Cheng Lian, Joseph
  Bradley and Sean Owen. All three have been major contributors to Spark in
  the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and
 many
  pieces throughout Spark Core. Join me in welcoming them as committers!
 
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 


Re: Which committers care about Kafka?

2014-12-24 Thread Hari Shreedharan
In general such discussions happen on, or are posted to, the dev lists. Could you
please post a summary? Thanks.



Thanks, Hari

On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org
wrote:

 After a long talk with Patrick and TD (thanks guys), I opened the following
 jira
 https://issues.apache.org/jira/browse/SPARK-4964
 Sample PR has an implementation for the batch and the dstream case, and a
 link to a project with example usage.
 On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote:
 yup, we at tresata do the idempotent store the same way. very simple
 approach.

 On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That KafkaRDD code is dead simple.

 Given a user specified map

 (topic1, partition0) -> (startingOffset, endingOffset)
 (topic1, partition1) -> (startingOffset, endingOffset)
 ...
 turn each one of those entries into a partition of an rdd, using the
 simple consumer.
 That's it.  No recovery logic, no state, nothing - for any failures, bail
 on the rdd and let it retry.
 Spark stays out of the business of being a distributed database.

 The client code does any transformation it wants, then stores the data and
 offsets.  There are two ways of doing this, either based on idempotence or
 a transactional data store.

 For idempotent stores:

 1. manipulate data
 2. save data to store
 3. save ending offsets to the same store

 If you fail between 2 and 3, the offsets haven't been stored, you start
 again at the same beginning offsets, do the same calculations in the same
 order, overwrite the same data, all is good.


 For transactional stores:

 1. manipulate data
 2. begin transaction
 3. save data to the store
 4. save offsets
 5. commit transaction

 If you fail before 5, the transaction rolls back.  To make this less
 heavyweight, you can write the data outside the transaction and then
 update
 a pointer to the current data inside the transaction.
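
(A hypothetical Scala sketch of the idempotent-store variant described above;
OffsetRange, store and transform here are placeholders, not an existing Spark or Kafka
API, and it assumes one output row per input message:

    case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

    // `store` stands in for any idempotent key-value sink: writing the same key with
    // the same value twice is harmless. `transform` must be deterministic.
    def processPartition(range: OffsetRange,
                         messages: Iterator[Array[Byte]],
                         store: (String, Array[Byte]) => Unit,
                         transform: Array[Byte] => Array[Byte]): Unit = {
      // 1. manipulate data (deterministically)
      val results = messages.map(transform)
      // 2. save data to the store, under keys derived from (topic, partition, offset)
      //    so that a replay overwrites the same rows instead of duplicating them
      results.zipWithIndex.foreach { case (value, i) =>
        store(s"${range.topic}-${range.partition}-${range.fromOffset + i}", value)
      }
      // 3. save the ending offset to the same store, last
      store(s"offsets-${range.topic}-${range.partition}",
            range.untilOffset.toString.getBytes("UTF-8"))
      // A failure between 2 and 3 leaves the offsets uncommitted; the replay starts at
      // the same beginning offset, redoes the same writes with the same keys, and
      // converges on the same result.
    }
)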


 Again, spark has nothing much to do with guaranteeing exactly once.  In
 fact, the current streaming api actively impedes my ability to do the
 above.  I'm just suggesting providing an api that doesn't get in the way
 of
 exactly-once.





 On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan 
 hshreedha...@cloudera.com
  wrote:

   Can you explain your basic algorithm for the once-only-delivery? It is
   quite a bit of very Kafka-specific code, which would take more time to
   read than I can currently afford. If you can explain your algorithm a bit, it
   might help.
 
  Thanks,
  Hari
 
 
  On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org
  wrote:
 
 
  The problems you guys are discussing come from trying to store state in
  spark, so don't do that.  Spark isn't a distributed database.
 
   Just map kafka partitions directly to rdds, let user code specify the
  range of offsets explicitly, and let them be in charge of committing
  offsets.
 
  Using the simple consumer isn't that bad, I'm already using this in
  production with the code I linked to, and tresata apparently has been
 as
  well.  Again, for everyone saying this is impossible, have you read
 either
  of those implementations and looked at the approach?
 
 
 
  On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara 
  sean.mcnam...@webtrends.com wrote:
 
  Please feel free to correct me if I’m wrong, but I think the exactly
  once spark streaming semantics can easily be solved using
 updateStateByKey.
  Make the key going into updateStateByKey be a hash of the event, or
 pluck
  off some uuid from the message.  The updateFunc would only emit the
 message
  if the key did not exist, and the user has complete control over the
 window
  of time / state lifecycle for detecting duplicates.  It also makes it
  really easy to detect and take action (alert?) when you DO see a
 duplicate,
  or make memory tradeoffs within an error bound using a sketch
 algorithm.
  The kafka simple consumer is insanely complex, if possible I think it
 would
  be better (and vastly more flexible) to get reliability using the
  primitives that spark so elegantly provides.
 
  Cheers,
 
  Sean
 
 
   On Dec 19, 2014, at 12:06 PM, Hari Shreedharan 
  hshreedha...@cloudera.com wrote:
  
   Hi Dibyendu,
  
   Thanks for the details on the implementation. But I still do not
  believe
   that it is no duplicates - what they achieve is that the same batch
 is
   processed exactly the same way every time (but see it may be
 processed
  more
   than once) - so it depends on the operation being idempotent. I
 believe
   Trident uses ZK to keep track of the transactions - a batch can be
   processed multiple times in failure scenarios (for example, the
  transaction
   is processed but before ZK is updated the machine fails, causing a
  new
   node to process it again).
  
   I don't think it is impossible to do this in Spark Streaming as well
  and
   I'd be really interested in working on it at some point in the near
  future.
  
   On Fri, Dec 19, 2014 at 1

Re: Which committers care about Kafka?

2014-12-19 Thread Hari Shreedharan
Hi Dibyendu,

Thanks for the details on the implementation. But I still do not believe
that it gives no duplicates - what they achieve is that the same batch is
processed exactly the same way every time (though it may be processed more
than once) - so it depends on the operation being idempotent. I believe
Trident uses ZK to keep track of the transactions - a batch can be
processed multiple times in failure scenarios (for example, the transaction
is processed but before ZK is updated the machine fails, causing a new
node to process it again).

I don't think it is impossible to do this in Spark Streaming as well and
I'd be really interested in working on it at some point in the near future.

On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:

 Hi,

 Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
 Trident has done the exactly-once guarantee by processing the tuples in a
 batch and assigning the same transaction-id for a given batch. The replay for
 a given batch with a transaction-id will have the exact same set of tuples, and
 replay of batches happens in the exact same order as before the failure.

 Having this paradigm, if the downstream system processes data for a given batch
 with a given transaction-id, and if during failure the same batch is
 again emitted, you can check if the same transaction-id is already processed
 or not and hence can guarantee exactly-once semantics.

 And this can only be achieved in Spark if we use Low Level Kafka consumer
 API to process the offsets. This low level Kafka Consumer (
 https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
 Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
 related logic has been taken from Storm-Kafka spout and which manages all
 Kafka re-balance and fault tolerant aspects and Kafka metadata managements.

 Presently this Consumer maintains that during Receiver failure, it will
 re-emit the exact same Block with same set of messages . Every message have
 the details of its partition, offset and topic related details which can
 tackle the SPARK-3146.

 As this Low Level consumer has complete control over the Kafka Offsets ,
 we can implement Trident like feature on top of it like having implement a
 transaction-id for a given block , and re-emit the same block with same set
 of message during Driver failure.

 Regards,
 Dibyendu


 On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai saisai.s...@intel.com
 wrote:

 Hi all,

 I agree with Hari that Strong exact-once semantics is very hard to
 guarantee, especially in the failure situation. From my understanding even
 current implementation of ReliableKafkaReceiver cannot fully guarantee the
 exact once semantics once failed, first is the ordering of data replaying
 from last checkpoint, this is hard to guarantee when multiple partitions
 are injected in; second is the design complexity of achieving this, you can
 refer to the Kafka Spout in Trident, we have to dig into the very details
 of Kafka metadata management system to achieve this, not to say rebalance
 and fault-tolerance.

 Thanks
 Jerry

 -Original Message-
 From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
 Sent: Friday, December 19, 2014 5:57 AM
 To: Cody Koeninger
 Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org
 Subject: Re: Which committers care about Kafka?

 But idempotency is not that easy to achieve sometimes. A strong only-once
 semantic through a proper API would be super useful; but I'm not implying
 this is easy to achieve.
 On 18 Dec 2014 21:52, Cody Koeninger c...@koeninger.org wrote:

  If the downstream store for the output data is idempotent or
  transactional, and that downstream store also is the system of record
  for kafka offsets, then you have exactly-once semantics.  Commit
  offsets with / after the data is stored.  On any failure, restart from
 the last committed offsets.
 
  Yes, this approach is biased towards the etl-like use cases rather
  than near-realtime-analytics use cases.
 
  On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan 
  hshreedha...@cloudera.com
   wrote:
  
   I get what you are saying. But getting exactly once right is an
   extremely hard problem - especially in presence of failure. The
   issue is failures
  can
   happen in a bunch of places. For example, before the notification of
   downstream store being successful reaches the receiver that updates
   the offsets, the node fails. The store was successful, but
   duplicates came in either way. This is something worth discussing by
   itself - but without uuids etc this might not really be solved even
 when you think it is.
  
   Anyway, I will look at the links. Even I am interested in all of the
   features you mentioned - no HDFS WAL for Kafka and once-only
   delivery,
  but
   I doubt the latter is really possible to guarantee - though I really
  would
   love to have that!
  
   Thanks,
   Hari
  
  
   On Thu, Dec 18, 2014

Re: Which committers care about Kafka?

2014-12-19 Thread Hari Shreedharan
Can you explain your basic algorithm for the once-only-delivery? It is quite a
bit of very Kafka-specific code, which would take more time to read than I can
currently afford. If you can explain your algorithm a bit, it might help.




Thanks, Hari

On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org
wrote:

 The problems you guys are discussing come from trying to store state in
 spark, so don't do that.  Spark isn't a distributed database.
  Just map kafka partitions directly to rdds, let user code specify the
 range of offsets explicitly, and let them be in charge of committing
 offsets.
 Using the simple consumer isn't that bad, I'm already using this in
 production with the code I linked to, and tresata apparently has been as
 well.  Again, for everyone saying this is impossible, have you read either
 of those implementations and looked at the approach?
 On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara sean.mcnam...@webtrends.com
 wrote:
 Please feel free to correct me if I’m wrong, but I think the exactly once
 spark streaming semantics can easily be solved using updateStateByKey. Make
 the key going into updateStateByKey be a hash of the event, or pluck off
 some uuid from the message.  The updateFunc would only emit the message if
 the key did not exist, and the user has complete control over the window of
 time / state lifecycle for detecting duplicates.  It also makes it really
 easy to detect and take action (alert?) when you DO see a duplicate, or
 make memory tradeoffs within an error bound using a sketch algorithm.  The
 kafka simple consumer is insanely complex, if possible I think it would be
 better (and vastly more flexible) to get reliability using the primitives
 that spark so elegantly provides.

 Cheers,

 Sean
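
(A minimal, hypothetical Scala sketch of the updateStateByKey dedup idea described
above; the hashing choice and emit logic are assumptions rather than an existing API,
and it leaves out the state-expiry/windowing concerns Sean mentions:

    import org.apache.spark.streaming.StreamingContext._  // pair DStream ops (Spark 1.x)
    import org.apache.spark.streaming.dstream.DStream

    // Key each event by a hash/uuid, keep (event, isNewThisBatch) as state, and emit
    // an event only on the batch where its key first appears. Requires checkpointing.
    def dedup(events: DStream[String]): DStream[String] = {
      val keyed = events.map(e => (e.hashCode.toString, e))  // assumption: hash is a sufficient key

      def update(newValues: Seq[String],
                 state: Option[(String, Boolean)]): Option[(String, Boolean)] = state match {
        case Some((event, _)) => Some((event, false))                      // seen before: duplicate
        case None             => newValues.headOption.map(e => (e, true))  // first sighting: new
      }

      keyed.updateStateByKey(update).filter(_._2._2).map(_._2._1)
    }
)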


  On Dec 19, 2014, at 12:06 PM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:
 
  Hi Dibyendu,
 
  Thanks for the details on the implementation. But I still do not believe
  that it is no duplicates - what they achieve is that the same batch is
  processed exactly the same way every time (but see it may be processed
 more
  than once) - so it depends on the operation being idempotent. I believe
  Trident uses ZK to keep track of the transactions - a batch can be
  processed multiple times in failure scenarios (for example, the
 transaction
  is processed but before ZK is updated the machine fails, causing a new
  node to process it again).
 
  I don't think it is impossible to do this in Spark Streaming as well and
  I'd be really interested in working on it at some point in the near
 future.
 
  On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya 
  dibyendu.bhattach...@gmail.com wrote:
 
  Hi,
 
  Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm
  Trident has done the exact-once guarantee by processing the tuple in a
  batch  and assigning same transaction-id for a given batch . The replay
 for
  a given batch with a transaction-id will have exact same set of tuples
 and
  replay of batches happen in exact same order before the failure.
 
  Having this paradigm, if downstream system process data for a given
 batch
  for having a given transaction-id , and if during failure if same batch
 is
  again emitted , you can check if same transaction-id is already
 processed
  or not and hence can guarantee exact once semantics.
 
  And this can only be achieved in Spark if we use Low Level Kafka
 consumer
  API to process the offsets. This low level Kafka Consumer (
  https://github.com/dibbhatt/kafka-spark-consumer) has implemented the
  Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka
  related logic has been taken from Storm-Kafka spout and which manages
 all
  Kafka re-balance and fault tolerant aspects and Kafka metadata
 managements.
 
  Presently this Consumer maintains that during Receiver failure, it will
  re-emit the exact same Block with same set of messages . Every message
 have
  the details of its partition, offset and topic related details which can
  tackle the SPARK-3146.
 
  As this Low Level consumer has complete control over the Kafka Offsets ,
  we can implement Trident like feature on top of it like having
 implement a
  transaction-id for a given block , and re-emit the same block with same
 set
  of message during Driver failure.
 
  Regards,
  Dibyendu
 
 
  On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai saisai.s...@intel.com
  wrote:
 
  Hi all,
 
  I agree with Hari that Strong exact-once semantics is very hard to
  guarantee, especially in the failure situation. From my understanding
 even
  current implementation of ReliableKafkaReceiver cannot fully guarantee
 the
  exact once semantics once failed, first is the ordering of data
 replaying
  from last checkpoint, this is hard to guarantee when multiple
 partitions
  are injected in; second is the design complexity of achieving this,
 you can
  refer to the Kafka Spout in Trident, we have to dig into the very
 details
  of Kafka

Re: Which committers care about Kafka?

2014-12-18 Thread Hari Shreedharan
I get what you are saying. But getting exactly once right is an extremely hard 
problem - especially in the presence of failure. The issue is that failures can happen 
in a bunch of places. For example, before the notification of downstream store 
being successful reaches the receiver that updates the offsets, the node fails. 
The store was successful, but duplicates came in either way. This is something 
worth discussing by itself - but without uuids etc this might not really be 
solved even when you think it is.




Anyway, I will look at the links. Even I am interested in all of the features 
you mentioned - no HDFS WAL for Kafka and once-only delivery, but I doubt the 
latter is really possible to guarantee - though I really would love to have 
that!




Thanks, Hari

On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger c...@koeninger.org
wrote:

 Thanks for the replies.
 Regarding skipping WAL, it's not just about optimization.  If you actually
 want exactly-once semantics, you need control of kafka offsets as well,
 including the ability to not use zookeeper as the system of record for
 offsets.  Kafka already is a reliable system that has strong ordering
 guarantees (within a partition) and does not mandate the use of zookeeper
 to store offsets.  I think there should be a spark api that acts as a very
 simple intermediary between Kafka and the user's choice of downstream store.
 Take a look at the links I posted - if there's already been 2 independent
 implementations of the idea, chances are it's something people need.
 On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 Hi Cody,

 I am an absolute +1 on SPARK-3146. I think we can implement something
 pretty simple and lightweight for that one.

 For the Kafka DStream skipping the WAL implementation - this is something
 I discussed with TD a few weeks ago. Though it is a good idea to implement
 this to avoid unnecessary HDFS writes, it is an optimization. For that
 reason, we must be careful in the implementation. There are a couple of issues
 that we need to ensure work properly - specifically ordering. To ensure we
 pull messages from different topics and partitions in the same order after
 failure, we’d still have to persist the metadata to HDFS (or some other
 system) - this metadata must contain the order of messages consumed, so we
 know how to re-read the messages. I am planning to explore this once I have
 some time (probably in Jan). In addition, we must also ensure bucketing
 functions work fine as well. I will file a placeholder jira for this one.

 I also wrote an API to write data back to Kafka a while back -
 https://github.com/apache/spark/pull/2994 . I am hoping that this will
 get pulled in soon, as this is something I know people want. I am open to
 feedback on that - anything that I can do to make it better.

 Thanks,
 Hari


 On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Cody,

 Thanks for reaching out with this. The lead on streaming is TD - he is
 traveling this week though so I can respond a bit. To the high level
 point of whether Kafka is important - it definitely is. Something like
 80% of Spark Streaming deployments (anecdotally) ingest data from
 Kafka. Also, good support for Kafka is something we generally want in
 Spark and not a library. In some cases IIRC there were user libraries
 that used unstable Kafka API's and we were somewhat waiting on Kafka
 to stabilize them to merge things upstream. Otherwise users wouldn't
 be able to use newer Kafka versions. This is a high level impression
 only though, I haven't talked to TD about this recently so it's worth
 revisiting given the developments in Kafka.

 Please do bring things up like this on the dev list if there are
 blockers for your usage - thanks for pinging it.

 - Patrick

 On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger c...@koeninger.org
 wrote:
  Now that 1.2 is finalized... who are the go-to people to get some
  long-standing Kafka related issues resolved?
 
  The existing api is not sufficiently safe nor flexible for our
 production
  use. I don't think we're alone in this viewpoint, because I've seen
  several different patches and libraries to fix the same things we've
 been
  running into.
 
  Regarding flexibility
 
  https://issues.apache.org/jira/browse/SPARK-3146
 
  has been outstanding since August, and IMHO an equivalent of this is
  absolutely necessary. We wrote a similar patch ourselves, then found
 that
  PR and have been running it in production. We wouldn't be able to get
 our
  jobs done without it. It also allows users to solve a whole class of
  problems for themselves (e.g. SPARK-2388, arbitrary delay of messages,
 etc).
 
  Regarding safety, I understand the motivation behind WriteAheadLog as a
  general solution for streaming unreliable sources, but Kafka already is
 a
  reliable source. I think there's a need for an api that treats it as
  such. Even aside from the performance

Re: Has anyone else observed this build break?

2014-11-14 Thread Hari Shreedharan
Seems like a comment on that page mentions a fix, which would add yet another 
profile though — specifically telling mvn that if it is an apple jdk, use the 
classes.jar as the tools.jar as well, since Apple-packaged JDK 6 bundled them 
together.




Link: http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320


I didn’t test it, but maybe this can fix it?


Thanks,
Hari

On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell pwend...@gmail.com
wrote:

 A work around for this fix is identified here:
 http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html
 However, if this affects more users I'd prefer to just fix it properly
 in our build.
 On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote:
 A recent patch broke clean builds for me, I am trying to see how
 widespread this issue is and whether we need to revert the patch.

 The error I've seen is this when building the examples project:

 spark-examples_2.10: Could not resolve dependencies for project
 org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not
 find artifact jdk.tools:jdk.tools:jar:1.7 at specified path
 /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar

 The reason for this error is that hbase-annotations is using a
 system scoped dependency in their hbase-annotations pom, and this
 doesn't work with certain JDK layouts such as that provided on Mac OS:

 http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom

 Has anyone else seen this or is it just me?

 - Patrick
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

Re: Spark-Submit issues

2014-11-12 Thread Hari Shreedharan
Yep, you’d need to shade jars to ensure all your dependencies are in the 
classpath.


Thanks,
Hari

On Wed, Nov 12, 2014 at 3:23 AM, Ted Malaska ted.mala...@cloudera.com
wrote:

 Hey this is Ted
 Are you using Shade when you build your jar and are you using the bigger
 jar?  Looks like classes are not included in your jar.
 On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson 
 jeniba.john...@lntinfotech.com wrote:
 Hi Hari,

 Now I am trying out the same FlumeEventCount example running with
 spark-submit instead of run-example. The steps I followed: I have
 exported JavaFlumeEventCount.java into a jar.

 The command used is
 ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar
 --master local --class org.JavaFlumeEventCount  bin/flumeeventcnt2.jar
 localhost 2323

 The output is
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 If I use this command
  ./bin/spark-submit --master local --class org.JavaFlumeEventCount
 bin/flumeeventcnt2.jar  localhost 2323

 Then I get an error
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/spark/examples/streaming/StreamingExamples
 at org.JavaFlumeEventCount.main(JavaFlumeEventCount.java:22)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.examples.streaming.StreamingExamples
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 ... 8 more


 I just wanted to ask: it is able to find spark-assembly.jar, but
 why not spark-examples.jar?
 The next doubt is  while running FlumeEventCount example through runexample

 I get an output as
 Received 4 flume events.

 14/11/12 18:30:14 INFO scheduler.JobScheduler: Finished job streaming job
 1415797214000 ms.0 from job set of time 1415797214000 ms
 14/11/12 18:30:14 INFO rdd.MappedRDD: Removing RDD 70 from persistence list

 But If I run the same program through Spark-Submit

 I get an output as
 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1
 blocks
 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time
 1415795102000

 So I need a clarification: in the program the print statement is
 written as "Received n flume events", so how come I am seeing
 "Stream 0 received n blocks"?
 And what is the difference between running the program through spark-submit and
 run-example?

 Awaiting for your kind reply

 Regards,
 Jeniba Johnson


 
 The contents of this e-mail and any attachment(s) may contain confidential
 or privileged information for the intended recipient(s). Unintended
 recipients are prohibited from taking action on the basis of information in
 this e-mail and using or disseminating the information, and must notify the
 sender and delete it from their system. LT Infotech will not accept
 responsibility or liability for the accuracy or completeness of, or the
 presence of any virus or disabling code in this e-mail


Re: Bind exception while running FlumeEventCount

2014-11-10 Thread Hari Shreedharan
Looks like that port is not available because another app is using that port. 
Can you take a look at netstat -a and use a port that is free?


Thanks,
Hari

On Fri, Nov 7, 2014 at 2:05 PM, Jeniba Johnson
jeniba.john...@lntinfotech.com wrote:

 Hi,
 I have installed spark-1.1.0 and apache flume 1.4 for running the streaming 
 example FlumeEventCount. Previously the code was working fine. Now I am facing 
 the below mentioned issues. My flume is running properly; it is able to 
 write the file.
 The command I use is
 bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 
 172.29.17.178  65001
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
 with message: Error starting receiver 0: 
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 14/11/07 23:19:23 INFO flume.FlumeReceiver: Flume receiver stopped
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Deregistering 
 receiver 0
 14/11/07 23:19:23 ERROR scheduler.ReceiverTracker: Deregistered receiver for 
 stream 0: Error starting receiver 0 - 
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 at 
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:68)
 at 
 org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:164)
 at 
 org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:171)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
 at 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
 at 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.net.BindException: Address already in use
 at sun.nio.ch.Net.bind0(Native Method)
 at sun.nio.ch.Net.bind(Net.java:344)
 at sun.nio.ch.Net.bind(Net.java:336)
 at 
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
 at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
 at 
 org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
 at 
 org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
 ... 3 more
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopping BlockGenerator
 14/11/07 23:19:23 INFO util.RecurringTimer: Stopped timer for BlockGenerator 
 after time 1415382563200
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Waiting for block pushing 
 thread
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Pushing out the last 0 blocks
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped block pushing thread
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped BlockGenerator
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Waiting for executor 
 stop is over
 14/11/07 23:19:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor 
 with error: org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 14/11/07 23:19:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 
 (TID 0)
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 at 
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119)
 at 

RE: Bind exception while running FlumeEventCount

2014-11-10 Thread Hari Shreedharan
First, can you try a different port?




TIME_WAIT is basically a timeout for a socket to be completely decommissioned 
for the port to be available for binding. If you wait for a few minutes and 
still see a startup issue, can you also send the error logs? From what I 
can see, the port seems to be in use.


Thanks,
Hari

Re: Bind exception while running FlumeEventCount

2014-11-10 Thread Hari Shreedharan
Did you start a Flume agent to push data to the relevant port?


Thanks,
Hari

On Fri, Nov 7, 2014 at 2:05 PM, Jeniba Johnson
jeniba.john...@lntinfotech.com wrote:

 Hi,
 I have installed spark-1.1.0 and apache flume 1.4 for running the streaming 
 example FlumeEventCount. Previously the code was working fine. Now I am facing 
 the below mentioned issues. My flume is running properly; it is able to 
 write the file.
 The command I use is
 bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 
 172.29.17.178  65001
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopping receiver 
 with message: Error starting receiver 0: 
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 14/11/07 23:19:23 INFO flume.FlumeReceiver: Flume receiver stopped
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Deregistering 
 receiver 0
 14/11/07 23:19:23 ERROR scheduler.ReceiverTracker: Deregistered receiver for 
 stream 0: Error starting receiver 0 - 
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 at 
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:68)
 at 
 org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:164)
 at 
 org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:171)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
 at 
 org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
 at 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
 at 
 org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.net.BindException: Address already in use
 at sun.nio.ch.Net.bind0(Native Method)
 at sun.nio.ch.Net.bind(Net.java:344)
 at sun.nio.ch.Net.bind(Net.java:336)
 at 
 sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199)
 at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
 at 
 org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
 at 
 org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
 at 
 org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
 ... 3 more
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopping BlockGenerator
 14/11/07 23:19:23 INFO util.RecurringTimer: Stopped timer for BlockGenerator 
 after time 1415382563200
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Waiting for block pushing 
 thread
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Pushing out the last 0 blocks
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped block pushing thread
 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped BlockGenerator
 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Waiting for executor 
 stop is over
 14/11/07 23:19:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor 
 with error: org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 14/11/07 23:19:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 
 (TID 0)
 org.jboss.netty.channel.ChannelException: Failed to bind to: 
 /172.29.17.178:65001
 at 
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119)
 at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74)
 at 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Hari Shreedharan
In Cloudstack, I believe one becomes a maintainer first for a subset of 
modules, before he/she becomes a proven maintainer who has commit rights on 
the entire source tree. 




So would it make sense to go that route, and have committers voted in as 
maintainers for certain parts of the codebase and then eventually become proven 
maintainers (though this might have to be honor-code based, since I don't think 
git allows per-module commit rights)?


Thanks,
Hari

On Thu, Nov 6, 2014 at 3:45 PM, Patrick Wendell pwend...@gmail.com
wrote:

 I think new committers might or might not be maintainers (it would
 depend on the PMC vote). I don't think it would affect what you could
 merge: you can merge in any part of the source tree; you just need to
 get sign-off if you want to touch a public API or make major
 architectural changes. Most projects already require code review from
 other committers before you commit something, so this is just a version
 of that where specific people are appointed to specific components for
 review.
 If you look, most large software projects have a maintainer model,
 both in Apache and outside of it. Cloudstack is probably the best
 example in Apache, since they are (roughly) the second most active
 project after Spark. They have two levels of maintainers and much
 stronger language - in their words: "In general, maintainers only have
 commit rights on the module for which they are responsible."
 I'd like us to start with something simpler and more lightweight, as
 proposed here. Really, the proposal on the table is just to codify the
 current de-facto process to make sure we stick by it as we scale. If
 we want to add more formality or strictness, we can do it later.
 - Patrick
 On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
 How would this model work with a new committer who gets voted in? Does it 
 mean that a new committer would be a maintainer for at least one area? 
 Otherwise we could end up having committers who really can't merge anything 
 significant until they become maintainers.


 Thanks,
 Hari

 On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 I think you're misunderstanding the idea of process here. The point of 
 process is to make sure something happens automatically, which is useful to 
 ensure a certain level of quality. For example, all our patches go through 
 Jenkins, and nobody will make the mistake of merging them if they fail 
 tests, or RAT checks, or API compatibility checks. The idea is to get the 
 same kind of automation for design on these components. This is a very 
 common process for large software projects, and it's essentially what we 
 had already, but formalizing it will make clear that this is the process we 
 want. It's important to do it early in order to be able to refine the 
 process as the project grows.
 In terms of scope, again, the maintainers are *not* going to be the only 
 reviewers for that component, they are just a second level of sign-off 
 required for architecture and API. Being a maintainer is also not a 
 promotion, it's a responsibility. Since we don't have much experience yet 
 with this model, I didn't propose automatic rules beyond that the PMC can 
 add / remove maintainers -- presumably the PMC is in the best position to 
 know what the project needs. I think automatic rules are exactly the kind 
 of process you're arguing against. The process here is about ensuring 
 certain checks are made for every code change, not about automating 
 personnel and development decisions.
 In any case, I appreciate your input on this, and we're going to evaluate 
 the model to see how it goes. It might be that we decide we don't want it 
 at all. However, from what I've seen of other projects (not Hadoop but 
 projects with an order of magnitude more contributors, like Python or 
 Linux), this is one of the best ways to have consistently great releases 
 with a large contributor base and little room for error. With all due 
 respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best 
 example to strive for; in my experience there I've seen patches reverted 
 because of architectural disagreements, new APIs released and abandoned, 
 and generally an experience that's been painful for users. A lot of the 
 decisions we've made in Spark (e.g. time-based release cycle, built-in 
 libraries, API stability rules, etc) were based on lessons learned there, 
 in an attempt to define a better model.
 Matei
 On Nov 6, 2014, at 2:18 PM, bc Wong bcwal...@cloudera.com wrote:

 On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 snip
 Ultimately, the core motivation is that the project has grown to the point 
 where it's hard to expect every committer to have full understanding of 
 every component. Some committers know a ton about systems but little about 
 machine learning, some are algorithmic whizzes but may

Re: Build fails on master (f90ad5d)

2014-11-04 Thread Hari Shreedharan
I have seen this on sbt sometimes. I usually do an sbt clean and that fixes it.
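
For reference, a minimal sequence that usually clears the stale incremental 
compile state looks like the following (assuming you are in the Spark source 
root and using the bundled sbt script; the path reflects the current layout):

# sbt/sbt downloads its own launcher jar on first run, so no separate
# sbt installation is needed.
./sbt/sbt clean
./sbt/sbt compile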


Thanks,
Hari

On Tue, Nov 4, 2014 at 3:13 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:

 FWIW, the official build instructions are here:
 https://github.com/apache/spark#building-spark
 On Tue, Nov 4, 2014 at 5:11 PM, Ted Yu yuzhih...@gmail.com wrote:
 I built based on this commit today and the build was successful.

 What command did you use ?

 Cheers

 On Tue, Nov 4, 2014 at 2:08 PM, Alessandro Baretta alexbare...@gmail.com
 wrote:

  Fellow Sparkers,
 
  I am new here and still trying to learn to crawl. Please, bear with me.
 
  I just pulled f90ad5d from https://github.com/apache/spark.git and am
  running the compile command in the sbt shell. This is the error I'm
 seeing:
 
  [error]
 
 
 /home/alex/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala:32:
  object sql is not a member of package org.apache.spark
  [error] import org.apache.spark.sql.catalyst.types._
  [error] ^
 
  Am I doing something obscenely stupid, or is the build genuinely broken?
 
  Alex
 


Re: Moving PR Builder to mvn

2014-10-24 Thread Hari Shreedharan
I have a zinc server running on my Mac, and Maven compilation is much faster 
than before I had it running. Is the sbt build still faster? (Sorry, it has been 
a long time since I did a build with sbt.)
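
In case anyone wants to try the same setup, this is roughly what it amounts to; 
treat it as a sketch (the Homebrew formula name and the assumption that Spark's 
Maven build picks up a running zinc server automatically are from memory):

# Install and start a zinc incremental-compile server (Homebrew on OS X).
brew install zinc
zinc -start
# With the server running, rebuild as usual; shut zinc down when done.
mvn -DskipTests clean package
zinc -shutdown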


Thanks,
Hari

On Fri, Oct 24, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Overall I think this would be a good idea. The main blocker is just
 that I think the Maven build is much slower right now than the SBT
 build. However, if we were able to e.g. parallelize the test build on
 Jenkins that might make up for it.
 I'd actually like to have a trigger where we could test pull requests
 with either one.
 - Patrick
 On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
 Over the last few months, it seems like we have selected Maven to be the 
 official build system for Spark.


 I realize that removing the sbt build may not be easy, but it might be a 
 good idea to start looking into that. We had issues over the past few days 
 where mvn builds were fine, while sbt was failing to resolve dependencies 
 which were test-jars causing compilation of certain tests to fail.


 As a first step, I am wondering if it might be a good idea to change the PR 
 builder to mvn and test PRs consistent with the way we test releases. I am 
 not sure how technically feasible it is, but it would be a start to 
 standardizing on one build system.

 Thanks,
 Hari

Re: Moving PR Builder to mvn

2014-10-24 Thread Hari Shreedharan
+1. From what I can see, it definitely does - though I must say I rarely do 
full end-to-end builds. Maybe it's worth running as an experiment?
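
A rough way to run that experiment locally might look like the following 
(commands sketched from memory; timings obviously depend on the machine):

# Fresh Maven build with a zinc server running:
zinc -start
time mvn -DskipTests clean package
zinc -shutdown
# Fresh Maven build without zinc:
time mvn -DskipTests clean package
# Fresh sbt build for comparison:
time ./sbt/sbt clean assembly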


Thanks,
Hari

On Fri, Oct 24, 2014 at 2:34 PM, Stephen Boesch java...@gmail.com wrote:

 Zinc absolutely helps - it feels like it makes builds more than twice as fast -
 both on Mac and Linux. It helps both on fresh and existing builds.
 2014-10-24 14:06 GMT-07:00 Patrick Wendell pwend...@gmail.com:
 Does Zinc still help if you are just running a single totally fresh
 build? For the pull request builder we purge all state from previous
 builds.

 - Patrick

 On Fri, Oct 24, 2014 at 1:55 PM, Hari Shreedharan
 hshreedha...@cloudera.com wrote:
  I have zinc server running on my mac, and I see maven compilation to be
 much
  better than before I had it running. Is the sbt build still faster
 (sorry,
  long time since I did a build with sbt).
 
  Thanks,
  Hari
 
 
  On Fri, Oct 24, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Overall I think this would be a good idea. The main blocker is just
  that I think the Maven build is much slower right now than the SBT
  build. However, if we were able to e.g. parallelize the test build on
  Jenkins that might make up for it.
 
  I'd actually like to have a trigger where we could test pull requests
  with either one.
 
  - Patrick
 
  On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan
  hshreedha...@cloudera.com wrote:
   Over the last few months, it seems like we have selected Maven to be
 the
   official build system for Spark.
  
  
   I realize that removing the sbt build may not be easy, but it might
 be a
   good idea to start looking into that. We had issues over the past few
 days
   where mvn builds were fine, while sbt was failing to resolve
 dependencies
   which were test-jars causing compilation of certain tests to fail.
  
  
   As a first step, I am wondering if it might be a good idea to change
 the
   PR builder to mvn and test PRs consistent with the way we test
 releases. I
   am not sure how technically feasible it is, but it would be a start to
   standardizing on one build system.
  
   Thanks,
   Hari
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building and Running Spark on OS X

2014-10-20 Thread Hari Shreedharan
The sbt executable that is in the Spark repo can be used to build Spark without 
any other setup (it will download the sbt jars etc.).
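
In other words, something like this should work from a fresh clone with no 
prior sbt installation (script path assumed from the current repo layout):

# The bundled script fetches the sbt launcher jar on first use.
cd spark
./sbt/sbt assembly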


Thanks,
Hari

On Mon, Oct 20, 2014 at 5:16 PM, Sean Owen so...@cloudera.com wrote:

 Maven is at least built in to OS X (well, with dev tools). You don't
 even have to brew install it. Surely SBT isn't in the dev tools even?
 I recall I had to install it. I'd be surprised to hear it required
 zero setup.
 On Mon, Oct 20, 2014 at 8:04 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 Yeah, I would use sbt too, but I thought that if I wanted to publish a little
 reference page for OS X users then I probably should use the “official”
 build instructions (https://github.com/apache/spark#building-spark).

 Nick


 On Mon, Oct 20, 2014 at 8:00 PM, Reynold Xin r...@databricks.com wrote:

 I usually use SBT on Mac and that one doesn't require any setup ...


 On Mon, Oct 20, 2014 at 4:43 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 If one were to put together a short but comprehensive guide to setting up
 Spark to run locally on OS X, would it look like this?

 # Install Maven. On OS X, we suggest using Homebrew.
 brew install maven
 # Set some important Java and Maven environment variables.
 export JAVA_HOME=$(/usr/libexec/java_home)
 export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
 # Go to where you downloaded the Spark source.
 cd ./spark
 # Build, configure slaves, and start up Spark.
 mvn -DskipTests clean package
 echo localhost > ./conf/slaves
 ./sbin/start-all.sh
 # Rock 'n' Roll.
 ./bin/pyspark
 # Cleanup when you're done.
 ./sbin/stop-all.sh

 Nick




 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Hari Shreedharan
Sean - I think the changes in 1726 alone should be enough. It is weird that any 
class that uses the test-jar actually requires the streaming jar to be 
added explicitly. Shouldn't Maven take care of this?


I posted some comments on the PR.

--

Thanks,
Hari



Sean Owen so...@cloudera.com
August 22, 2014 at 3:58 PM
Yes, master hasn't compiled for me for a few days. It's fixed in:

https://github.com/apache/spark/pull/1726
https://github.com/apache/spark/pull/2075

Could a committer sort this out?

Sean


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Ted Yu yuzhih...@gmail.com
August 22, 2014 at 1:55 PM
Hi,
Using the following command on (refreshed) master branch:
mvn clean package -DskipTests

I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference.
A signature in TestSuiteBase.class refers to term dstream
in package org.apache.spark.streaming which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
TestSuiteBase.class.
at
scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
at
scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
at
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
at scala.collection.immutable.List.forall(List.scala:84)
at 
scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)

at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)

at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at 
xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)

at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
at xsbt.Message$$anon$1.apply(Message.scala:8)
at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
at scala.collection.immutable.List.foreach(List.scala:318)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
at xsbt.API$ApiPhase.processUnit(API.scala:38)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 

Re: Spark Avro Generation

2014-08-11 Thread Hari Shreedharan
Jay, running sbt compile or assembly should generate the sources.
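
For example (the scoped project id below is an assumption - adjust it to 
whatever the sbt module for the flume sink is actually called):

# Building everything regenerates the Avro sources as part of compilation;
# generated classes such as SparkFlumeProtocol should end up under the
# module's target directory (src_managed).
./sbt/sbt compile
# Or scope it to just the flume sink module:
./sbt/sbt streaming-flume-sink/compile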

On Monday, August 11, 2014, Devl Devel devl.developm...@gmail.com wrote:

 Hi

 So far I've been managing to build Spark from source but since a change in
 spark-streaming-flume I have no idea how to generate classes (e.g.
 SparkFlumeProtocol) from the avro schema.

 I have used sbt to run avro:generate (from the top level spark dir) but it
 produces nothing - it just says:

  avro:generate
 [success] Total time: 0 s, completed Aug 11, 2014 12:26:49 PM.

 Please can someone send me their build.sbt or just tell me how to build
 spark so that all avro files get generated as well?

 Sorry for the noob question, but I really have tried my best on this one!
 Cheers



Re: sbt/sbt test steals window focus on OS X

2014-07-20 Thread Hari Shreedharan
Add this to your .bash_profile (or .bashrc) - that will fix it.

export _JAVA_OPTIONS=-Djava.awt.headless=true
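
To apply it without opening a new terminal, something like this should do 
(a quick sketch):

# Reload the profile and confirm the option is visible to child JVMs.
source ~/.bash_profile
echo $_JAVA_OPTIONS
# The tests should no longer pop a Java app into the foreground.
sbt/sbt test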


Hari


On Sun, Jul 20, 2014 at 1:56 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I just created SPARK-2602 (https://issues.apache.org/jira/browse/SPARK-2602)
 to track this issue.

 Are there others who can confirm this is an issue? Also, does this issue
 extend to other OSes?

 Nick