Re: SPARK-13843 and future of streaming backends
I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they see fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?). Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos. Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue. Thanks, Hari On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan wrote: > I am not referring to code edits - but to migrating submodules and > code currently in Apache Spark to 'outside' of it. > If I understand correctly, assets from Apache Spark are being moved > out of it into third-party external repositories - not owned by Apache. > > At a minimum, a dev@ discussion (like this one) should be initiated. > As the PMC is responsible for the project assets (including code), signoff > is required for it IMO. > > More experienced Apache members might opine better in case I got it > wrong! > > > Regards, > Mridul > > > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger > wrote: > > Why would a PMC vote be necessary on every code deletion? > > > > There was a Jira and pull request discussion about the submodules that > > have been removed so far. > > > > https://issues.apache.org/jira/browse/SPARK-13843 > > > > There's another ongoing one about Kafka specifically > > > > https://issues.apache.org/jira/browse/SPARK-13877 > > > > > > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan > wrote: > >> > >> I was not aware of a discussion on the dev list about this - agree with > most of > >> the observations. > >> In addition, I did not see PMC signoff on moving (sub-)modules out. > >> > >> Regards > >> Mridul > >> > >> > >> > >> On Thursday, March 17, 2016, Marcelo Vanzin > wrote: > >>> > >>> Hello all, > >>> > >>> Recently a lot of the streaming backends were moved to a separate > >>> project on github and removed from the main Spark repo. > >>> > >>> While I think the idea is great, I'm a little worried about the > >>> execution. Some concerns were already raised on the bug mentioned > >>> above, but I'd like to have a more explicit discussion about this so > >>> things don't fall through the cracks. > >>> > >>> Mainly I have three concerns. > >>> > >>> i. Ownership > >>> > >>> That code used to be run by the ASF, but now it's hosted in a github > >>> repo not owned by the ASF. That sounds a little sub-optimal, if not > >>> problematic. > >>> > >>> ii. Governance > >>> > >>> Similar to the above; who has commit access to the above repos? Will > >>> all the Spark committers, present and future, have commit access to > >>> all of those repos? Are they still going to be considered part of > >>> Spark and have release management done through the Spark community? > >>> > >>> > >>> For both of the questions above, why are they not turned into > >>> sub-projects of Spark and hosted on the ASF repos? 
I believe there is > >>> a mechanism to do that, without the need to keep the code in the main > >>> Spark repo, right? > >>> > >>> iii. Usability > >>> > >>> This is another thing I don't see discussed. For Scala-based code > >>> things don't change much, I guess, if the artifact names don't change > >>> (another reason to keep things in the ASF?), but what about python? > >>> How are pyspark users expected to get that code going forward, since > >>> it's not in Spark's pyspark.zip anymore? > >>> > >>> > >>> Is there an easy way of keeping these things within the ASF Spark > >>> project? I think that would be better for everybody. > >>> > >>> -- > >>> Marcelo > >>> > >>> - > >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > >>> For additional commands, e-mail: dev-h...@spark.apache.org > >>> > >> > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: Long running Spark job on YARN throws "No AMRMToken"
The credentials file approach (using a keytab for Spark apps) will only update HDFS tokens. YARN's AMRM tokens should be taken care of by YARN internally. Steve - correct me if I am wrong here: if the AMRM tokens are disappearing it might be a YARN bug (does the AMRM token have a 7-day limit as well? I thought that was only for HDFS). Thanks, Hari On Tue, Feb 9, 2016 at 8:44 AM, Steve Loughran wrote: > > On 9 Feb 2016, at 11:26, Steve Loughran wrote: > > > On 9 Feb 2016, at 05:55, Prabhu Joseph wrote: > > + Spark-Dev > > On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph > wrote: > >> Hi All, >> >> A long running Spark job on YARN throws the below exception after running >> for a few days. >> >> yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row. >> org.apache.hadoop.yarn.exceptions.YarnException: *No AMRMToken found* for >> user prabhu at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) >> >> Do any of the below renew the AMRMToken and solve the issue? >> >> 1. yarn.resourcemanager.delegation.token.max-lifetime increase from 7 days >> >> 2. Configuring Proxy user: >> >> hadoop.proxyuser.yarn.hosts * >> >> hadoop.proxyuser.yarn.groups * >> >> > > wouldn't do that: security issues > > >> 3. Can Spark-1.4.0 handle it with the fix >> https://issues.apache.org/jira/browse/SPARK-5342 >> >> spark.yarn.credentials.file >> >> >> > I'll say "maybe" there > > > uprated to a no, having looked at the code more > > > How to renew the AMRMToken for a long running job on YARN? >> >> >> > > AMRM token renewal should be automatic in the AM; YARN sends a message to the > AM (actually an allocate() response with no containers but a new token at > the tail of the message). > > I don't see any logging in the Hadoop code there (AMRMClientImpl); filed > YARN-4682 to add a log statement > > if someone other than me were to supply a patch to that JIRA to add a log > statement *by the end of the day* I'll review it and get it into Hadoop 2.8 > > > like I said: I'll get this into hadoop-2.8 if someone is timely with the > diff > >
Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?
+1, much better than having a new PR to fix something for Scala 2.11 every time a patch breaks it. Thanks, Hari Shreedharan > On Oct 9, 2015, at 11:47 AM, Michael Armbrust <mich...@databricks.com> wrote: > > How about just fixing the warning? I get it; it doesn't stop this from > happening again, but still seems less drastic than tossing out the > whole mechanism. > > +1 > > It also does not seem that expensive to test only compilation for Scala 2.11 > on PR builds.
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1. Build looks good, ran a couple apps on YARN Thanks, Hari On Fri, Jun 5, 2015 at 10:52 AM, Yin Huai yh...@databricks.com wrote: Sean, Can you add -Phive -Phive-thriftserver and try those Hive tests? Thanks, Yin On Fri, Jun 5, 2015 at 5:19 AM, Sean Owen so...@cloudera.com wrote: Everything checks out again, and the tests pass for me on Ubuntu + Java 7 with '-Pyarn -Phadoop-2.6', except that I always get SparkSubmitSuite errors like ... - success sanity check *** FAILED *** java.lang.RuntimeException: [download failed: org.jboss.netty#netty;3.2.2.Final!netty.jar(bundle), download failed: commons-net#commons-net;3.1!commons-net.jar] at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:978) at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$3.apply(IsolatedClientLoader.scala:62) ... I also can't get hive tests to pass. Is anyone else seeing anything like this? if not I'll assume this is something specific to the env -- or that I don't have the build invocation just right. It's puzzling since it's so consistent, but I presume others' tests pass and Jenkins does. On Wed, Jun 3, 2015 at 5:53 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 22596c534a38cfdda91aef18aa9037ab101e4251 The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: [published as version: 1.4.0] https://repository.apache.org/content/repositories/orgapachespark-/ [published as version: 1.4.0-rc4] https://repository.apache.org/content/repositories/orgapachespark-1112/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == What has changed since RC3 == In addition to may smaller fixes, three blocker issues were fixed: 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise() 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1
Ah, ok. It was missing in the list of jiras. So +1. Thanks, Hari On Mon, Apr 6, 2015 at 11:36 AM, Patrick Wendell pwend...@gmail.com wrote: I believe TD just forgot to set the fix version on the JIRA. There is a fix for this in 1.3: https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e - Patrick On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra m...@clearstorydata.com wrote: Is that correct, or is the JIRA just out of sync, since TD's PR was merged? https://github.com/apache/spark/pull/5008 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: It does not look like https://issues.apache.org/jira/browse/SPARK-6222 made it. It was targeted towards this release. Thanks, Hari On Mon, Apr 6, 2015 at 11:04 AM, York, Brennon brennon.y...@capitalone.com wrote: +1 (non-binding) Tested GraphX, build infrastructure, core test suite on OSX 10.9 w/ Java 1.7/1.8 On 4/6/15, 5:21 AM, Sean Owen so...@cloudera.com wrote: SPARK-6673 is not, in the end, relevant for 1.3.x I believe; we just resolved it for 1.4 anyway. False alarm there. I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick it up if there's another RC, but by itself is not something that needs a new RC. (I will give the same treatment to branch 1.2 if needed in light of the 1.2.2 release.) I applied the simple change in SPARK-6205 in order to continue executing tests and all was well. I still see a few failures in Hive tests: - show_create_table_serde *** FAILED *** - show_tblproperties *** FAILED *** - udf_std *** FAILED *** - udf_stddev *** FAILED *** with ... mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 -DskipTests clean package; mvn -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Dhadoop.version=2.6.0 test ... but these are not regressions from 1.3.0. +1 from me at this point on the current artifacts. On Sun, Apr 5, 2015 at 9:24 AM, Sean Owen so...@cloudera.com wrote: Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for a Hadoop 2.6 + YARN + Hive profile. I still see the UISeleniumSuite test failure observed in 1.3.0, which is minor and already fixed. I don't know why I didn't back-port it: https://issues.apache.org/jira/browse/SPARK-6205 If we roll another, let's get this easy fix in, but it is only an issue with tests. On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and all look legitimate (e.g. reopened or in progress) There is 1 open Blocker for 1.3.1 per Andrew: https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't start even when spark was built in Windows I believe this can be resolved quickly but as a matter of hygiene should be fixed or demoted before release. 
FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth examining before release to see how critical they are:
SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application,,Open,4/3/15
SPARK-6484,Ganglia metrics xml reporter doesn't escape correctly,Josh Rosen,Open,3/24/15
SPARK-6270,Standalone Master hangs when streaming job completes,,Open,3/11/15
SPARK-6209,ExecutorClassLoader can leak connections after failing to load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
SPARK-5113,Audit and document use of hostnames and IP addresses in Spark,,Open,3/24/15
SPARK-5098,Number of running tasks become negative after tasks lost,,Open,1/14/15
SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick Wendell,Reopened,3/23/15
SPARK-4922,Support dynamic allocation for coarse-grained Mesos,,Open,3/31/15
SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge instances,,Open,1/27/15
SPARK-4879,Missing output partitions after job completes with speculative execution,Josh Rosen,Open,3/5/15
SPARK-4751,Support dynamic allocation for standalone mode,Andrew Or,Open,12/22/14
SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
SPARK-4452,Shuffle data structures can starve others on the same thread for memory,Tianshuo Deng,Open,1/24/15
SPARK-4352,Incorporate locality preferences in dynamic allocation requests,,Open,1/26/15
SPARK-4227,Document external shuffle service,,Open,3/23/15
SPARK-3650,Triangle Count handles reverse edges incorrectly,,Open,2/23/15
On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851 The list of fixes present in this release can be found at: http://bit.ly/1C2nVPY The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.3.1-rc1/ Release
Re: PR Builder timing out due to ivy cache lock
Here you are: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28571/consoleFull On Fri, Mar 13, 2015 at 11:58 AM, shane knapp skn...@berkeley.edu wrote: link to a build, please? On Fri, Mar 13, 2015 at 11:53 AM, Hari Shreedharan hshreedha...@cloudera.com wrote: Looks like something has been causing the PR Builder to time out since this morning, with the ivy cache being locked. Any idea what is happening?
Re: Improving metadata in Spark JIRA
+1. Jira cleanup would be good. Please let me know if I can help in some way! Thanks, Hari On Fri, Feb 6, 2015 at 11:56 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Do we need some new components to be added to the JIRA project? Like: - scheduler - YARN - spark-submit - …? Nick On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: +9000 on cleaning up JIRA. Thank you Sean for laying out some specific things to tackle. I will assist with this. Regarding email, I think Sandy is right. I only get JIRA email for issues I'm watching. Nick On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com wrote: JIRA updates don't go to this list, they go to iss...@spark.apache.org. I don't think many are signed up for that list, and those that are probably have a flood of emails anyway. So I'd definitely be in favor of any JIRA cleanup that you're up for. -Sandy On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote: I've wasted no time in wielding the commit bit to complete a number of small, uncontroversial changes. I wouldn't commit anything that didn't already appear to have review, consensus and little risk, but please let me know if anything looked a little too bold, so I can calibrate. Anyway, I'd like to continue some small house-cleaning by improving the state of JIRA's metadata, in order to let it give us a little clearer view on what's happening in the project: a. Add Component to every (open) issue that's missing one b. Review all Critical / Blocker issues to de-escalate ones that seem obviously neither c. Correct open issues that list a Fix version that has already been released d. Close all issues Resolved for a release that has already been released The problem with doing so is that it will create a tremendous amount of email to the list, like, several hundred. It's possible to make bulk changes and suppress e-mail though, which could be done for all but b. Better to suppress the emails when making such changes? or just not bother on some of these? - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Welcoming three new committers
Congrats Cheng, Joseph and Sean! Well done! Thanks, Hari On Tue, Feb 3, 2015 at 2:55 PM, Ted Yu yuzhih...@gmail.com wrote: Congratulations, Cheng, Joseph and Sean. On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Congratulations guys! On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley and Sean Owen. All three have been major contributors to Spark in the past year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces throughout Spark Core. Join me in welcoming them as committers! Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Which committers care about Kafka?
In general such discussions happen or are posted on the dev lists. Could you please post a summary? Thanks. Thanks, Hari On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org wrote: After a long talk with Patrick and TD (thanks guys), I opened the following jira https://issues.apache.org/jira/browse/SPARK-4964 Sample PR has an implementation for the batch and the dstream case, and a link to a project with example usage. On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote: yup, we at tresata do the idempotent store the same way. very simple approach. On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote: That KafkaRDD code is dead simple. Given a user specified map (topic1, partition0) -> (startingOffset, endingOffset) (topic1, partition1) -> (startingOffset, endingOffset) ... turn each one of those entries into a partition of an rdd, using the simple consumer. That's it. No recovery logic, no state, nothing - for any failures, bail on the rdd and let it retry. Spark stays out of the business of being a distributed database. The client code does any transformation it wants, then stores the data and offsets. There are two ways of doing this, either based on idempotence or a transactional data store. For idempotent stores: 1. manipulate data 2. save data to store 3. save ending offsets to the same store If you fail between 2 and 3, the offsets haven't been stored, you start again at the same beginning offsets, do the same calculations in the same order, overwrite the same data, all is good. For transactional stores: 1. manipulate data 2. begin transaction 3. save data to the store 4. save offsets 5. commit transaction If you fail before 5, the transaction rolls back. To make this less heavyweight, you can write the data outside the transaction and then update a pointer to the current data inside the transaction. Again, spark has nothing much to do with guaranteeing exactly once. In fact, the current streaming api actively impedes my ability to do the above. I'm just suggesting providing an api that doesn't get in the way of exactly-once. On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Can you explain your basic algorithm for the once-only-delivery? It is quite a bit of very Kafka-specific code that would take more time to read than I can currently afford. If you can explain your algorithm a bit, it might help. Thanks, Hari On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org wrote: The problems you guys are discussing come from trying to store state in spark, so don't do that. Spark isn't a distributed database. Just map kafka partitions directly to rdds, let user code specify the range of offsets explicitly, and let them be in charge of committing offsets. Using the simple consumer isn't that bad, I'm already using this in production with the code I linked to, and tresata apparently has been as well. Again, for everyone saying this is impossible, have you read either of those implementations and looked at the approach? On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Please feel free to correct me if I'm wrong, but I think the exactly once spark streaming semantics can easily be solved using updateStateByKey. Make the key going into updateStateByKey be a hash of the event, or pluck off some uuid from the message. 
The updateFunc would only emit the message if the key did not exist, and the user has complete control over the window of time / state lifecycle for detecting duplicates. It also makes it really easy to detect and take action (alert?) when you DO see a duplicate, or make memory tradeoffs within an error bound using a sketch algorithm. The kafka simple consumer is insanely complex, if possible I think it would be better (and vastly more flexible) to get reliability using the primitives that spark so elegantly provides. Cheers, Sean On Dec 19, 2014, at 12:06 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Hi Dibyendu, Thanks for the details on the implementation. But I still do not believe that it is no duplicates - what they achieve is that the same batch is processed exactly the same way every time (but see it may be processed more than once) - so it depends on the operation being idempotent. I believe Trident uses ZK to keep track of the transactions - a batch can be processed multiple times in failure scenarios (for example, the transaction is processed but before ZK is updated the machine fails, causing a new node to process it again). I don't think it is impossible to do this in Spark Streaming as well and I'd be really interested in working on it at some point in the near future. On Fri, Dec 19, 2014 at 1
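To make Sean's updateStateByKey suggestion above a bit more concrete, a minimal sketch of the de-duplication idea could look like the following. This is my own illustration, not code from the thread; keying by hashCode and the dedupe helper are placeholders - a real job would pluck a proper uuid off the message, enable checkpointing, and age out old keys to bound the state:

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations such as updateStateByKey
    import org.apache.spark.streaming.dstream.DStream

    // State per key: (the event, whether the current batch is the first time we have seen it)
    def dedupe(events: DStream[String]): DStream[String] = {
      events
        .map(e => (e.hashCode, e))                          // key each event by a hash (or an embedded uuid)
        .updateStateByKey[(String, Boolean)] { (newEvents: Seq[String], state: Option[(String, Boolean)]) =>
          state match {
            case Some((seen, _)) => Some((seen, false))     // key already seen: keep state, mark as duplicate
            case None            => newEvents.headOption.map(e => (e, true))  // first sighting: emit it
          }
        }
        .filter { case (_, (_, isNew)) => isNew }           // drop (or alert on) duplicates
        .map { case (_, (e, _)) => e }
    }

As Sean notes, this also gives a natural hook for alerting on duplicates or trading memory for accuracy with a sketch algorithm, but the state has to be expired eventually or it grows without bound.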
Re: Which committers care about Kafka?
Hi Dibyendu, Thanks for the details on the implementation. But I still do not believe that it is no duplicates - what they achieve is that the same batch is processed exactly the same way every time (but see it may be processed more than once) - so it depends on the operation being idempotent. I believe Trident uses ZK to keep track of the transactions - a batch can be processed multiple times in failure scenarios (for example, the transaction is processed but before ZK is updated the machine fails, causing a new node to process it again). I don't think it is impossible to do this in Spark Streaming as well and I'd be really interested in working on it at some point in the near future. On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi, Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm Trident has done the exact-once guarantee by processing the tuple in a batch and assigning same transaction-id for a given batch . The replay for a given batch with a transaction-id will have exact same set of tuples and replay of batches happen in exact same order before the failure. Having this paradigm, if downstream system process data for a given batch for having a given transaction-id , and if during failure if same batch is again emitted , you can check if same transaction-id is already processed or not and hence can guarantee exact once semantics. And this can only be achieved in Spark if we use Low Level Kafka consumer API to process the offsets. This low level Kafka Consumer ( https://github.com/dibbhatt/kafka-spark-consumer) has implemented the Spark Kafka consumer which uses Kafka Low Level APIs . All of the Kafka related logic has been taken from Storm-Kafka spout and which manages all Kafka re-balance and fault tolerant aspects and Kafka metadata managements. Presently this Consumer maintains that during Receiver failure, it will re-emit the exact same Block with same set of messages . Every message have the details of its partition, offset and topic related details which can tackle the SPARK-3146. As this Low Level consumer has complete control over the Kafka Offsets , we can implement Trident like feature on top of it like having implement a transaction-id for a given block , and re-emit the same block with same set of message during Driver failure. Regards, Dibyendu On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi all, I agree with Hari that Strong exact-once semantics is very hard to guarantee, especially in the failure situation. From my understanding even current implementation of ReliableKafkaReceiver cannot fully guarantee the exact once semantics once failed, first is the ordering of data replaying from last checkpoint, this is hard to guarantee when multiple partitions are injected in; second is the design complexity of achieving this, you can refer to the Kafka Spout in Trident, we have to dig into the very details of Kafka metadata management system to achieve this, not to say rebalance and fault-tolerance. Thanks Jerry -Original Message- From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com] Sent: Friday, December 19, 2014 5:57 AM To: Cody Koeninger Cc: Hari Shreedharan; Patrick Wendell; dev@spark.apache.org Subject: Re: Which committers care about Kafka? But idempotency is not that easy t achieve sometimes. A strong only once semantic through a proper API would be superuseful; but I'm not implying this is easy to achieve. 
On 18 Dec 2014 21:52, Cody Koeninger c...@koeninger.org wrote: If the downstream store for the output data is idempotent or transactional, and that downstream store also is the system of record for kafka offsets, then you have exactly-once semantics. Commit offsets with / after the data is stored. On any failure, restart from the last committed offsets. Yes, this approach is biased towards the etl-like use cases rather than near-realtime-analytics use cases. On Thu, Dec 18, 2014 at 3:27 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: I get what you are saying. But getting exactly once right is an extremely hard problem - especially in presence of failure. The issue is failures can happen in a bunch of places. For example, before the notification of downstream store being successful reaches the receiver that updates the offsets, the node fails. The store was successful, but duplicates came in either way. This is something worth discussing by itself - but without uuids etc this might not really be solved even when you think it is. Anyway, I will look at the links. Even I am interested in all of the features you mentioned - no HDFS WAL for Kafka and once-only delivery, but I doubt the latter is really possible to guarantee - though I really would love to have that! Thanks, Hari On Thu, Dec 18, 2014
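To spell out the pattern Cody describes above - commit the offsets together with (or right after) the data in the downstream store, and restart from the store's offsets on failure - a rough sketch against a hypothetical transactional store could look like this (the TxStore trait and its methods are made up for illustration; they are not a Spark or Kafka API):

    case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

    trait TxStore {                                  // hypothetical downstream store
      def lastCommittedOffsets(): Seq[OffsetRange]   // the store, not ZK, is the system of record
      def begin(): Unit
      def saveResults(rows: Iterator[String]): Unit
      def saveOffsets(offsets: Seq[OffsetRange]): Unit
      def commit(): Unit
      def rollback(): Unit
    }

    def processOneBatch(store: TxStore,
                        nextRanges: Seq[OffsetRange] => Seq[OffsetRange],
                        fetchAndTransform: Seq[OffsetRange] => Iterator[String]): Unit = {
      val ranges  = nextRanges(store.lastCommittedOffsets())  // 1. resume from the last committed offsets
      val results = fetchAndTransform(ranges)                 // 2. read exactly that range from Kafka and transform it
      store.begin()                                           // 3. data and offsets go into one transaction
      try {
        store.saveResults(results)
        store.saveOffsets(ranges)
        store.commit()                                        // 4. either both become visible, or neither does
      } catch {
        case e: Exception => store.rollback(); throw e        // 5. on failure, the same ranges are replayed
      }
    }

For an idempotent store the begin/commit/rollback calls drop away: overwrite the same keys, write the ending offsets last, and a replay of the same range simply rewrites the same data.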
Re: Which committers care about Kafka?
Can you explain your basic algorithm for the once-only-delivery? It is quite a bit of very Kafka-specific code that would take more time to read than I can currently afford. If you can explain your algorithm a bit, it might help. Thanks, Hari On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org wrote: The problems you guys are discussing come from trying to store state in spark, so don't do that. Spark isn't a distributed database. Just map kafka partitions directly to rdds, let user code specify the range of offsets explicitly, and let them be in charge of committing offsets. Using the simple consumer isn't that bad, I'm already using this in production with the code I linked to, and tresata apparently has been as well. Again, for everyone saying this is impossible, have you read either of those implementations and looked at the approach? On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Please feel free to correct me if I’m wrong, but I think the exactly once spark streaming semantics can easily be solved using updateStateByKey. Make the key going into updateStateByKey be a hash of the event, or pluck off some uuid from the message. The updateFunc would only emit the message if the key did not exist, and the user has complete control over the window of time / state lifecycle for detecting duplicates. It also makes it really easy to detect and take action (alert?) when you DO see a duplicate, or make memory tradeoffs within an error bound using a sketch algorithm. The kafka simple consumer is insanely complex, if possible I think it would be better (and vastly more flexible) to get reliability using the primitives that spark so elegantly provides. Cheers, Sean On Dec 19, 2014, at 12:06 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Hi Dibyendu, Thanks for the details on the implementation. But I still do not believe that it guarantees no duplicates - what they achieve is that the same batch is processed exactly the same way every time (note that it may be processed more than once) - so it depends on the operation being idempotent. I believe Trident uses ZK to keep track of the transactions - a batch can be processed multiple times in failure scenarios (for example, the transaction is processed but before ZK is updated the machine fails, causing a new node to process it again). I don't think it is impossible to do this in Spark Streaming as well and I'd be really interested in working on it at some point in the near future. On Fri, Dec 19, 2014 at 1:44 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi, Thanks to Jerry for mentioning the Kafka Spout for Trident. The Storm Trident has done the exact-once guarantee by processing the tuple in a batch and assigning same transaction-id for a given batch . The replay for a given batch with a transaction-id will have exact same set of tuples and replay of batches happen in exact same order before the failure. Having this paradigm, if downstream system process data for a given batch for having a given transaction-id , and if during failure if same batch is again emitted , you can check if same transaction-id is already processed or not and hence can guarantee exact once semantics. And this can only be achieved in Spark if we use Low Level Kafka consumer API to process the offsets. This low level Kafka Consumer ( https://github.com/dibbhatt/kafka-spark-consumer) has implemented the Spark Kafka consumer which uses Kafka Low Level APIs . 
All of the Kafka related logic has been taken from Storm-Kafka spout and which manages all Kafka re-balance and fault tolerant aspects and Kafka metadata managements. Presently this Consumer maintains that during Receiver failure, it will re-emit the exact same Block with same set of messages . Every message have the details of its partition, offset and topic related details which can tackle the SPARK-3146. As this Low Level consumer has complete control over the Kafka Offsets , we can implement Trident like feature on top of it like having implement a transaction-id for a given block , and re-emit the same block with same set of message during Driver failure. Regards, Dibyendu On Fri, Dec 19, 2014 at 7:33 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi all, I agree with Hari that Strong exact-once semantics is very hard to guarantee, especially in the failure situation. From my understanding even current implementation of ReliableKafkaReceiver cannot fully guarantee the exact once semantics once failed, first is the ordering of data replaying from last checkpoint, this is hard to guarantee when multiple partitions are injected in; second is the design complexity of achieving this, you can refer to the Kafka Spout in Trident, we have to dig into the very details of Kafka
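For what it's worth, the "dead simple" KafkaRDD shape Cody describes - turn each user-specified (topic, partition) -> (startingOffset, endingOffset) entry into one RDD partition and let a failed task simply re-read the same range - looks roughly like the sketch below. This is only an illustration, not the code Cody linked to; the fetchRange function stands in for the SimpleConsumer plumbing:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One RDD partition per user-specified offset range.
    case class KafkaOffsetPartition(index: Int, topic: String, partition: Int,
                                    fromOffset: Long, untilOffset: Long) extends Partition

    class SimpleKafkaRDD(
        sc: SparkContext,
        offsetRanges: Seq[(String, Int, Long, Long)],                    // (topic, partition, from, until)
        fetchRange: (String, Int, Long, Long) => Iterator[Array[Byte]])  // stand-in for the SimpleConsumer fetch
      extends RDD[Array[Byte]](sc, Nil) {

      override def getPartitions: Array[Partition] =
        offsetRanges.zipWithIndex.map { case ((topic, part, from, until), i) =>
          KafkaOffsetPartition(i, topic, part, from, until): Partition
        }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
        val p = split.asInstanceOf[KafkaOffsetPartition]
        // No state, no recovery logic: if the task fails, Spark just re-reads the same fixed range.
        fetchRange(p.topic, p.partition, p.fromOffset, p.untilOffset)
      }
    }

Because the offsets that define each partition are supplied by the caller, the caller can also store them wherever it stores the results, which is what makes the idempotent/transactional patterns discussed above possible.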
Re: Which committers care about Kafka?
I get what you are saying. But getting exactly once right is an extremely hard problem - especially in presence of failure. The issue is failures can happen in a bunch of places. For example, before the notification of downstream store being successful reaches the receiver that updates the offsets, the node fails. The store was successful, but duplicates came in either way. This is something worth discussing by itself - but without uuids etc this might not really be solved even when you think it is. Anyway, I will look at the links. Even I am interested in all of the features you mentioned - no HDFS WAL for Kafka and once-only delivery, but I doubt the latter is really possible to guarantee - though I really would love to have that! Thanks, Hari On Thu, Dec 18, 2014 at 12:26 PM, Cody Koeninger c...@koeninger.org wrote: Thanks for the replies. Regarding skipping WAL, it's not just about optimization. If you actually want exactly-once semantics, you need control of kafka offsets as well, including the ability to not use zookeeper as the system of record for offsets. Kafka already is a reliable system that has strong ordering guarantees (within a partition) and does not mandate the use of zookeeper to store offsets. I think there should be a spark api that acts as a very simple intermediary between Kafka and the user's choice of downstream store. Take a look at the links I posted - if there's already been 2 independent implementations of the idea, chances are it's something people need. On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Hi Cody, I am an absolute +1 on SPARK-3146. I think we can implement something pretty simple and lightweight for that one. For the Kafka DStream skipping the WAL implementation - this is something I discussed with TD a few weeks ago. Though it is a good idea to implement this to avoid unnecessary HDFS writes, it is an optimization. For that reason, we must be careful in implementation. There are a couple of issues that we need to ensure works properly - specifically ordering. To ensure we pull messages from different topics and partitions in the same order after failure, we’d still have to persist the metadata to HDFS (or some other system) - this metadata must contain the order of messages consumed, so we know how to re-read the messages. I am planning to explore this once I have some time (probably in Jan). In addition, we must also ensure bucketing functions work fine as well. I will file a placeholder jira for this one. I also wrote an API to write data back to Kafka a while back - https://github.com/apache/spark/pull/2994 . I am hoping that this will get pulled in soon, as this is something I know people want. I am open to feedback on that - anything that I can do to make it better. Thanks, Hari On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Cody, Thanks for reaching out with this. The lead on streaming is TD - he is traveling this week though so I can respond a bit. To the high level point of whether Kafka is important - it definitely is. Something like 80% of Spark Streaming deployments (anecdotally) ingest data from Kafka. Also, good support for Kafka is something we generally want in Spark and not a library. In some cases IIRC there were user libraries that used unstable Kafka API's and we were somewhat waiting on Kafka to stabilize them to merge things upstream. Otherwise users wouldn't be able to use newer Kakfa versions. 
This is a high level impression only though, I haven't talked to TD about this recently so it's worth revisiting given the developments in Kafka. Please do bring things up like this on the dev list if there are blockers for your usage - thanks for pinging it. - Patrick On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger c...@koeninger.org wrote: Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka related issues resolved? The existing api is not sufficiently safe nor flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to fix the same things we've been running into. Regarding flexibility https://issues.apache.org/jira/browse/SPARK-3146 has been outstanding since August, and IMHO an equivalent of this is absolutely necessary. We wrote a similar patch ourselves, then found that PR and have been running it in production. We wouldn't be able to get our jobs done without it. It also allows users to solve a whole class of problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc). Regarding safety, I understand the motivation behind WriteAheadLog as a general solution for streaming unreliable sources, but Kafka already is a reliable source. I think there's a need for an api that treats it as such. Even aside from the performance
Re: Has anyone else observed this build break?
Seems like a comment on that page mentions a fix, which would add yet another profile though — specifically telling mvn that if it is an apple jdk, use the classes.jar as the tools.jar as well, since Apple-packaged JDK 6 bundled them together. Link: http://permalink.gmane.org/gmane.comp.java.maven-plugins.mojo.user/4320 I didn’t test it, but maybe this can fix it? Thanks, Hari On Fri, Nov 14, 2014 at 12:21 PM, Patrick Wendell pwend...@gmail.com wrote: A work around for this fix is identified here: http://dbknickerbocker.blogspot.com/2013/04/simple-fix-to-missing-toolsjar-in-jdk.html However, if this affects more users I'd prefer to just fix it properly in our build. On Fri, Nov 14, 2014 at 12:17 PM, Patrick Wendell pwend...@gmail.com wrote: A recent patch broke clean builds for me, I am trying to see how widespread this issue is and whether we need to revert the patch. The error I've seen is this when building the examples project: spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar The reason for this error is that hbase-annotations is using a system scoped dependency in their hbase-annotations pom, and this doesn't work with certain JDK layouts such as that provided on Mac OS: http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom Has anyone else seen this or is it just me? - Patrick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark-Submit issues
Yep, you’d need to shade jars to ensure all your dependencies are in the classpath. Thanks, Hari On Wed, Nov 12, 2014 at 3:23 AM, Ted Malaska ted.mala...@cloudera.com wrote: Hey this is Ted Are you using Shade when you build your jar and are you using the bigger jar? Looks like classes are not included in you jar. On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote: Hi Hari, Now Iam trying out the same FlumeEventCount example running with spark-submit Instead of run example. The steps I followed is that I have exported the JavaFlumeEventCount.java into jar. The command used is ./bin/spark-submit --jars lib/spark-examples-1.1.0-hadoop1.0.4.jar --master local --class org.JavaFlumeEventCount bin/flumeeventcnt2.jar localhost 2323 The output is 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time 1415795102000 If I use this command ./bin/spark-submit --master local --class org.JavaFlumeEventCount bin/flumeeventcnt2.jar localhost 2323 Then I get an error Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/examples/streaming/StreamingExamples at org.JavaFlumeEventCount.main(JavaFlumeEventCount.java:22) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.spark.examples.streaming.StreamingExamples at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at java.lang.ClassLoader.loadClass(ClassLoader.java:356) ... 8 more I Just wanted to ask is that it is able to find spark-assembly.jar but why not spark-example.jar. The next doubt is while running FlumeEventCount example through runexample I get an output as Received 4 flume events. 14/11/12 18:30:14 INFO scheduler.JobScheduler: Finished job streaming job 1415797214000 ms.0 from job set of time 1415797214000 ms 14/11/12 18:30:14 INFO rdd.MappedRDD: Removing RDD 70 from persistence list But If I run the same program through Spark-Submit I get an output as 14/11/12 17:55:02 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks 14/11/12 17:55:02 INFO scheduler.JobScheduler: Added jobs for time 1415795102000 So I need a clarification, since in the program the printing statement is written as Received n flume events. So how come Iam able to see as Stream 0 received n blocks. And what is the difference of running the program through spark-submit and run-example. Awaiting for your kind reply Regards, Jeniba Johnson The contents of this e-mail and any attachment(s) may contain confidential or privileged information for the intended recipient(s). 
Unintended recipients are prohibited from taking action on the basis of information in this e-mail and using or disseminating the information, and must notify the sender and delete it from their system. LT Infotech will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in this e-mail
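For reference, the "bigger jar" Ted mentions is a fat/assembly jar that bundles the application's own dependencies. A minimal sbt sketch is below; the sbt-assembly version and the exact Spark/Flume artifact versions shown are assumptions, and with Maven you would use the maven-shade-plugin instead:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

    // build.sbt: mark Spark itself as "provided" (spark-submit already puts it on the classpath)
    // and bundle everything else the job needs, such as the Flume integration.
    name := "flume-event-count"
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming-flume" % "1.1.0"      // goes into the assembly
    )

Running sbt assembly then produces a single jar to pass to spark-submit. Note that the NoClassDefFoundError above is for an example helper class from spark-examples; that reference would either need to be removed from the user code or its jar added as a bundled dependency.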
Re: Bind exception while running FlumeEventCount
Looks like that port is not available because another app is using that port. Can you take a look at netstat -a and use a port that is free? Thanks, Hari On Fri, Nov 7, 2014 at 2:05 PM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote: Hi, I have installed spark-1.1.0 and apache flume 1.4 for running streaming example FlumeEventCount. Previously the code was working fine. Now Iam facing with the below mentioned issues. My flume is running properly it is able to write the file. The command I use is bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 172.29.17.178 65001 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 14/11/07 23:19:23 INFO flume.FlumeReceiver: Flume receiver stopped 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Deregistering receiver 0 14/11/07 23:19:23 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:68) at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:164) at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:171) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:344) at sun.nio.ch.Net.bind(Net.java:336) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290) at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42) ... 
3 more 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopping BlockGenerator 14/11/07 23:19:23 INFO util.RecurringTimer: Stopped timer for BlockGenerator after time 1415382563200 14/11/07 23:19:23 INFO receiver.BlockGenerator: Waiting for block pushing thread 14/11/07 23:19:23 INFO receiver.BlockGenerator: Pushing out the last 0 blocks 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped block pushing thread 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped BlockGenerator 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Waiting for executor stop is over 14/11/07 23:19:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 14/11/07 23:19:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119) at
RE: Bind exception while running FlumeEventCount
First, can you try a different port? TIME_WAIT is basically a timeout for a socket to be completely decommissioned before the port becomes available for binding again. If you wait for a few minutes and still see a startup issue, can you also send the error logs? From what I can see, the port seems to be in use. Thanks, Hari
Re: Bind exception while running FlumeEventCount
Did you start a Flume agent to push data to the relevant port? Thanks, Hari On Fri, Nov 7, 2014 at 2:05 PM, Jeniba Johnson jeniba.john...@lntinfotech.com wrote: Hi, I have installed spark-1.1.0 and apache flume 1.4 for running streaming example FlumeEventCount. Previously the code was working fine. Now Iam facing with the below mentioned issues. My flume is running properly it is able to write the file. The command I use is bin/run-example org.apache.spark.examples.streaming.FlumeEventCount 172.29.17.178 65001 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 14/11/07 23:19:23 INFO flume.FlumeReceiver: Flume receiver stopped 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Called receiver onStop 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Deregistering receiver 0 14/11/07 23:19:23 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:68) at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:164) at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:171) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:344) at sun.nio.ch.Net.bind(Net.java:336) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:199) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290) at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42) ... 
3 more 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Stopped receiver 0 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopping BlockGenerator 14/11/07 23:19:23 INFO util.RecurringTimer: Stopped timer for BlockGenerator after time 1415382563200 14/11/07 23:19:23 INFO receiver.BlockGenerator: Waiting for block pushing thread 14/11/07 23:19:23 INFO receiver.BlockGenerator: Pushing out the last 0 blocks 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped block pushing thread 14/11/07 23:19:23 INFO receiver.BlockGenerator: Stopped BlockGenerator 14/11/07 23:19:23 INFO receiver.ReceiverSupervisorImpl: Waiting for executor stop is over 14/11/07 23:19:23 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 14/11/07 23:19:23 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) org.jboss.netty.channel.ChannelException: Failed to bind to: /172.29.17.178:65001 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:106) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:119) at org.apache.avro.ipc.NettyServer.init(NettyServer.java:74) at
Re: [VOTE] Designating maintainers for some Spark components
In Cloudstack, I believe one becomes a maintainer first for a subset of modules, before he/she becomes a proven maintainer who has commit rights on the entire source tree. So would it make sense to go that route, and have committers voted in as maintainers for certain parts of the codebase and then eventually become proven maintainers (though this might have to be honor-code based, since I don't think git allows per-module commit rights)? Thanks, Hari On Thu, Nov 6, 2014 at 3:45 PM, Patrick Wendell pwend...@gmail.com wrote: I think new committers might or might not be maintainers (it would depend on the PMC vote). I don't think it would affect what you could merge, you can merge in any part of the source tree, you just need to get sign off if you want to touch a public API or make major architectural changes. Most projects already require code review from other committers before you commit something, so it's just a version of that where you have specific people appointed to specific components for review. If you look, most large software projects have a maintainer model, both in Apache and outside of it. Cloudstack is probably the best example in Apache since they are the second most active project (roughly) after Spark. They have two levels of maintainers and much stronger language - their language: "In general, maintainers only have commit rights on the module for which they are responsible." I'd like us to start with something simpler and lightweight as proposed here. Really the proposal on the table is just to codify the current de-facto process to make sure we stick by it as we scale. If we want to add more formality to it or strictness, we can do it later. - Patrick On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: How would this model work with a new committer who gets voted in? Does it mean that a new committer would be a maintainer for at least one area -- else we could end up having committers who really can't merge anything significant until they become maintainers. Thanks, Hari On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I think you're misunderstanding the idea of process here. The point of process is to make sure something happens automatically, which is useful to ensure a certain level of quality. For example, all our patches go through Jenkins, and nobody will make the mistake of merging them if they fail tests, or RAT checks, or API compatibility checks. The idea is to get the same kind of automation for design on these components. This is a very common process for large software projects, and it's essentially what we had already, but formalizing it will make clear that this is the process we want. It's important to do it early in order to be able to refine the process as the project grows. In terms of scope, again, the maintainers are *not* going to be the only reviewers for that component, they are just a second level of sign-off required for architecture and API. Being a maintainer is also not a promotion, it's a responsibility. Since we don't have much experience yet with this model, I didn't propose automatic rules beyond that the PMC can add / remove maintainers -- presumably the PMC is in the best position to know what the project needs. I think automatic rules are exactly the kind of process you're arguing against. The process here is about ensuring certain checks are made for every code change, not about automating personnel and development decisions. 
In any case, I appreciate your input on this, and we're going to evaluate the model to see how it goes. It might be that we decide we don't want it at all. However, from what I've seen of other projects (not Hadoop but projects with an order of magnitude more contributors, like Python or Linux), this is one of the best ways to have consistently great releases with a large contributor base and little room for error. With all due respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive for; in my experience there I've seen patches reverted because of architectural disagreements, new APIs released and abandoned, and generally an experience that's been painful for users. A lot of the decisions we've made in Spark (e.g. time-based release cycle, built-in libraries, API stability rules, etc) were based on lessons learned there, in an attempt to define a better model. Matei

On Nov 6, 2014, at 2:18 PM, bc Wong bcwal...@cloudera.com wrote:

On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote: snip Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may
Re: Build fails on master (f90ad5d)
I have seen this with sbt sometimes. I usually do an sbt clean and that fixes it. Thanks, Hari

On Tue, Nov 4, 2014 at 3:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: FWIW, the official build instructions are here: https://github.com/apache/spark#building-spark

On Tue, Nov 4, 2014 at 5:11 PM, Ted Yu yuzhih...@gmail.com wrote: I built based on this commit today and the build was successful. What command did you use? Cheers

On Tue, Nov 4, 2014 at 2:08 PM, Alessandro Baretta alexbare...@gmail.com wrote: Fellow Sparkers, I am new here and still trying to learn to crawl. Please, bear with me. I just pulled f90ad5d from https://github.com/apache/spark.git and am running the compile command in the sbt shell. This is the error I'm seeing:

[error] /home/alex/git/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala:32: object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.catalyst.types._
[error] ^

Am I doing something obscenely stupid, or is the build genuinely broken? Alex
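For anyone hitting the same "object sql is not a member of package org.apache.spark" error, a minimal sketch of the clean rebuild Hari suggests, assuming a checkout of apache/spark with the bundled sbt launcher at sbt/sbt (the launcher path is an assumption based on the source layout of that era):

    # Drop stale incremental-compile state, then rebuild from scratch.
    ./sbt/sbt clean
    ./sbt/sbt compile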
Re: Moving PR Builder to mvn
I have a zinc server running on my Mac, and Maven compilation is much faster than before I had it running. Is the sbt build still faster (sorry, it's been a long time since I did a build with sbt)? Thanks, Hari

On Fri, Oct 24, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: Overall I think this would be a good idea. The main blocker is just that I think the Maven build is much slower right now than the SBT build. However, if we were able to e.g. parallelize the test build on Jenkins that might make up for it. I'd actually like to have a trigger where we could test pull requests with either one. - Patrick

On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Over the last few months, it seems like we have selected Maven to be the official build system for Spark. I realize that removing the sbt build may not be easy, but it might be a good idea to start looking into that. We had issues over the past few days where mvn builds were fine, while sbt was failing to resolve dependencies that were test-jars, causing compilation of certain tests to fail. As a first step, I am wondering if it might be a good idea to change the PR builder to mvn and test PRs consistently with the way we test releases. I am not sure how technically feasible it is, but it would be a start to standardizing on one build system. Thanks, Hari
Re: Moving PR Builder to mvn
+1. From what I can see, it definitely does, though I must say I rarely do full end-to-end builds. Maybe worth running as an experiment? Thanks, Hari

On Fri, Oct 24, 2014 at 2:34 PM, Stephen Boesch java...@gmail.com wrote: Zinc absolutely helps - it feels like it makes builds more than twice as fast - both on Mac and Linux. It helps both on fresh and existing builds.

2014-10-24 14:06 GMT-07:00 Patrick Wendell pwend...@gmail.com: Does Zinc still help if you are just running a single totally fresh build? For the pull request builder we purge all state from previous builds. - Patrick

On Fri, Oct 24, 2014 at 1:55 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: I have a zinc server running on my Mac, and Maven compilation is much faster than before I had it running. Is the sbt build still faster (sorry, it's been a long time since I did a build with sbt)? Thanks, Hari

On Fri, Oct 24, 2014 at 1:46 PM, Patrick Wendell pwend...@gmail.com wrote: Overall I think this would be a good idea. The main blocker is just that I think the Maven build is much slower right now than the SBT build. However, if we were able to e.g. parallelize the test build on Jenkins that might make up for it. I'd actually like to have a trigger where we could test pull requests with either one. - Patrick

On Fri, Oct 24, 2014 at 1:39 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: Over the last few months, it seems like we have selected Maven to be the official build system for Spark. I realize that removing the sbt build may not be easy, but it might be a good idea to start looking into that. We had issues over the past few days where mvn builds were fine, while sbt was failing to resolve dependencies that were test-jars, causing compilation of certain tests to fail. As a first step, I am wondering if it might be a good idea to change the PR builder to mvn and test PRs consistently with the way we test releases. I am not sure how technically feasible it is, but it would be a start to standardizing on one build system. Thanks, Hari - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
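For anyone wanting to reproduce the comparison, a rough sketch of the zinc-backed Maven build under discussion; it assumes zinc has been installed separately (for example via Homebrew), and the memory settings are illustrative rather than taken from the thread:

    # Start a long-running zinc server so Maven can reuse a warm Scala compiler
    # across compilations instead of paying JVM/compiler startup cost each time.
    zinc -start

    # Build as usual; Spark's Maven build is expected to pick up a running zinc
    # server automatically if one is listening on the default port.
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512m"
    mvn -DskipTests clean package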
Re: Building and Running Spark on OS X
The sbt executable that is in the spark repo can be used to build Spark without any other setup (it will download the sbt jars, etc.). Thanks, Hari

On Mon, Oct 20, 2014 at 5:16 PM, Sean Owen so...@cloudera.com wrote: Maven is at least built into OS X (well, with dev tools). You don't even have to brew install it. Surely SBT isn't in the dev tools even? I recall I had to install it. I'd be surprised to hear it required zero setup.

On Mon, Oct 20, 2014 at 8:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Yeah, I would use sbt too, but I thought if I wanted to publish a little reference page for OS X users then I probably should use the "official" build instructions (https://github.com/apache/spark#building-spark). Nick

On Mon, Oct 20, 2014 at 8:00 PM, Reynold Xin r...@databricks.com wrote: I usually use SBT on Mac and that one doesn't require any setup ...

On Mon, Oct 20, 2014 at 4:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: If one were to put together a short but comprehensive guide to setting up Spark to run locally on OS X, would it look like this?

    # Install Maven. On OS X, we suggest using Homebrew.
    brew install maven

    # Set some important Java and Maven environment variables.
    export JAVA_HOME=$(/usr/libexec/java_home)
    export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"

    # Go to where you downloaded the Spark source.
    cd ./spark

    # Build, configure slaves, and start up Spark.
    mvn -DskipTests clean package
    echo localhost > ./conf/slaves
    ./sbin/start-all.sh

    # Rock 'n' Roll.
    ./bin/pyspark

    # Cleanup when you're done.
    ./sbin/stop-all.sh

Nick - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
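As an aside, a sketch of the sbt route Reynold mentions, which avoids installing Maven entirely; the launcher path and targets reflect the layout of that era and are assumptions:

    # From the Spark source root; the bundled launcher fetches sbt itself.
    cd ./spark
    ./sbt/sbt assembly     # build the Spark assembly jar
    ./bin/pyspark          # run the shell against the freshly built assembly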
Re: reference to dstream in package org.apache.spark.streaming which is not available
Sean - I think only the ones in 1726 are enough. It is weird that any class that uses the test-jar actually requires the streaming jar to be added explicitly. Shouldn't Maven take care of this? I posted some comments on the PR. -- Thanks, Hari

Sean Owen so...@cloudera.com, August 22, 2014 at 3:58 PM: Yes, master hasn't compiled for me for a few days. It's fixed in: https://github.com/apache/spark/pull/1726 https://github.com/apache/spark/pull/2075 Could a committer sort this out? Sean - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org

Ted Yu yuzhih...@gmail.com, August 22, 2014 at 1:55 PM: Hi, Using the following command on (refreshed) master branch: mvn clean package -DskipTests I got:

constituent[36]: file:/homes/hortonzy/apache-maven-3.1.1/conf/logging/
---
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in TestSuiteBase.class refers to term dstream in package org.apache.spark.streaming which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling TestSuiteBase.class.
    at scala.reflect.internal.pickling.UnPickler$Scan.toTypeError(UnPickler.scala:847)
    at scala.reflect.internal.pickling.UnPickler$Scan$LazyTypeRef.complete(UnPickler.scala:854)
    at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1231)
    at scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
    at scala.reflect.internal.Types$TypeMap$$anonfun$noChangeToSymbols$1.apply(Types.scala:4280)
    at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
    at scala.collection.immutable.List.forall(List.scala:84)
    at scala.reflect.internal.Types$TypeMap.noChangeToSymbols(Types.scala:4280)
    at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4293)
    at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4196)
    at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
    at scala.reflect.internal.Types$TypeMap.mapOver(Types.scala:4202)
    at scala.reflect.internal.Types$AsSeenFromMap.apply(Types.scala:4638)
    at scala.reflect.internal.Types$Type.asSeenFrom(Types.scala:754)
    at scala.reflect.internal.Types$Type.memberInfo(Types.scala:773)
    at xsbt.ExtractAPI.defDef(ExtractAPI.scala:224)
    at xsbt.ExtractAPI.xsbt$ExtractAPI$$definition(ExtractAPI.scala:315)
    at xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
    at xsbt.ExtractAPI$$anonfun$xsbt$ExtractAPI$$processDefinitions$1.apply(ExtractAPI.scala:296)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
    at xsbt.ExtractAPI.xsbt$ExtractAPI$$processDefinitions(ExtractAPI.scala:296)
    at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
    at xsbt.ExtractAPI$$anonfun$mkStructure$4.apply(ExtractAPI.scala:293)
    at xsbt.Message$$anon$1.apply(Message.scala:8)
    at xsbti.SafeLazy$$anonfun$apply$1.apply(SafeLazy.scala:8)
    at xsbti.SafeLazy$Impl._t$lzycompute(SafeLazy.scala:20)
    at xsbti.SafeLazy$Impl._t(SafeLazy.scala:18)
    at xsbti.SafeLazy$Impl.get(SafeLazy.scala:24)
    at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
    at xsbt.ExtractAPI$$anonfun$forceStructures$1.apply(ExtractAPI.scala:138)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:138)
    at xsbt.ExtractAPI.forceStructures(ExtractAPI.scala:139)
    at xsbt.API$ApiPhase.processScalaUnit(API.scala:54)
    at xsbt.API$ApiPhase.processUnit(API.scala:38)
    at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
    at xsbt.API$ApiPhase$$anonfun$run$1.apply(API.scala:34)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at
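On Hari's question above about why the streaming jar has to be added explicitly: Maven test-jar dependencies do not carry their own transitive dependencies, so a module that consumes spark-streaming's test-jar also has to declare spark-streaming itself. A hedged illustration of what such a pair of declarations looks like (artifact names are based on the Spark modules of that era; this is not necessarily the exact change made in the linked PRs):

    <!-- The main streaming jar must be declared explicitly... -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${project.version}</version>
    </dependency>
    <!-- ...because depending on the test-jar alone does not pull it in. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>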
Re: Spark Avro Generation
Jay, running sbt compile or assembly should generate the sources.

On Monday, August 11, 2014, Devl Devel devl.developm...@gmail.com wrote: Hi, So far I've been managing to build Spark from source, but since a change in spark-streaming-flume I have no idea how to generate classes (e.g. SparkFlumeProtocol) from the avro schema. I have used sbt to run avro:generate (from the top-level spark dir) but it produces nothing - it just says: avro:generate [success] Total time: 0 s, completed Aug 11, 2014 12:26:49 PM. Please can someone send me their build.sbt or just tell me how to build Spark so that all avro files get generated as well? Sorry for the noob question but I really have tried my best on this one! Cheers
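A small sketch of the advice above, assuming a checkout at the repo root; the module path for the flume sink (where the Avro IDL lives) is an assumption based on the source layout of that era:

    # The Avro-generated classes (e.g. SparkFlumeProtocol) are produced as part
    # of a normal compile; there is no separate generation step to invoke by hand.
    ./sbt/sbt compile

    # Roughly equivalent with Maven, building just the flume sink module and its deps:
    mvn -pl external/flume-sink -am package -DskipTests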
Re: sbt/sbt test steals window focus on OS X
Add this to your .bash_profile (or .bashrc) - that will fix it:

    export _JAVA_OPTIONS=-Djava.awt.headless=true

Hari

On Sun, Jul 20, 2014 at 1:56 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I just created SPARK-2602 https://issues.apache.org/jira/browse/SPARK-2602 to track this issue. Are there others who can confirm this is an issue? Also, does this issue extend to other OSes? Nick
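A concrete sketch of applying that fix; whether ~/.bash_profile or ~/.bashrc is the right file depends on your shell setup, and the quoting below is just one way to write it:

    # Make every JVM started from this shell (including forked sbt test JVMs)
    # run headless, so test runs stop stealing window focus on OS X.
    echo 'export _JAVA_OPTIONS="-Djava.awt.headless=true"' >> ~/.bash_profile
    source ~/.bash_profile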