Re: Which committers care about Kafka?

2014-12-25 Thread Cody Koeninger
The conversation was mostly getting TD up to speed on this thread since he
had just gotten back from his trip and hadn't seen it.

The jira has a summary of the requirements we discussed; I'm sure TD or
Patrick can add to the ticket if I missed something.
On Dec 25, 2014 1:54 AM, Hari Shreedharan hshreedha...@cloudera.com
wrote:

 In general such discussions happen on or are posted to the dev lists. Could
 you please post a summary? Thanks.

 Thanks,
 Hari


 On Wed, Dec 24, 2014 at 11:46 PM, Cody Koeninger c...@koeninger.org
 wrote:

  After a long talk with Patrick and TD (thanks guys), I opened the
 following jira

 https://issues.apache.org/jira/browse/SPARK-4964

 The sample PR has an implementation for both the batch and the dstream case, and a
 link to a project with example usage.

 On Fri, Dec 19, 2014 at 4:36 PM, Koert Kuipers ko...@tresata.com wrote:

 yup, we at tresata do the idempotent store the same way. very simple
 approach.

 On Fri, Dec 19, 2014 at 5:32 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That KafkaRDD code is dead simple.

 Given a user specified map

 (topic1, partition0) -> (startingOffset, endingOffset)
 (topic1, partition1) -> (startingOffset, endingOffset)
 ...

 turn each one of those entries into a partition of an rdd, using the simple
 consumer.
 That's it.  No recovery logic, no state, nothing - for any failures, bail
 on the rdd and let it retry.
 Spark stays out of the business of being a distributed database.
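
 For illustration, a minimal Scala sketch of that idea (the class and field
 names here are hypothetical, not the code from the linked PR):

     import org.apache.spark.{Partition, SparkContext, TaskContext}
     import org.apache.spark.rdd.RDD

     // One RDD partition per user-supplied (topic, partition, startingOffset, endingOffset).
     case class OffsetRangePartition(index: Int, topic: String, kafkaPartition: Int,
                                     fromOffset: Long, untilOffset: Long) extends Partition

     class KafkaOffsetRDD(sc: SparkContext, ranges: Seq[(String, Int, Long, Long)])
       extends RDD[Array[Byte]](sc, Nil) {

       override def getPartitions: Array[Partition] =
         ranges.zipWithIndex.map { case ((topic, part, from, until), i) =>
           OffsetRangePartition(i, topic, part, from, until)
         }.toArray

       override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
         val p = split.asInstanceOf[OffsetRangePartition]
         // Open a Kafka SimpleConsumer here and fetch p.topic / p.kafkaPartition
         // from p.fromOffset until p.untilOffset.  On any fetch error just throw;
         // Spark fails the task and retries it with exactly the same offsets.
         ???
       }
     }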

 The client code does any transformation it wants, then stores the data and
 offsets.  There are two ways of doing this, either based on idempotence or
 a transactional data store.

 For idempotent stores:

 1. manipulate data
 2. save data to store
 3. save ending offsets to the same store

 If you fail between 2 and 3, the offsets haven't been stored, you start
 again at the same beginning offsets, do the same calculations in the same
 order, overwrite the same data, all is good.
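
 A rough Scala sketch of that sequence (saveRow and saveOffsets are
 hypothetical helpers standing in for whatever idempotent store is used):

     import org.apache.spark.rdd.RDD

     // Results are written under deterministic keys, so replaying the same
     // offset range simply overwrites the same rows.
     def processIdempotently(rdd: RDD[String],
                             untilOffsets: Map[(String, Int), Long]): Unit = {
       val transformed = rdd.map(_.toUpperCase)   // 1. manipulate data
       transformed.foreachPartition { rows =>
         rows.foreach(saveRow)                    // 2. save data to the store (an upsert)
       }
       saveOffsets(untilOffsets)                  // 3. save ending offsets to the same store
       // Failing between 2 and 3 leaves the old offsets in place; the next run
       // starts from the same offsets, redoes the same work, overwrites the same data.
     }

     def saveRow(row: String): Unit = ???
     def saveOffsets(offsets: Map[(String, Int), Long]): Unit = ???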


 For transactional stores:

 1. manipulate data
 2. begin transaction
 3. save data to the store
 4. save offsets
 5. commit transaction

 If you fail before 5, the transaction rolls back.  To make this less
 heavyweight, you can write the data outside the transaction and then update
 a pointer to the current data inside the transaction.
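
 A sketch of the transactional variant, with plain JDBC standing in for the
 store; the table names and schema are assumptions, and step 1 (manipulating
 the data) is assumed to have happened before this is called:

     import java.sql.DriverManager

     def saveTransactionally(rows: Iterator[String],
                             offsets: Map[(String, Int), Long],
                             jdbcUrl: String): Unit = {
       val conn = DriverManager.getConnection(jdbcUrl)
       try {
         conn.setAutoCommit(false)                            // 2. begin transaction
         val insertRow = conn.prepareStatement("insert into results (value) values (?)")
         rows.foreach { r =>                                  // 3. save data to the store
           insertRow.setString(1, r)
           insertRow.executeUpdate()
         }
         val updateOffsets = conn.prepareStatement(
           "update offsets set until_offset = ? where topic = ? and partition_id = ?")
         offsets.foreach { case ((topic, part), until) =>     // 4. save offsets
           updateOffsets.setLong(1, until)
           updateOffsets.setString(2, topic)
           updateOffsets.setInt(3, part)
           updateOffsets.executeUpdate()
         }
         conn.commit()                                        // 5. commit transaction
       } catch {
         case e: Exception =>
           conn.rollback()                                    // fail before 5: everything rolls back
           throw e
       } finally {
         conn.close()
       }
     }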


 Again, spark has nothing much to do with guaranteeing exactly once.  In
 fact, the current streaming api actively impedes my ability to do the
 above.  I'm just suggesting providing an api that doesn't get in the way of
 exactly-once.





 On Fri, Dec 19, 2014 at 3:57 PM, Hari Shreedharan 
 hshreedha...@cloudera.com
  wrote:

  Can you explain your basic algorithm for the once-only delivery? It is
  quite a bit of very Kafka-specific code, which would take more time to read
  than I can currently afford. If you can explain your algorithm a bit, it
  might help.
 
  Thanks,
  Hari
 
 
  On Fri, Dec 19, 2014 at 1:48 PM, Cody Koeninger c...@koeninger.org
  wrote:
 
 
  The problems you guys are discussing come from trying to store state in
  spark, so don't do that.  Spark isn't a distributed database.

  Just map kafka partitions directly to rdds, let user code specify the
  range of offsets explicitly, and let them be in charge of committing
  offsets.
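
  As a hypothetical illustration of the calling side (reusing the KafkaOffsetRDD
  sketched above; store and commitOffsets stand in for application-owned helpers):

      def processBatch(sc: org.apache.spark.SparkContext,
                       ranges: Seq[(String, Int, Long, Long)]): Unit = {
        val rdd = new KafkaOffsetRDD(sc, ranges)
        val results = rdd.map(bytes => new String(bytes, "UTF-8")).collect()
        store(results)          // application-owned store
        commitOffsets(ranges)   // committed by the application, only after the store succeeds
      }

      def store(rows: Array[String]): Unit = ???
      def commitOffsets(ranges: Seq[(String, Int, Long, Long)]): Unit = ???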
 
  Using the simple consumer isn't that bad, I'm already using this in
  production with the code I linked to, and tresata apparently has been as
  well.  Again, for everyone saying this is impossible, have you read either
  of those implementations and looked at the approach?
 
 
 
  On Fri, Dec 19, 2014 at 2:27 PM, Sean McNamara 
  sean.mcnam...@webtrends.com wrote:
 
  Please feel free to correct me if I’m wrong, but I think the exactly once
  spark streaming semantics can easily be solved using updateStateByKey.
  Make the key going into updateStateByKey be a hash of the event, or pluck
  off some uuid from the message.  The updateFunc would only emit the message
  if the key did not exist, and the user has complete control over the window
  of time / state lifecycle for detecting duplicates.  It also makes it
  really easy to detect and take action (alert?) when you DO see a duplicate,
  or make memory tradeoffs within an error bound using a sketch algorithm.
  The kafka simple consumer is insanely complex, if possible I think it would
  be better (and vastly more flexible) to get reliability using the
  primitives that spark so elegantly provides.
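
  A rough Scala sketch of that idea (the choice of key and the state-expiry
  policy are left to the application, and updateStateByKey also needs a
  checkpoint directory set on the StreamingContext):

      import org.apache.spark.streaming.StreamingContext._   // pair DStream operations
      import org.apache.spark.streaming.dstream.DStream

      // Key each message by a uuid (or a hash of the event); a message is only
      // emitted the first time its key is seen.  State is never expired here.
      def dedupe(keyed: DStream[(String, String)]): DStream[(String, String)] = {
        val updateFunc = (newValues: Seq[String], state: Option[(String, Boolean)]) =>
          state match {
            case Some((v, _)) => Some((v, false))                         // seen before: not emitted again
            case None         => newValues.headOption.map(v => (v, true)) // first time this key appears
          }
        keyed.updateStateByKey(updateFunc)
          .filter { case (_, (_, isNewThisBatch)) => isNewThisBatch }
          .map { case (k, (v, _)) => (k, v) }
      }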
 
  Cheers,
 
  Sean
 
 
   On Dec 19, 2014, at 12:06 PM, Hari Shreedharan 
  hshreedha...@cloudera.com wrote:
  
   Hi Dibyendu,

   Thanks for the details on the implementation. But I still do not believe
   that it gives no duplicates - what they achieve is that the same batch is
   processed exactly the same way every time (but note it may be processed more
   than once) - so it depends on the operation being idempotent. I believe
   Trident uses ZK to keep track of the transactions - a batch can be
   processed multiple times in 

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's
hard/tricky to program Spark. I don't know if there is already such a
tool.

Thanks!




Re: Starting with Spark

2014-12-25 Thread Naveen Madhire
Thanks. I will work on this today and try to setup.

The bad link is present in the below github README file,

https://github.com/apache/spark

Search for Build Spark with Maven

On Thu, Dec 25, 2014 at 1:49 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 The correct docs link is:
 https://spark.apache.org/docs/1.2.0/building-spark.html

 Where did you get that bad link from?

 Nick



 On Thu Dec 25 2014 at 12:00:53 AM Naveen Madhire vmadh...@umail.iu.edu
 wrote:

 Hi All,

 I am starting to use Spark. I am having trouble getting the latest code
 from git.
 I am using Intellij as suggested in the below link,

 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#
 ContributingtoSpark-StarterTasks


 The below link isn't working as well,

 http://spark.apache.org/building-spark.html


 Does anyone know any useful links to get Spark running on a local laptop?

 Please help.


 Thanks




Re: Apache Spark (Data Aggregation) using Java API

2014-12-25 Thread slcclimber
You could convert your csv file to an rdd of vectors.
Then use stats from mllib.
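
For example, a minimal Scala sketch (assuming a header-less, all-numeric CSV;
adjust the parsing to your schema):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    def summarize(sc: SparkContext, path: String): Unit = {
      val vectors = sc.textFile(path)
        .map(line => Vectors.dense(line.split(",").map(_.trim.toDouble)))
      val summary = Statistics.colStats(vectors)   // column-wise summary statistics
      println(summary.mean)       // per-column means
      println(summary.variance)   // per-column variances
      println(summary.max)        // per-column maxima
    }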

Also this should be in the user list not the developer list.






Re: Starting with Spark

2014-12-25 Thread Nicholas Chammas
Thanks for the pointer. This will be fixed in this PR
https://github.com/apache/spark/pull/3802.

On Thu Dec 25 2014 at 10:35:20 AM Naveen Madhire vmadh...@umail.iu.edu
wrote:

 Thanks. I will work on this today and try to setup.

 The bad link is present in the below github README file,

 https://github.com/apache/spark

 Search for Build Spark with Maven

 On Thu, Dec 25, 2014 at 1:49 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 The correct docs link is:
 https://spark.apache.org/docs/1.2.0/building-spark.html

 Where did you get that bad link from?

 Nick



 On Thu Dec 25 2014 at 12:00:53 AM Naveen Madhire vmadh...@umail.iu.edu
 wrote:

 Hi All,

 I am starting to use Spark. I am having trouble getting the latest code
 from git.
 I am using Intellij as suggested in the below link,

 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#
 ContributingtoSpark-StarterTasks


 The below link isn't working as well,

 http://spark.apache.org/building-spark.html


  Does anyone know any useful links to get Spark running on a local laptop?

 Please help.


 Thanks





Re: Starting with Spark

2014-12-25 Thread Denny Lee
Thanks for the catch Naveen!

On Thu Dec 25 2014 at 10:47:18 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thanks for the pointer. This will be fixed in this PR
 https://github.com/apache/spark/pull/3802.

 On Thu Dec 25 2014 at 10:35:20 AM Naveen Madhire vmadh...@umail.iu.edu
 wrote:

  Thanks. I will work on this today and try to setup.
 
  The bad link is present in the below github README file,
 
  https://github.com/apache/spark
 
  Search for Build Spark with Maven
 
  On Thu, Dec 25, 2014 at 1:49 AM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  The correct docs link is:
  https://spark.apache.org/docs/1.2.0/building-spark.html
 
  Where did you get that bad link from?
 
  Nick
 
 
 
  On Thu Dec 25 2014 at 12:00:53 AM Naveen Madhire vmadh...@umail.iu.edu
 
  wrote:
 
  Hi All,
 
  I am starting to use Spark. I am having trouble getting the latest code
  from git.
  I am using Intellij as suggested in the below link,
 
  https://cwiki.apache.org/confluence/display/SPARK/
 Contributing+to+Spark#
  ContributingtoSpark-StarterTasks
 
 
  The below link isn't working as well,
 
  http://spark.apache.org/building-spark.html
 
 
  Does anyone know any useful links to get Spark running on a local laptop?
 
  Please help.
 
 
  Thanks