Re: Directly broadcasting (sort of) RDDs

2015-03-20 Thread Nick Pentreath
There is block matrix in Spark 1.3 -
http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix

However I believe it only supports dense matrix blocks.

Still, it might be possible to use it or extend it.
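
For reference, a minimal sketch of wrapping per-block RDDs into a BlockMatrix
with the 1.3 mllib.linalg.distributed API (the 1024 x 1024 block size is just a
placeholder for this sketch):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

def toBlockMatrix(blocks: RDD[((Int, Int), Matrix)]): BlockMatrix = {
  // blocks are keyed by (blockRowIndex, blockColIndex)
  val mat = new BlockMatrix(blocks, 1024, 1024)   // 1024 x 1024 is an arbitrary block size
  mat.validate()                                  // optional sanity check of the block layout
  mat
}

// e.g. toBlockMatrix(aBlocks).multiply(toBlockMatrix(bBlocks))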




JIRAs:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3434

Was based on:
https://github.com/amplab/ml-matrix

Another lib:
https://github.com/PasaLab/marlin/blob/master/README.md







—
Sent from Mailbox

On Sat, Mar 21, 2015 at 12:24 AM, Guillaume Pitel
 wrote:

> Hi,
> I have an idea that I would like to discuss with the Spark devs. The 
> idea comes from a very real problem that I have struggled with for
> almost a year. My problem is very simple, it's a dense matrix * sparse 
> matrix operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is
> divided into X large blocks (one block per partition), and a sparse matrix
> RDD[((Int,Int),Array[Array[(Int,Float)]])], divided into X * Y blocks. The
> most efficient way to perform the operation is to collectAsMap() the 
> dense matrix and broadcast it, then perform the block-local 
> multiplications, and combine the results by column.
> This is quite fine, unless the matrix is too big to fit in memory 
> (especially since the multiplication is performed several times 
> iteratively, and the broadcasts are not always cleaned from memory as I 
> would naively expect).
> When the dense matrix is too big, a second solution is to split the big 
> sparse matrix into several RDDs, and do several broadcasts. Doing this
> creates quite a big overhead, but it mostly works, even though I often
> face some problems with inaccessible broadcast files, for instance.
> Then there is the terrible but apparently very effective good old join. 
> Since X blocks of the sparse matrix use the same block from the dense 
> matrix, I suspect that the dense matrix is somehow replicated X times 
> (either on disk or in the network), which is the reason why the join 
> takes so much time.
> After this bit of context, here is my idea: would it be possible to
> somehow "broadcast" (or maybe more accurately, share or serve) a 
> persisted RDD which is distributed on all workers, in a way that would, 
> a bit like the IndexedRDD, allow a task to access a partition or an 
> element of a partition in the closure, with a worker-local memory cache,
> i.e. the information about where each block resides would be
> distributed on the workers, to allow them to access parts of the RDD
> directly. I think that's already a bit how RDDs are shuffled?
> The RDD could stay distributed (no need to collect then broadcast), and 
> only necessary transfers would be required.
> Is this a bad idea, is it already implemented somewhere (I would love it 
> !)? Or is it something that could add efficiency not only for my use
> case, but maybe for others? Could someone give me some hints about how I
> could add this possibility to Spark? I would probably try to extend an
> RDD into a specific SharedIndexedRDD with a special lookup that would be
> allowed from tasks as a special case, and that would try to contact the 
> blockManager and reach the corresponding data from the right worker.
> Thanks in advance for your advice
> Guillaume
> -- 
> eXenSa
>   
> *Guillaume PITEL, Président*
> +33(0)626 222 431
> eXenSa S.A.S. 
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)184 163 677 / Fax +33(0)972 283 705

Filesystem closed Exception

2015-03-20 Thread Sea
Hi, all:




When I exit the spark-sql console, the following exception is thrown:


My Spark version is 1.3.0 and my Hadoop version is 2.2.0.


Exception in thread "Thread-3" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1677)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:196)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-20 Thread Burak Yavuz
Hi,

We plan to add a more comprehensive local linear algebra package for MLlib
1.4. This local linear algebra package can then easily be extended to
BlockMatrix to support the same operations in a distributed fashion.

You may find the JIRA to track this here: SPARK-6442


The design doc is here: http://goo.gl/sf5LCE

We would very much appreciate your feedback and input.

Best,
Burak

On Thu, Mar 19, 2015 at 3:06 PM, Debasish Das 
wrote:

> Yeah it will be better if we consolidate the development on one of
> them...either Breeze or mllib.BLAS...
>
> On Thu, Mar 19, 2015 at 2:25 PM, Ulanov, Alexander <
> alexander.ula...@hp.com>
> wrote:
>
> >  Thanks for quick response.
> >
> >  I can use linalg.BLAS.gemm, and this means that I have to use MLlib
> > Matrix. The latter does not support some useful functionality needed for
> > optimization. For example, creation of a Matrix given a matrix size, an
> > array, and an offset into that array. This means that I will need to create
> > the matrix in Breeze and convert it to MLlib. Also, linalg.BLAS misses some
> > useful BLAS functions I need that can be found in Breeze (and netlib-java).
> > The same
> > concerns are applicable to MLlib Vector.
> >
> > Best regards, Alexander
> >
> > On 19.03.2015, at 14:16, "Debasish Das" wrote:
> >
> >   I think for Breeze we are focused on dot and dgemv right now (along
> > with several other matrix vector style operations)...
> >
> >  For dgemm it is tricky since you need to add dgemm for both
> > DenseMatrix and CSCMatrix...and for CSCMatrix you need to get something
> > like SuiteSparse which is under lgpl...so we have to think more on it..
> >
> >  For now can't you use dgemm directly from mllib.linalg.BLAS ? It's in
> > master...
> >
> >
> > On Thu, Mar 19, 2015 at 1:49 PM, Ulanov, Alexander <
> > alexander.ula...@hp.com> wrote:
> >
> >>  Thank you! When do you expect to have gemm in Breeze and that version
> >> of Breeze to ship with MLlib?
> >>
> >>  Also, could someone please elaborate on the linalg.BLAS and Matrix? Are
> >> they going to be developed further, and should all developers use them in
> >> the long term?
> >>
> >> Best regards, Alexander
> >>
> >> On 18.03.2015, at 23:21, "Debasish Das" wrote:
> >>
> >> dgemm, dgemv and dot come to Breeze and Spark through netlib-java
> >>
> >>  Right now, in both dot and dgemv, Breeze does an extra memory allocation,
> >> but we already found the issue and we are working on adding a common trait
> >> that will provide a sink operation (basically memory will be allocated by
> >> the user)...adding more BLAS operators in Breeze will also help in general,
> >> as a lot more operations are defined over there...
> >>
> >>
> >> On Wed, Mar 18, 2015 at 8:09 PM, Ulanov, Alexander <
> >> alexander.ula...@hp.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Currently I am using Breeze within Spark MLlib for linear algebra. I
> >>> would like to reuse previously allocated matrices for storing the result
> >>> of matrix multiplication, i.e. I need to use the "gemm" function
> >>> C := q*A*B + p*C, which is missing in Breeze (Breeze automatically
> >>> allocates a new matrix to store the result of the multiplication). Also, I
> >>> would like to minimize the gemm calls that Breeze does. Should I use
> >>> mllib.linalg.BLAS functions instead? While it has gemm and axpy, it has a
> >>> rather limited number of operations. For example, I need the sum of the
> >>> matrix by row or by column, or applying a function to all elements of a
> >>> matrix. Also, the MLlib Vector and Matrix interfaces that linalg.BLAS
> >>> operates on seem to be rather undeveloped. Should I use plain netlib-java
> >>> instead (will it remain in MLlib in future releases)?
> >>>
> >>> Best regards, Alexander
> >>>
> >>
> >>
> >
>
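
To make the gemm discussion quoted above concrete, here is a hedged sketch of
calling dgemm straight through netlib-java so that a preallocated C is reused
(column-major layout, no transposition and the helper name are assumptions of
the sketch, not an MLlib API):

import com.github.fommil.netlib.BLAS.{getInstance => blas}

// C := alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n),
// all stored column-major; C is reused rather than reallocated.
def gemm(m: Int, n: Int, k: Int,
         alpha: Double, a: Array[Double], b: Array[Double],
         beta: Double, c: Array[Double]): Unit = {
  blas.dgemm("N", "N", m, n, k, alpha, a, m, b, k, beta, c, m)
}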


Re: Error: 'SparkContext' object has no attribute 'getActiveStageIds'

2015-03-20 Thread Ted Yu
Please take a look
at core/src/main/scala/org/apache/spark/SparkStatusTracker.scala, around
line 58:
  def getActiveStageIds(): Array[Int] = {
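
A minimal way to reach it from user code is through the status tracker (a
sketch, assuming a live SparkContext named sc):

// getActiveStageIds lives on SparkStatusTracker, not on SparkContext itself
val activeStages: Array[Int] = sc.statusTracker.getActiveStageIds()
activeStages.foreach(id => println(s"active stage: $id"))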

Cheers

On Fri, Mar 20, 2015 at 3:59 PM, xing  wrote:

> getStageInfo (called via self._jtracker.getStageInfo below) seems not to be
> implemented/included in the current Python library.
>
>     def getStageInfo(self, stageId):
>         """
>         Returns a :class:`SparkStageInfo` object, or None if the stage
>         info could not be found or was garbage collected.
>         """
>         stage = self._jtracker.getStageInfo(stageId)
>         if stage is not None:
>             # TODO: fetch them in batch for better performance
>             attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
>             return SparkStageInfo(stageId, *attrs)
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Error-SparkContext-object-has-no-attribute-getActiveStageIds-tp11136p11140.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Minor Edit in the programming guide

2015-03-20 Thread Joseph Bradley
Thanks!  I added it to a few other items:
https://issues.apache.org/jira/browse/SPARK-6337

On Fri, Mar 20, 2015 at 5:53 PM, Muttineni, Vinay 
wrote:

> Hey guys,
> In the Spark 1.3.0 documentation provided here,
> http://spark.apache.org/docs/latest/sql-programming-guide.html ,
> Under the "Programmatically Specifying the Schema" section , it's
> mentioned that SQL data types are in the following package
> org.apache.spark.sql, but I guess it has changed to
> org.apache.spark.sql.types
> Great work with the Data Frames btw! :)
> Thanks,
> Vinay
>
>
>


Re: Error: 'SparkContext' object has no attribute 'getActiveStageIds'

2015-03-20 Thread xing
getStageInfo (called via self._jtracker.getStageInfo below) seems not to be
implemented/included in the current Python library.

    def getStageInfo(self, stageId):
        """
        Returns a :class:`SparkStageInfo` object, or None if the stage
        info could not be found or was garbage collected.
        """
        stage = self._jtracker.getStageInfo(stageId)
        if stage is not None:
            # TODO: fetch them in batch for better performance
            attrs = [getattr(stage, f)() for f in SparkStageInfo._fields[1:]]
            return SparkStageInfo(stageId, *attrs)



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Error-SparkContext-object-has-no-attribute-getActiveStageIds-tp11136p11140.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Directly broadcasting (sort of) RDDs

2015-03-20 Thread Guillaume Pitel

Hi,

I have an idea that I would like to discuss with the Spark devs. The 
idea comes from a very real problem that I have struggled with for
almost a year. My problem is very simple, it's a dense matrix * sparse 
matrix operation. I have a dense matrix RDD[(Int,FloatMatrix)] which is
divided into X large blocks (one block per partition), and a sparse matrix
RDD[((Int,Int),Array[Array[(Int,Float)]])], divided into X * Y blocks. The
most efficient way to perform the operation is to collectAsMap() the 
dense matrix and broadcast it, then perform the block-local 
multiplications, and combine the results by column.
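
For reference, a rough sketch of that broadcast approach (the jblas FloatMatrix
block type and the combine step are assumptions of the sketch, not the exact
code I run):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.jblas.FloatMatrix

// denseBlocks:  one dense block per partition, keyed by block-row index i
// sparseBlocks: keyed by (i, j); each block stores its columns as arrays of
//               (rowIndexInsideBlock, value) pairs
def multiply(sc: SparkContext,
             denseBlocks: RDD[(Int, FloatMatrix)],
             sparseBlocks: RDD[((Int, Int), Array[Array[(Int, Float)]])])
  : RDD[(Int, FloatMatrix)] = {
  val dense = sc.broadcast(denseBlocks.collectAsMap())   // must fit on every executor

  sparseBlocks.map { case ((i, j), cols) =>
    val left = dense.value(i)                            // dense block for this block-row
    val out = new FloatMatrix(left.rows, cols.length)
    var c = 0
    while (c < cols.length) {                            // block-local dense * sparse product
      cols(c).foreach { case (r, v) =>
        var k = 0
        while (k < left.rows) {                          // out(:, c) += v * left(:, r)
          out.put(k, c, out.get(k, c) + v * left.get(k, r))
          k += 1
        }
      }
      c += 1
    }
    (j, out)
  }.reduceByKey(_.addi(_))                               // combine partial results by block-column
}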


This is quite fine, unless the matrix is too big to fit in memory 
(especially since the multiplication is performed several times 
iteratively, and the broadcasts are not always cleaned from memory as I 
would naively expect).


When the dense matrix is too big, a second solution is to split the big 
sparse matrix into several RDDs, and do several broadcasts. Doing this
creates quite a big overhead, but it mostly works, even though I often
face some problems with inaccessible broadcast files, for instance.


Then there is the terrible but apparently very effective good old join. 
Since X blocks of the sparse matrix use the same block from the dense 
matrix, I suspect that the dense matrix is somehow replicated X times 
(either on disk or in the network), which is the reason why the join 
takes so much time.
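
Roughly, the join variant looks like this (again only a sketch; multiplyBlock
stands for the same block-local product as in the sketch above):

import org.apache.spark.rdd.RDD
import org.jblas.FloatMatrix

// Re-key the sparse blocks by block-row so they can be joined with the matching
// dense block; the dense block then travels once per sparse block that needs it,
// which is the X-fold replication suspected above.
def multiplyViaJoin(denseBlocks: RDD[(Int, FloatMatrix)],
                    sparseBlocks: RDD[((Int, Int), Array[Array[(Int, Float)]])],
                    multiplyBlock: (FloatMatrix, Array[Array[(Int, Float)]]) => FloatMatrix)
  : RDD[(Int, FloatMatrix)] = {
  sparseBlocks
    .map { case ((i, j), block) => (i, (j, block)) }
    .join(denseBlocks)
    .map { case (_, ((j, sparse), dense)) => (j, multiplyBlock(dense, sparse)) }
    .reduceByKey(_.addi(_))
}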


After this bit of context, here is my idea: would it be possible to
somehow "broadcast" (or maybe more accurately, share or serve) a 
persisted RDD which is distributed on all workers, in a way that would, 
a bit like the IndexedRDD, allow a task to access a partition or an 
element of a partition in the closure, with a worker-local memory cache,
i.e. the information about where each block resides would be
distributed on the workers, to allow them to access parts of the RDD
directly. I think that's already a bit how RDDs are shuffled?


The RDD could stay distributed (no need to collect then broadcast), and 
only necessary transfers would be required.


Is this a bad idea, is it already implemented somewhere (I would love it 
!)? Or is it something that could add efficiency not only for my use
case, but maybe for others? Could someone give me some hints about how I
could add this possibility to Spark? I would probably try to extend an
RDD into a specific SharedIndexedRDD with a special lookup that would be
allowed from tasks as a special case, and that would try to contact the 
blockManager and reach the corresponding data from the right worker.
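
Purely as an illustration of the API I have in mind (every name below is
hypothetical, nothing here exists in Spark today):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partition, TaskContext}

// Hypothetical wrapper: same data and partitioning as the parent RDD, plus a
// task-side lookup that would resolve the partition holding `key`, fetch that
// block from the owning worker's BlockManager and cache it locally.
class SharedIndexedRDD[V: ClassTag](prev: RDD[(Int, V)]) extends RDD[(Int, V)](prev) {
  override protected def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[(Int, V)] =
    prev.iterator(split, context)

  // Would be callable from inside tasks, unlike RDD.lookup
  def sharedLookup(key: Int): Option[V] = ???   // not implemented, just the shape
}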


Thanks in advance for your advice

Guillaume
--
eXenSa


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. 
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Minor Edit in the programming guide

2015-03-20 Thread Muttineni, Vinay
Hey guys,
In the Spark 1.3.0 documentation provided here, 
http://spark.apache.org/docs/latest/sql-programming-guide.html ,
Under the "Programmatically Specifying the Schema" section , it's mentioned 
that SQL data types are in the following package org.apache.spark.sql, but I 
guess it has changed to org.apache.spark.sql.types
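
For anyone updating code along with the docs, a small sketch of the
programmatic schema with the new package (the column name is made up):

import org.apache.spark.sql.types.{StructField, StringType, StructType}

// The data types now live in org.apache.spark.sql.types, not org.apache.spark.sql
val schema = StructType(Array(
  StructField("name", StringType, nullable = true)
))
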
Great work with the Data Frames btw! :)
Thanks,
Vinay




Re: Spark SQL ExternalSorter not stopped

2015-03-20 Thread Yin Huai
Hi Michael,

Thanks for reporting it. Yes, it is a bug. I have created
https://issues.apache.org/jira/browse/SPARK-6437 to track it.

Thanks,

Yin

On Thu, Mar 19, 2015 at 10:51 AM, Michael Allman 
wrote:

> I've examined the experimental support for ExternalSorter in Spark SQL,
> and it does not appear that the external sorter is ever stopped
> (ExternalSorter.stop). According to the API documentation, this suggests a
> resource leak. Before I file a bug report in Jira, can someone familiar
> with the codebase confirm this is indeed a bug?
>
> Thanks,
>
> Michael
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Storage of RDDs created via sc.parallelize

2015-03-20 Thread Karlson


Hi all,

where is the data stored that is passed to sc.parallelize? Or put 
differently, where is the data for the base RDD fetched from when the 
DAG is executed, if the base RDD is constructed via sc.parallelize?


I am reading a csv file via the Python csv module and am feeding the 
parsed data chunkwise to sc.parallelize, because the whole file would 
not fit into memory on the driver. Reading the file with sc.textFile 
first is not an option, as there might be linebreaks inside the csv 
fields, preventing me from parsing the file line by line.


The problem I am facing right now is that even though I am feeding only 
one chunk at a time to Spark, I will eventually run out of memory on the 
driver.
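
To make the pattern concrete, here is a rough Scala sketch of the
chunk-by-chunk approach I am describing (the chunk size and the pre-parsed row
iterator are placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Parse on the driver in fixed-size chunks and union the pieces. Each
// sc.parallelize call keeps its chunk referenced in the driver-side RDD,
// which is why driver memory still grows with the number of chunks.
def parallelizeInChunks(sc: SparkContext,
                        rows: Iterator[Array[String]],
                        chunkSize: Int = 10000): RDD[Array[String]] = {
  rows.grouped(chunkSize)
      .map(chunk => sc.parallelize(chunk))
      .reduce(_ union _)
}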


Thanks in advance!

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Exception using the new createDirectStream util method

2015-03-20 Thread Cody Koeninger
I went ahead and created

https://issues.apache.org/jira/browse/SPARK-6434

to track this

On Fri, Mar 20, 2015 at 2:44 AM, Alberto Rodriguez 
wrote:

> You were absolutely right Cody!! I have just put a message in the kafka
> topic before creating the DirectStream and now it is working fine!
>
> Do you think that I should open an issue to warn that the kafka topic must
> contain at least one message before the DirectStream creation?
>
> Thank you very much! You've just made my day ;)
>
> 2015-03-19 23:08 GMT+01:00 Cody Koeninger :
>
> > Yeah, I wouldn't be shocked if Kafka's metadata apis didn't return results
> > for topics that don't have any messages. (sorry about the triple negative,
> > but I think you get my meaning).
> >
> > Try putting a message in the topic and seeing what happens.
> >
> > On Thu, Mar 19, 2015 at 4:38 PM, Alberto Rodriguez 
> > wrote:
> >
> >> Thank you for replying,
> >>
> >> Ted, I have been debugging and the getLeaderOffsets method is not appending
> >> errors, because the method findLeaders that is called at the first line of
> >> getLeaderOffsets is not returning leaders.
> >>
> >> Cody, the topics do not have any messages yet. Could this be an issue??
> >>
> >> If you guys want to have a look at the code I've just uploaded it to my
> >> github account: big-brother (see
> >> DirectKafkaWordCountTest.scala).
> >>
> >> Thank you again!!
> >>
> >> 2015-03-19 22:13 GMT+01:00 Cody Koeninger :
> >>
> >> > What is the value of your topics variable, and does it correspond to
> >> > topics that already exist on the cluster and have messages in them?
> >> >
> >> > On Thu, Mar 19, 2015 at 3:10 PM, Ted Yu  wrote:
> >> >
> >> >> Looking at KafkaCluster#getLeaderOffsets():
> >> >>
> >> >>   respMap.get(tp).foreach { por: PartitionOffsetsResponse =>
> >> >>     if (por.error == ErrorMapping.NoError) {
> >> >>       ...
> >> >>     } else {
> >> >>       errs.append(ErrorMapping.exceptionFor(por.error))
> >> >>     }
> >> >>
> >> >> There should be some error other than "Couldn't find leader offsets for
> >> >> Set()"
> >> >>
> >> >> Can you check again ?
> >> >>
> >> >> Thanks
> >> >>
> >> >> On Thu, Mar 19, 2015 at 12:10 PM, Alberto Rodriguez <
> ardl...@gmail.com
> >> >
> >> >> wrote:
> >> >>
> >> >> > Hi all,
> >> >> >
> >> >> > I am trying to make the new kafka and spark streaming integration work
> >> >> > (direct approach "no receivers",
> >> >> > http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html).
> >> >> > I have created a unit test where I configure and start both zookeeper
> >> >> > and kafka.
> >> >> >
> >> >> > When I try to create the InputDStream using the createDirectStream
> >> >> > method of the KafkaUtils class I am getting the following error:
> >> >> >
> >> >> > org.apache.spark.SparkException: *Couldn't find leader offsets for Set()*
> >> >> > org.apache.spark.SparkException: org.apache.spark.SparkException:
> >> >> > Couldn't find leader offsets for Set()
> >> >> > at
> >> >> > org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
> >> >> >
> >> >> > Following is the code that tries to create the DStream:
> >> >> >
> >> >> > val messages: InputDStream[(String, String)] =
> >> >> > KafkaUtils.createDirectStream[String, String, StringDecoder,
> >> >> > StringDecoder](
> >> >> > ssc, kafkaParams, topics)
> >> >> >
> >> >> > Has anyone faced this problem?
> >> >> >
> >> >> > Thank you in advance.
> >> >> >
> >> >> > Kind regards,
> >> >> >
> >> >> > Alberto
> >> >> >
> >> >>
> >> >
> >> >
> >>
> >
> >
>


Re: Re: Contributing to Spark

2015-03-20 Thread vikas.v.i...@gmail.com
Jessica, thanks for the links. I am aware of these, but am looking for some
ML-related JIRA issues which I can contribute to as a starting point.

Thanks,
Vikas
On Mar 20, 2015 2:12 PM, "Tanyinyan"  wrote:

> Hello Vikas,
>
> These two links may be what you want.
>
> jira:
> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel
>
> pull request:  https://github.com/apache/spark/pulls
>
> Regards,
>
> Jessica
>
>
> -Original Message-
> From: vikas.v.i...@gmail.com [mailto:vikas.v.i...@gmail.com]
> Sent: March 20, 2015 15:50
> To: dev@spark.apache.org
> Subject: Contributing to Spark
>
> Hi ,
>
> I have read and gone through most Spark tutorials and materials out there.
> I have also downloaded and built the Spark code base.
>
> Can someone point me to some existing JIRA where I can start contributing?
> Eventually I want to make some good contributions to MLlib.
>
> Thanks,
> Vikas
>


Connecting a worker to the master after a spark context is made

2015-03-20 Thread Niranda Perera
Hi,

Please consider the following scenario.

I've started the spark master by invoking
the org.apache.spark.deploy.master.Master.startSystemAndActor method in a
Java code and connected a worker to it using
the org.apache.spark.deploy.worker.Worker.startSystemAndActor method, and
then I have successfully created Java Spark and SQL contexts and performed
SQL queries.

My question is, can I change this order?
Can I start the master first, then create a spark context... and later on
connect a worker to the master?

While trying out this scenario, I have successfully started the master.
Please see the screenshot here.



But when I create a Spark context, it terminates automatically. Is it
because the master is not connected to a worker?
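
For reference, this is roughly how I create the context against the
already-running master (host, port and app name below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("worker-after-context-test")      // placeholder app name
  .setMaster("spark://localhost:7077")          // placeholder standalone master URL
val sc = new SparkContext(conf)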

cheers


-- 
Niranda
​


Re: Contributing to Spark

2015-03-20 Thread Tanyinyan
Hello Vikas,

These two links may be what you want.

jira:  
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

pull request:  https://github.com/apache/spark/pulls

Regards,

Jessica


-Original Message-
From: vikas.v.i...@gmail.com [mailto:vikas.v.i...@gmail.com]
Sent: March 20, 2015 15:50
To: dev@spark.apache.org
Subject: Contributing to Spark

Hi ,

I have read and gone through most Spark tutorials and materials out there.
I have also downloaded and built the Spark code base.

Can someone point me to some existing JIRA where I can start contributing?
Eventually I want to make some good contributions to MLlib.

Thanks,
Vikas

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Contributing to Spark

2015-03-20 Thread vikas.v.i...@gmail.com
Hi ,

I have read and gone through most Spark tutorials and materials out there.
I have also downloaded and built the Spark code base.

Can someone point me to some existing JIRA where I can start contributing?
Eventually I want to make some good contributions to MLlib.

Thanks,
Vikas


Re: Exception using the new createDirectStream util method

2015-03-20 Thread Alberto Rodriguez
You were absolutely right Cody!! I have just put a message in the kafka
topic before creating the DirectStream and now it is working fine!

Do you think that I should open an issue to warn that the kafka topic must
contain at least one message before the DirectStream creation?

Thank you very much! You've just made my day ;)
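
For anyone hitting the same thing, a rough sketch of what I did: seed the topic
with a single message before building the direct stream (broker address and
topic name are placeholders; this uses the old Kafka 0.8 producer API):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:9092")          // placeholder broker
props.put("serializer.class", "kafka.serializer.StringEncoder")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("my-topic", "seed"))  // placeholder topic
producer.close()

// ...and only then:
// val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
//   ssc, kafkaParams, Set("my-topic"))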

2015-03-19 23:08 GMT+01:00 Cody Koeninger :

> Yeah, I wouldn't be shocked if Kafka's metadata apis didn't return results
> for topics that don't have any messages.  (sorry about the triple negative,
> but I think you get my meaning).
>
> Try putting a message in the topic and seeing what happens.
>
> On Thu, Mar 19, 2015 at 4:38 PM, Alberto Rodriguez 
> wrote:
>
>> Thank you for replying,
>>
>> Ted, I have been debugging and the getLeaderOffsets method is not appending
>> errors because the method findLeaders that is called at the first line of
>> getLeaderOffsets is not returning leaders.
>>
>> Cody, the topics do not have any messages yet. Could this be an issue??
>>
>> If you guys want to have a look at the code I've just uploaded it to my
>> github account: big-brother (see
>> DirectKafkaWordCountTest.scala).
>>
>> Thank you again!!
>>
>> 2015-03-19 22:13 GMT+01:00 Cody Koeninger :
>>
>> > What is the value of your topics variable, and does it correspond to
>> > topics that already exist on the cluster and have messages in them?
>> >
>> > On Thu, Mar 19, 2015 at 3:10 PM, Ted Yu  wrote:
>> >
>> >> Looking at KafkaCluster#getLeaderOffsets():
>> >>
>> >>   respMap.get(tp).foreach { por: PartitionOffsetsResponse =>
>> >>     if (por.error == ErrorMapping.NoError) {
>> >>       ...
>> >>     } else {
>> >>       errs.append(ErrorMapping.exceptionFor(por.error))
>> >>     }
>> >>
>> >> There should be some error other than "Couldn't find leader offsets for
>> >> Set()"
>> >>
>> >> Can you check again ?
>> >>
>> >> Thanks
>> >>
>> >> On Thu, Mar 19, 2015 at 12:10 PM, Alberto Rodriguez > >
>> >> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I am trying to make the new kafka and spark streaming integration work
>> >> > (direct approach "no receivers"). I have created a unit test where I
>> >> > configure and start both zookeeper and kafka.
>> >> >
>> >> > When I try to create the InputDStream using the createDirectStream
>> >> > method of the KafkaUtils class I am getting the following error:
>> >> >
>> >> > org.apache.spark.SparkException: *Couldn't find leader offsets for Set()*
>> >> > org.apache.spark.SparkException: org.apache.spark.SparkException:
>> >> > Couldn't find leader offsets for Set()
>> >> > at
>> >> > org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
>> >> >
>> >> > Following is the code that tries to create the DStream:
>> >> >
>> >> > val messages: InputDStream[(String, String)] =
>> >> > KafkaUtils.createDirectStream[String, String, StringDecoder,
>> >> > StringDecoder](
>> >> > ssc, kafkaParams, topics)
>> >> >
>> >> > Has anyone faced this problem?
>> >> >
>> >> > Thank you in advance.
>> >> >
>> >> > Kind regards,
>> >> >
>> >> > Alberto
>> >> >
>> >>
>> >
>> >
>>
>
>