Contribution to Apache Spark

2016-09-03 Thread aditya1702
Hello,
I am Aditya Vyas and I am currently in my third year of college, pursuing a BTech
in engineering. I know Python and a little bit of Java. I want to start
contributing to Apache Spark. This is my first time in the field of Big
Data. Can someone please help me with how to get started? Which resources
should I look at?






Re: Is Spark's KMeans unable to handle bigdata?

2016-09-03 Thread Georgios Samaras
Thank you very much, Sean! If you would like, this could serve as an answer
to the StackOverflow question:
[Is Spark's kMeans unable to handle bigdata?](
http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata
).

Enjoy your weekend,
George

On Sat, Sep 3, 2016 at 1:22 AM, Sean Owen  wrote:

> I opened https://issues.apache.org/jira/browse/SPARK-17389 to track
> some improvements, but by far the biggest one is that the number of init
> steps defaults to 5, when the paper says that 2 is pretty much optimal here.
> It's much faster with that setting.
>
> On Fri, Sep 2, 2016 at 6:45 PM, Georgios Samaras wrote:
> > I am not using the "runs" parameter anyway, but I see your point. If you
> > could point out any modifications in the minimal example I posted, I
> would
> > be more than interested to try them!
> >
>
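
For anyone landing on this thread later, here is a minimal sketch of what
lowering that setting looks like with the DataFrame-based API in Spark 2.0,
where the parameter is exposed as setInitSteps (the input path and the value
of k are placeholders):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

object KMeansInitStepsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kmeans-init-steps").getOrCreate()

    // Placeholder input: any DataFrame with a Vector column named "features".
    val data = spark.read.format("libsvm").load("path/to/features.libsvm")

    val kmeans = new KMeans()
      .setK(100)
      .setInitMode("k-means||")  // the default initialization mode
      .setInitSteps(2)           // down from the default of 5; see SPARK-17389
      .setSeed(42L)

    val model = kmeans.fit(data)
    println(s"Within-set sum of squared errors: ${model.computeCost(data)}")

    spark.stop()
  }
}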


Re: Committing Kafka offsets when using DirectKafkaInputDStream

2016-09-03 Thread Cody Koeninger
The Kafka commit API isn't transactional; you aren't going to get
exactly-once behavior out of it even if you were committing offsets on
a per-partition basis. This doesn't really have anything to do with
Spark; the old code you posted was already inherently broken.

Make your outputs idempotent and use commitAsync.
Or store offsets transactionally in your own data store.
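
A rough sketch of the commitAsync pattern with the kafka-0-10 integration
follows (the broker address, group id, and topic name are placeholders). Note
that the commit still happens per batch, after the output action, which is why
the outputs themselves need to be idempotent:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object CommitAsyncExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("commit-async"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",               // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)     // we commit ourselves
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Offset ranges must be captured before any shuffle/repartition.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        // Idempotent write of this partition's records goes here.
        records.foreach(_ => ())
      }

      // Asynchronously commit the whole batch's offsets back to Kafka.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}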



On Fri, Sep 2, 2016 at 5:50 PM, vonnagy  wrote:
> I have upgraded to Spark 2.0 and am experimenting with using Kafka 0.10.0. I
> have a stream from which I extract the data, and I would like to update the Kafka
> offsets as each partition is handled. With Spark 1.6 or Spark 2.0 and Kafka
> 0.8.2, I was able to update the offsets, but now there seems to be no way to do so.
> Here is an example:
>
> val stream = getStream
>
> stream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>
>   rdd.foreachPartition { events =>
>     val partId = TaskContext.get.partitionId
>     val offsets = offsetRanges(partId)
>
>     // Do something with the data
>
>     // Then update the offsets for this partition here, so that at most
>     // this partition's data would be duplicated on failure
>   }
> }
>
> With the new stream, I could call `commitAsync` with the offsets, but the
> drawback here is that it would only update the offsets after the entire RDD
> is handled. This can be a real issue for near "exactly once".
>
> With the new logic, each partition has a Kafka consumer associated with it;
> however, there is no access to it. I have looked at the
> CachedKafkaConsumer classes, and there is no way to get at the cache either,
> so I can't call a commit on the offsets.
>
> Beyond that I have tried to use the new Kafka 0.10 APIs, but always run into
> errors as it requires one to subscribe to the topic and get assigned
> partitions. I only want to update the offsets in Kafka.
>
> Any ideas would be helpful on how I might work with the Kafka API to set the
> offsets, or on getting Spark to add logic that allows committing offsets on a
> per-partition basis.
>
> Thanks,
>
> Ivan



Subscription

2016-09-03 Thread Omkar Reddy
Subscribe me!


Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Kapil Malik
Hi all,

I have a Spark SQL 1.6 application in production which does the following on
executing sqlContext.sql(...):
1. Identify the table name mentioned in the query
2. Use an external database to decide where the data is located and in which
format (Parquet, CSV, or JDBC), etc.
3. Load the DataFrame
4. Register it as a temp table (for future calls to this table)

This is achieved by extending HiveContext, and correspondingly HiveCatalog.
I have my own implementation of the trait "Catalog", which overrides the
"lookupRelation" method to do the magic behind the scenes.

However, in Spark 2.0, I see the following:
SessionCatalog - which contains the lookupRelation method, but doesn't have any
interface / abstract class for it.
ExternalCatalog - which deals with CatalogTable instead of DataFrame / LogicalPlan.
Catalog - which also doesn't expose any method to look up a DataFrame / LogicalPlan.

So apparently it looks like extending SessionCatalog is the only option.
However, I just wanted to get feedback on whether there's a better / recommended
approach to achieve this.
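
For concreteness, here is a rough sketch of the subclassing approach I have in
mind. The constructor and method signatures are my reading of the 2.0.0 sources,
so treat them as approximate, and loadAndRegisterFromMetadataStore is just a
placeholder for my application-specific logic:

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.{CatalystConf, TableIdentifier}
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.catalog.{ExternalCatalog, FunctionResourceLoader, SessionCatalog}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

class ExternalLookupCatalog(
    externalCatalog: ExternalCatalog,
    functionResourceLoader: FunctionResourceLoader,
    functionRegistry: FunctionRegistry,
    conf: CatalystConf,
    hadoopConf: Configuration)
  extends SessionCatalog(
    externalCatalog, functionResourceLoader, functionRegistry, conf, hadoopConf) {

  override def lookupRelation(
      name: TableIdentifier,
      alias: Option[String] = None): LogicalPlan = {
    if (!tableExists(name)) {
      // Consult the external metadata store for location and format
      // (parquet / csv / jdbc), load the DataFrame, and register it as a
      // temporary view so later lookups are served directly.
      loadAndRegisterFromMetadataStore(name)
    }
    super.lookupRelation(name, alias)
  }

  // Application-specific; shown only as a placeholder.
  private def loadAndRegisterFromMetadataStore(name: TableIdentifier): Unit = {
    // e.g. resolve location/format, then
    // spark.read.format(fmt).load(path).createOrReplaceTempView(name.table)
  }
}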


Thanks and regards,


Kapil Malik
*Sr. Principal Engineer | Data Platform, Technology*
M: +91 8800836581 | T: 0124-433 | EXT: 20910
ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
Gurgaon | Haryana | India



Re: Support for Hive 2.x

2016-09-03 Thread Steve Loughran

On 2 Sep 2016, at 18:40, Dongjoon Hyun wrote:

Hi, Rostyslav,

After your email, I also tried to search this morning, but I didn't find a
proper one.

The last related issue is SPARK-8064, `Upgrade Hive to 1.2`

https://issues.apache.org/jira/browse/SPARK-8064

If you want, you can file a JIRA issue including your pain points, and then you
can monitor progress through it.

I guess you have more reasons to do that, not just a compilation issue.



That was a pretty major change, as Spark SQL and the Spark Thrift server make use
of the library in ways that the Hive authors never intended, which forced the
Spark team to do terrible things to get stuff to hook up (Thrift).

On the SQL side of things, parser changes broke stuff, as did changed error
messages. Work there involved catching up with the changes and distinguishing
real regressions from simple changes in error messages triggering false alarms.

Oh, and then there was the Kryo version. Twitter has been moving Chill to Kryo 3
in sync with their other codebase (Storm?), and Spark's Kryo version is driven by
Chill; Hive needs to be in sync there, or (as is done for Spark) a custom build
of the Hive JAR must be made, forcing it onto the same version as Chill and Spark.

I did some preparatory work on a branch opening the Hive Thrift server up for
better subclassing:

https://issues.apache.org/jira/browse/SPARK-10793

(FWIW Hive 1.2.1 actually uses a copy-and-paste of the Hadoop 0.23 version of the
Hadoop YARN service classes, without the YARN-117 changes. If those could be
moved back to the Hadoop reference implementation (i.e. commit to Hadoop 2.2+
and migrate back), and the Thrift classes were reworked for better subclassing,
life would be simpler, leaving only the SQL changes and the protobuf and Kryo
versions...)

Bests,
Dongjoon.



On Fri, Sep 2, 2016 at 12:51 AM, Rostyslav Sotnychenko wrote:
Hello!

I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed.

So I am wondering if there are any talks going on about adding support for Hive
2.x to Spark? I was unable to find any JIRA about this.


Thanks,
Rostyslav