Contribution to Apache Spark
Hello, I am Aditya Vyas and I am currently in my third year of college, doing a BTech in engineering. I know Python and a little bit of Java. I want to start contributing to Apache Spark; this is my first time in the field of Big Data. Can someone please help me with how to get started? Which resources should I look at? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Contribution-to-Apache-Spark-tp18852.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Is Spark's KMeans unable to handle bigdata?
Thank you very much Sean! If you would like, this could serve as an answer to the StackOverflow question: [Is Spark's kMeans unable to handle bigdata?](http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata). Enjoy your weekend, George

On Sat, Sep 3, 2016 at 1:22 AM, Sean Owen wrote:
> I opened https://issues.apache.org/jira/browse/SPARK-17389 to track some improvements, but by far the big one is that the init steps parameter defaults to 5, when the paper says that 2 is pretty much optimal here. It's much faster with that setting.
>
> On Fri, Sep 2, 2016 at 6:45 PM, Georgios Samaras wrote:
> > I am not using the "runs" parameter anyway, but I see your point. If you could point out any modifications in the minimal example I posted, I would be more than interested to try them!
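For the archive, the setting Sean mentions is a one-liner against the Spark 2.x `ml` API. A minimal sketch, assuming a DataFrame `dataset` with a "features" vector column; k = 100 is a placeholder value:

```scala
// Sketch: tune k-means|| initialization per the observation in SPARK-17389.
// Assumes `dataset` is a DataFrame with a "features" vector column.
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(100)                  // placeholder cluster count
  .setInitMode("k-means||")   // the default init mode
  .setInitSteps(2)            // down from the default of 5; near-optimal per the paper
  .setMaxIter(20)

val model = kmeans.fit(dataset)
```

With large feature dimensions and cluster counts, the init phase dominates runtime, which is why lowering `initSteps` helps so much.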
Re: Committing Kafka offsets when using DirectKafkaInputDStream
The Kafka commit API isn't transactional; you aren't going to get exactly-once behavior out of it even if you were committing offsets on a per-partition basis. This doesn't really have anything to do with Spark; the old code you posted was already inherently broken. Make your outputs idempotent and use commitAsync, or store offsets transactionally in your own data store.

On Fri, Sep 2, 2016 at 5:50 PM, vonnagy wrote:
> I have upgraded to Spark 2.0 and am experimenting with using Kafka 0.10.0. I have a stream from which I extract the data, and I would like to update the Kafka offsets as each partition is handled. With Spark 1.6 or Spark 2.0 and Kafka 0.8.2 I was able to update the offsets, but now there seems to be no way to do so. Here is an example:
>
> val stream = getStream
>
> stream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>
>   rdd.foreachPartition { events =>
>     val partId = TaskContext.get.partitionId
>     val offsets = offsetRanges(partId)
>
>     // Do something with the data
>
>     // Update the offsets for the partition so at most the partition's
>     // data would be duplicated
>   }
> }
>
> With the new stream, I could call `commitAsync` with the offsets, but the drawback here is that it would only update the offsets after the entire RDD is handled. This can be a real issue for near "exactly once".
>
> With the new logic, each partition has a Kafka consumer associated with it; however, there is no access to it. I have looked at the CachedKafkaConsumer classes and there is no way to get at the cache either, so I cannot call a commit on the offsets.
>
> Beyond that, I have tried to use the new Kafka 0.10 APIs directly, but I always run into errors, as they require one to subscribe to the topic and get assigned partitions. I only want to update the offsets in Kafka.
> Any ideas would be helpful on how I might work with the Kafka API to set the offsets, or on getting Spark to add logic to allow committing offsets on a per-partition basis.
>
> Thanks,
>
> Ivan
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Committing-Kafka-offsets-when-using-DirectKafkaInputDStream-tp18840.html
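The commitAsync pattern recommended above can be sketched as follows against the spark-streaming-kafka-0-10 integration; `stream` is the direct Kafka stream from the question, and `writeIdempotently` is a hypothetical idempotent sink standing in for the per-partition output:

```scala
// Sketch: commit offsets back to Kafka after idempotent output, using the
// spark-streaming-kafka-0-10 integration. `stream` is the direct stream and
// `writeIdempotently` is a hypothetical idempotent sink.
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the offset ranges on the driver, before any transformation
  // changes the partitioning.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { events =>
    writeIdempotently(events) // the output must tolerate replays
  }

  // Commit only after the whole batch's output has succeeded. The commit is
  // asynchronous and non-transactional, so replays remain possible on failure;
  // idempotent output is what makes the end result effectively exactly-once.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Committing at batch granularity rather than per partition is exactly the point of the advice above: since the Kafka commit carries no transactional guarantee either way, finer-grained commits would not buy exactly-once semantics.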
Catalog, SessionCatalog and ExternalCatalog in spark 2.0
Hi all, I have a Spark SQL 1.6 application in production which does the following on executing sqlContext.sql(...):

1. Identify the table name mentioned in the query
2. Use an external database to decide where the data is located, in which format (parquet or csv or jdbc), etc.
3. Load the dataframe
4. Register it as a temp table (for future calls to this table)

This is achieved by extending HiveContext, and correspondingly HiveCatalog. I have my own implementation of the trait "Catalog", which overrides the "lookupRelation" method to do the magic behind the scenes.

However, in Spark 2.0 I see the following:

SessionCatalog - contains a lookupRelation method, but has no interface / abstract class to it.
ExternalCatalog - deals with CatalogTable instead of a DataFrame / LogicalPlan.
Catalog - also doesn't expose any method to look up a DataFrame / LogicalPlan.

So apparently it looks like I need to extend SessionCatalog only. However, I just wanted feedback on whether there is a better / recommended approach to achieve this.

Thanks and regards,

Kapil Malik
Sr. Principal Engineer | Data Platform, Technology
M: +91 8800836581 | T: 0124-433 | EXT: 20910
ASF Centre A | 1st Floor | Udyog Vihar Phase IV | Gurgaon | Haryana | India
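A rough sketch of the SessionCatalog route described above. This is internal API, so the constructor and method signatures should be checked against the Spark 2.0 source before relying on them; `resolveFromExternalStore` is a hypothetical helper standing in for the external-metadata lookup, and the subclass would still need to be wired in through a custom SessionState:

```scala
// Sketch only: SessionCatalog is internal Catalyst API and its signatures
// may change between 2.x releases. `resolveFromExternalStore` is hypothetical.
import org.apache.spark.sql.catalyst.{CatalystConf, TableIdentifier}
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
import org.apache.spark.sql.catalyst.catalog.{ExternalCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

class ExternallyBackedCatalog(
    externalCatalog: ExternalCatalog,
    functionRegistry: FunctionRegistry,
    conf: CatalystConf)
  extends SessionCatalog(externalCatalog, functionRegistry, conf) {

  override def lookupRelation(
      name: TableIdentifier,
      alias: Option[String] = None): LogicalPlan = {
    if (!tableExists(name)) {
      // Consult the external database, load the data in the right format,
      // and register it as a temp table for this and future lookups.
      resolveFromExternalStore(name)
    }
    super.lookupRelation(name, alias)
  }

  private def resolveFromExternalStore(name: TableIdentifier): Unit = {
    // ... look up location/format externally, load, createTempView(...) ...
  }
}
```

Since SessionCatalog is concrete rather than a trait, subclassing it as above (rather than implementing an interface, as with the 1.6 Catalog trait) does appear to be the intended extension point in 2.0, but it carries the usual risk of tracking internal API changes.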
Re: Support for Hive 2.x
On 2 Sep 2016, at 18:40, Dongjoon Hyun wrote:

Hi, Rostyslav, After your email, I also tried to search this morning, but I didn't find a proper one. The last related issue is SPARK-8064, `Upgrade Hive to 1.2`: https://issues.apache.org/jira/browse/SPARK-8064 If you want, you can file a JIRA issue including your pain points, and then you can monitor progress through it. I guess you have more reasons to do that, not just a compilation issue.

That was a pretty major change, as Spark SQL and the Spark Thrift server make use of the library in ways that the Hive authors never intended, and so forced the Spark teams to do terrible things to get stuff to hook up (thrift).

On the SQL side of things, parser changes broke stuff, as did changed error messages. Work there involved catching up with the changes, and differentiating regressions from simple changes in error messages triggering false alarms.

Oh, and then there was the kryo version. Twitter have been moving Chill to Kryo 3 in sync with their other codebase (storm?); Spark's Kryo version is driven by Chill, so Hive needs to be in sync there, or (as is done for Spark) a custom build of the hive JAR made, forcing it onto the same version as Chill and Spark.

I did some preparatory work on a branch opening the Hive thrift server up for better subclassing: https://issues.apache.org/jira/browse/SPARK-10793

(FWIW, Hive 1.2.1 actually uses a copy-and-paste of the Hadoop 0.23 version of the Hadoop YARN service classes, without the YARN-117 changes. If they could be moved back to the Hadoop reference implementation (i.e. commit to Hadoop 2.2+ and migrate back), and the thrift classes were reworked for better subclassing, life would be simpler, leaving only the SQL changes and the protobuf and kryo versions...)

Bests, Dongjoon.

On Fri, Sep 2, 2016 at 12:51 AM, Rostyslav Sotnychenko wrote: Hello! I tried compiling Spark 2.0 with Hive 2.0, but as expected this failed. So I am wondering if there are any talks going on about adding support for Hive 2.x to Spark?
I was unable to find any JIRA about this. Thanks, Rostyslav