Re: PIG to Spark
I may be wrong here, but when I look at Apache Pig on GitHub it shows 8 contributors, while Apache Spark shows more than 1,000. If that is true, I ask myself: why not shift to Spark by learning it?

I also started with nightmarish MapReduce Java coding, then Hive, then Pig, and then realised that time is best spent actually solving data problems rather than programming problems. I know consultants who ended up convincing their clients that it was better to write Java programs than to use Spark SQL, and then spent close to 2.5 years unable to deliver anything that works, where the actual project was just a single Spark SQL statement.

Personally, I prefer to learn, adapt, and move existing code to the platform that gives me the maximum business benefit with the fewest headaches. But once again, it is just a matter of opinion.

Regards,
Gourav Sengupta

On Mon, Jan 8, 2018 at 3:25 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
> Hi
>
> Is there a convenient way or open-source project to convert Pig scripts to
> Spark?
>
> Regards
> Pralabh Kumar
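The "single Spark SQL statement" point can be made concrete: a typical Pig pipeline (LOAD, GROUP, FOREACH ... GENERATE, STORE) often collapses into one SQL statement. The sketch below is hypothetical (table and column names are invented), and Python's sqlite3 stands in for the Spark SQL engine purely so the snippet runs anywhere; in Spark you would issue the same statement via spark.sql(...) against a registered table.

```python
import sqlite3

# Hypothetical sales data; in Spark this would be a registered table/view.
rows = [("us", 10.0), ("us", 5.0), ("eu", 7.5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# The whole Pig pipeline --
#   A = LOAD 'sales' ...; B = GROUP A BY region;
#   C = FOREACH B GENERATE group, SUM(A.amount); STORE C ...
# -- becomes one declarative statement:
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)  # {'us': 15.0, 'eu': 7.5} (key order may vary)
```

The same statement, unchanged, is valid Spark SQL; only the surrounding plumbing differs.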
Re: PIG to Spark
Pig supports a Spark execution engine now (since Pig 0.16, via `pig -x spark`), so you can get Spark execution from an unmodified Pig script. I am afraid there is no tool to convert a Pig script into Spark API code, though.

Pralabh Kumar <pralabhku...@gmail.com> wrote on Mon, Jan 8, 2018 at 11:25 PM:
> Hi
>
> Is there a convenient way or open-source project to convert Pig scripts to
> Spark?
>
> Regards
> Pralabh Kumar
PIG to Spark
Hi,

Is there a convenient way or open-source project to convert Pig scripts to Spark?

Regards
Pralabh Kumar
queries on Spork (Pig on Spark)
Hi,

As a beginner, I have the following questions about Spork (Pig on Spark).

I have cloned the branch: git clone https://github.com/apache/pig -b spark

1. Which versions of Pig and Spark is Spork built against?
2. I followed the steps in https://issues.apache.org/jira/browse/PIG-4059 and tried to run a simple Pig script that just loads a file and dumps/stores it. I am getting errors:

grunt> A = load '/tmp/words_tb.txt' using PigStorage('\t') as (empNo:chararray,empName:chararray,salary:chararray);
grunt> Store A into '/tmp/spork';
2015-11-25 05:35:52,502 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2015-11-25 05:35:52,875 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2015-11-25 05:35:52,883 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - Not MR mode. RollupHIIOptimizer is disabled
2015-11-25 05:35:52,894 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]}
2015-11-25 05:35:52,966 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-11-25 05:35:52,983 [main] INFO org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - add Files Spark Job
2015-11-25 05:35:53,137 [main] INFO org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Converting operator POLoad (Name: A: Load(/tmp/words_tb.txt:PigStorage(' ')) - scope-29 Operator Key: scope-29)
2015-11-25 05:35:53,205 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Could not initialize class org.apache.spark.rdd.RDDOperationScope$
Details at logfile: /home/pig/pig_1448425672112.log

Can you please help me point out what is wrong? I appreciate your help.

Thanks,
Regards,
Divya
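For reference, the grunt session above only loads a tab-separated file and stores it back out. What PigStorage('\t') does can be sketched in plain Python, with no Pig or Spark involved (the file contents below are invented to match the schema in the script):

```python
import io

# Stand-in for /tmp/words_tb.txt: tab-separated, one record per line.
data = io.StringIO("1\talice\t100\n2\tbob\t200\n")

# LOAD ... USING PigStorage('\t') AS (empNo, empName, salary):
# split each line on tabs and attach the declared field names.
fields = ("empNo", "empName", "salary")
records = [dict(zip(fields, line.rstrip("\n").split("\t"))) for line in data]

# STORE ... is the inverse: re-join the fields with tabs.
stored = "\n".join("\t".join(r.values()) for r in records)
print(records[0]["empName"])  # alice
```

The error in the session happens before any of this logic runs: the Spark launcher fails while converting the POLoad operator, so the problem is in the environment, not the script.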
Re: queries on Spork (Pig on Spark)
Log file contents:

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. Could not initialize class org.apache.spark.rdd.RDDOperationScope$

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
        at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
        at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:1094)
        at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:91)
        at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:61)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:666)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:585)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:534)
        at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:209)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
        at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
        at org.apache.pig.PigServer.store(PigServer.java:997)
        at org.apache.pig.PigServer.openIterator(PigServer.java:910)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:558)
        at org.apache.pig.Main.main(Main.java:170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

I don't understand the cause of this error.

Thanks,
Regards,
Divya

On 25 November 2015 at 14:00, Jeff Zhang <zjf...@gmail.com> wrote:
> >>> Details at logfile: /home/pig/pig_1448425672112.log
>
> You need to check the log file for details.
Re: Update on Pig on Spark initiative
This is really exciting! Thanks so much for this work; I think you've guaranteed Pig's continued vitality.

On Wednesday, August 27, 2014, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Awesome to hear this, Mayur! Thanks for putting this together.
Matei

On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote:
Hi,
We have migrated Pig functionality on top of Spark, passing 100% of the e2e success cases in the Pig test suite. That means UDFs, joins and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Update on Pig on Spark initiative
Hi,

We have migrated Pig functionality on top of Spark, passing 100% of the e2e success cases in the Pig test suite. That means UDFs, joins and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks).

Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork. This contains all the major changes but may not have all the patches required for 100% e2e; if you are trying it out, let me know about any issues you face.

A whole bunch of folks contributed to this: Julien Le Dem (Twitter), Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga (Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics), Aniket Mokashi (Google), Greg Owen (Databricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh Kalakoti (Sigmoid Analytics). Not to mention the Spark and Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Re: Update on Pig on Spark initiative
Awesome to hear this, Mayur! Thanks for putting this together.

Matei

On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote:
Hi,
We have migrated Pig functionality on top of Spark, passing 100% of the e2e success cases in the Pig test suite. That means UDFs, joins and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork
Re: Re: Pig 0.13, Spark, Spork
Hi Bertrand,

We've updated the document: http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0
This is our working GitHub repo: https://github.com/sigmoidanalytics/spork/tree/spork-0.9
Feel free to open issues here: https://github.com/sigmoidanalytics/spork/issues

Thanks
Best Regards

On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
@Mayur: I won't fight over the semantics of a fork, but at the moment, no, Spork does not take the standard Pig as a dependency. On that, we should agree. As for my use of Pig, I have no limitation. I am, however, interested to see the rise of a 'no-sql high-level non-programming language' for Spark.
@Zhang: Could you elaborate on your reference to Twitter?

Bertrand Dechoux

On Tue, Jul 8, 2014 at 4:04 AM, Zhang Baofeng (张包峰) <pelickzh...@qq.com> wrote:
Hi guys, previously I checked out the old Spork and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1; see my GitHub project: https://github.com/pelick/flare-spork
It is also highly experimental, and just directly maps Pig physical operations to Spark RDD transformations/actions. It works for simple requests. :)
I am also interested in the progress of Spork. Is it being continued at Twitter in a non-open-source way?

Thanks
Zhang Baofeng
Blog http://blog.csdn.net/pelick | Github https://github.com/pelick | Weibo http://weibo.com/pelickzhang | LinkedIn http://www.linkedin.com/pub/zhang-baofeng/70/609/84

------ Original Message ------
From: Mayur Rustagi <mayur.rust...@gmail.com>
Sent: Monday, July 7, 2014, 11:55 PM
To: user@spark.apache.org
Subject: Re: Pig 0.13, Spark, Spork

That version is old :). We are not forking Pig but cleanly separating out the Pig execution engine. Let me know if you are willing to give it a go. Also, would love to know what features of Pig you are using.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Pig 0.13, Spark, Spork
Hi,

I was wondering what the state of the Pig+Spark initiative is, now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez, but could it be used by Spark? I know about a 'theoretical' project called Spork, but I don't know of any stable and maintained version of it.

Regards

Bertrand Dechoux
Re: Pig 0.13, Spark, Spork
Hi,

We have fixed many major issues around Spork, deploying it with some customers. Would be happy to provide a working version for you to try out. We are looking for more folks to try it out and submit bugs.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
Hi,
I was wondering what the state of the Pig+Spark initiative is, now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez, but could it be used by Spark? I know about a 'theoretical' project called Spork, but I don't know of any stable and maintained version of it.
Regards
Bertrand Dechoux
Re: Pig 0.13, Spark, Spork
That version is old :). We are not forking Pig but cleanly separating out the Pig execution engine. Let me know if you are willing to give it a go. Also, would love to know what features of Pig you are using.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
I saw a wiki page from your company, but with an old version of Spark: http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
I have no reason to use it yet, but I am interested in the state of the initiative. What's your point of view (personal and/or professional) on the Pig 0.13 release? Is the pluggable execution engine flexible enough to avoid having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
As a (for now) external observer, I am glad to see competition in that space. It can only be good for the community in the end.
Bertrand Dechoux

On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
Hi,
We have fixed many major issues around Spork, deploying it with some customers. Would be happy to provide a working version for you to try out. We are looking for more folks to try it out and submit bugs.
Regards
Mayur
Re: Pig 0.13, Spark, Spork
Hi guys, previously I checked out the old Spork and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1; see my GitHub project: https://github.com/pelick/flare-spork

It is also highly experimental, and just directly maps Pig physical operations to Spark RDD transformations/actions. It works for simple requests. :)

I am also interested in the progress of Spork. Is it being continued at Twitter in a non-open-source way?

Thanks
Zhang Baofeng
Blog | Github | Weibo | LinkedIn

------ Original Message ------
From: Mayur Rustagi <mayur.rust...@gmail.com>
Sent: Monday, July 7, 2014, 11:55 PM
To: user@spark.apache.org
Subject: Re: Pig 0.13, Spark, Spork

That version is old :). We are not forking Pig but cleanly separating out the Pig execution engine. Let me know if you are willing to give it a go. Also, would love to know what features of Pig you are using.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
I saw a wiki page from your company, but with an old version of Spark: http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
I have no reason to use it yet, but I am interested in the state of the initiative. What's your point of view (personal and/or professional) on the Pig 0.13 release? Is the pluggable execution engine flexible enough to avoid having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
As a (for now) external observer, I am glad to see competition in that space. It can only be good for the community in the end.
Bertrand Dechoux
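The "directly mapping Pig physical operations to Spark RDD transformations/actions" approach described above can be illustrated without Spark at all: each Pig operator has a natural counterpart among collection transformations. In the sketch below, plain Python lists stand in for RDDs, and the operators and data are invented for illustration:

```python
from itertools import groupby

records = [("a", 1), ("b", 2), ("a", 3)]

# Pig FILTER  ->  RDD.filter
filtered = [r for r in records if r[1] > 1]

# Pig GROUP BY  ->  RDD.groupByKey (sorting first plays the role of the
# shuffle, which co-locates equal keys before grouping)
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(records), key=lambda r: r[0])}

# Pig FOREACH ... GENERATE  ->  RDD.map over the grouped result
sums = {k: sum(vs) for k, vs in grouped.items()}
print(sums)  # {'a': 4, 'b': 2}
```

A Pig-on-Spark translator walks the Pig physical plan and emits one such transformation per operator, which is why simple scripts work early while operators with no direct counterpart (UDFs, algebraic aggregates) lag behind.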
Re: Pig on Spark
Hey Mayur,

We use HiveColumnarLoader and XMLLoader. Are these working as well? Will try a few things regarding porting Java MR.

Regards,
Suman Bharadwaj S

On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
Right now UDFs are not working. They're at the top of the list though; you should be able to use them soon :) Are there any other Pig features you use often, apart from the usual suspects?
Existing Java MR jobs would be an easier move. Are these Cascading jobs or single map-reduce jobs? If single, then you should be able to write a Scala wrapper to call the map-reduce functions and, with some magic, leave your core code be. Would be interesting to see an actual example and get it to work.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj <suman@gmail.com> wrote:
We are currently in the process of converting Pig and Java map-reduce jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was checking whether we can leverage Spork without converting to Spark jobs. And is there any way I can port my existing Java MR jobs to Spark?
I know this thread has a different subject; let me know if I need to ask this question in a separate thread. Thanks in advance.

On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
UDF, Generate and many, many more are not working :) Several of them do work: joins, filters, group by etc. I am translating the ones we need, and would be happy to get help on others. Will host a JIRA to track them if you are interested.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <suman@gmail.com> wrote:
Are all the features available in Pig working in Spork? Like, for example, UDFs?
Thanks.
On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power; they want something dead simple with which they can do a DSL. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close: it wraps a lot of framework-level semantics away from the user and lets him focus on data flow.
2. Some have codebases in Pig already and are just looking to do it faster. I have yet to benchmark that on Pig on Spark.
I agree that Pig on Spark cannot solve a lot of problems, but it can solve some without forcing the end customer to do anything even close to coding. I believe there is quite some value in making Spark accessible to a larger audience. End of the day, to each his own :)

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com> wrote:
This seems like an interesting question.
I love Apache Pig. It is so natural and the language flows with nice syntax. While I was at Yahoo! in core Hadoop engineering, I used Pig a lot for analytics and gave the Pig team feedback to add much more functionality when it was at version 0.7. Lots of new functionality is offered now.
At the end of the day, Pig is a DSL for data flows. There will always be gaps and enhancements. I often wondered: is a DSL the right way to solve data-flow problems? Maybe not; we need complete language constructs. We may have found the answer: Scala. With Scala's dynamic compilation, we can write much more powerful constructs than any DSL can provide. If I were a new organization beginning to choose, I would go with Scala. Here is the example:

#!/bin/sh
exec scala "$0" "$@"
!#

YOUR DSL GOES HERE, BUT IN SCALA!

You have DSL-like scripting, functional style and complete language power!
If we can improve the first 3 lines, there you go: you have a most powerful DSL to solve data problems.

-Bharath

On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
Hi Sameer,
Lin (cc'ed) could also give you some updates about Pig on Spark development on her side.
Best, Xiangrui

On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
Hi Mayur,
We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the goal is to get Spork set up next month. I will keep you posted. Can you please keep me informed about your progress as well?

From: mayur.rust...@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org
Hi Sameer,
Did you make any progress on this? My team is also trying it out; would love to know some details on progress.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Thu, Mar 6, 2014 at 2:20 PM
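Mayur's suggestion earlier in this thread, to keep the existing map-reduce functions and call them from a thin wrapper, can be sketched in miniature. Python stands in for the Scala wrapper here, and the mapper/reducer functions are invented stand-ins for the "untouched core code"; the wrapper has the same shape as rdd.flatMap(mapper).groupByKey().map(reducer) would in Spark:

```python
# Existing "MR-era" business logic, left untouched (hypothetical functions):
def mapper(line):                 # emits (key, value) pairs for one input line
    word = line.strip().lower()
    return [(word, 1)] if word else []

def reducer(key, values):         # combines all values seen for one key
    return (key, sum(values))

# Thin wrapper: flatMap over the mapper, group by key, map the reducer.
def run_pipeline(lines):
    pairs = [kv for line in lines for kv in mapper(line)]
    by_key = {}
    for k, v in pairs:
        by_key.setdefault(k, []).append(v)
    return dict(reducer(k, vs) for k, vs in by_key.items())

print(run_pipeline(["Spark", "pig", "spark"]))  # {'spark': 2, 'pig': 1}
```

The point of the design is that only run_pipeline changes when moving engines; the mapper and reducer carry over as-is.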
Re: Pig on Spark
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark, I've often thought that it feels very unnatural to use imperative code to declare a pipeline.
Re: Pig on Spark
It depends; personally, I have the opposite opinion. IMO expressing pipelines in a functional language feels natural; you just have to get used to the language (Scala). Testing Spark jobs is easy, whereas testing a Pig script is much harder and less natural.

If you want a more high-level language that deals with RDDs for you, you can use Spark SQL: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Of course you can express fewer things this way, but if you have some complex logic, I think it would make sense to write a classic Spark job, which would be more robust in the long term.

2014-04-25 15:30 GMT+02:00, Mark Baker <dist...@acm.org>:
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark, I've often thought that it feels very unnatural to use imperative code to declare a pipeline.
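The testability point above is easy to demonstrate: if a job's logic is factored into a pure function over a collection, it can be unit-tested with no cluster at all. A sketch (function name and data are invented; in real Spark code the same function body would be applied to an RDD or DataFrame):

```python
def top_earners(rows, threshold):
    """Pure transformation: keep salaries above threshold, highest first."""
    return sorted((r for r in rows if r[1] > threshold),
                  key=lambda r: r[1], reverse=True)

# The "test" needs nothing but a plain list -- no cluster, no Pig runtime.
rows = [("alice", 120), ("bob", 80), ("carol", 150)]
result = top_earners(rows, 100)
print(result)  # [('carol', 150), ('alice', 120)]
```

A Pig script, by contrast, can usually only be tested by running it end to end against files, which is what makes it the harder artifact to test.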
Re: Pig on Spark
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it.

The devil is in the details. "Allowing developers to declare *what* they want" seems impractical in a declarative world, since we are bound by the DSL constructs. The workaround, or rather hack, is to use UDFs to get full language constructs. Some problems are hard; you will have to twist your mind to solve them in a restrictive way, and at that point we wish we had complete language power. Having been in the Big Data world a short time (7 years), I've seen enough problems with Hive/Pig. All I am providing here is a thought to spark the Spark community to think beyond declarative constructs. I am sure there is a place for Pig and Hive.

-Bharath

On Fri, Apr 25, 2014 at 10:21 AM, Michael Armbrust <mich...@databricks.com> wrote:
On Fri, Apr 25, 2014 at 6:30 AM, Mark Baker <dist...@acm.org> wrote:
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it.

Having Pig too would certainly be a win, but Spark SQL (http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html) is also a declarative layer on top of Spark. Since the optimization is lazy, you can chain multiple SQL statements in a row and still optimize them holistically (similar to a Pig job). Alpha version coming soon to a Spark 1.0 release near you! Spark SQL also lets you drop back into functional Scala when that is more natural for a particular task.
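The UDF "workaround" described above, escaping a restrictive DSL by registering full-language functions, can be shown in miniature. Below, a toy declarative pipeline spec can only filter, until a registered UDF gives it arbitrary power; every name here is invented purely for illustration:

```python
# Registry of user-defined functions: the DSL's escape hatch into the
# full host language.
UDFS = {"square": lambda x: x * x}

def run(spec, data):
    """Interpret a tiny declarative spec: a list of (op, arg) steps."""
    for op, arg in spec:
        if op == "filter":            # built-in, restrictive construct
            data = [x for x in data if x > arg]
        elif op == "apply":           # full-language power lives here
            data = [UDFS[arg](x) for x in data]
    return data

print(run([("filter", 1), ("apply", "square")], [0, 1, 2, 3]))  # [4, 9]
```

The tension Bharath points at is visible even at this scale: everything interesting ends up inside the UDF registry, outside the reach of whatever optimizer the declarative layer has.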
Re: Pig on Spark
Hi,

We got Spork working on Spark 0.9.0. Repository available at: https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix

Please send us your feedback.

- Lalit Yadav
la...@sigmoidanalytics.com

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p4668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Pig on Spark
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power; they want something dead simple with which they can do a DSL. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close: it wraps a lot of framework-level semantics away from the user and lets him focus on data flow.
2. Some have codebases in Pig already and are just looking to do it faster. I have yet to benchmark that on Pig on Spark.
I agree that Pig on Spark cannot solve a lot of problems, but it can solve some without forcing the end customer to do anything even close to coding. I believe there is quite some value in making Spark accessible to a larger audience. End of the day, to each his own :)

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com> wrote:
This seems like an interesting question.
I love Apache Pig. It is so natural and the language flows with nice syntax. While I was at Yahoo! in core Hadoop engineering, I used Pig a lot for analytics and gave the Pig team feedback to add much more functionality when it was at version 0.7. Lots of new functionality is offered now.
At the end of the day, Pig is a DSL for data flows. There will always be gaps and enhancements. I often wondered: is a DSL the right way to solve data-flow problems? Maybe not; we need complete language constructs. We may have found the answer: Scala. With Scala's dynamic compilation, we can write much more powerful constructs than any DSL can provide. If I were a new organization beginning to choose, I would go with Scala. Here is the example:

#!/bin/sh
exec scala "$0" "$@"
!#

YOUR DSL GOES HERE, BUT IN SCALA!

You have DSL-like scripting, functional style and complete language power! If we can improve the first 3 lines, there you go: you have a most powerful DSL to solve data problems.
-Bharath

On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
Hi Sameer,
Lin (cc'ed) could also give you some updates about Pig on Spark development on her side.
Best, Xiangrui

On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
Hi Mayur,
We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the goal is to get Spork set up next month. I will keep you posted. Can you please keep me informed about your progress as well?

From: mayur.rust...@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org
Hi Sameer,
Did you make any progress on this? My team is also trying it out; would love to know some details on progress.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com
To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; it's in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off the schema-tuple backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group-by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with the UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
I had asked a similar question on the dev mailing list a while back (Jan 22nd).
See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser (look for "spork"). Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week, and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote: Hi everyone, We are using Pig
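Bharath's point above (that a full host language can subsume the DSL) can be sketched in a few lines. The following is an illustrative, hypothetical fluent pipeline; Python stands in for the Scala version he describes, and all names here (Pipeline, filter, group_by) are made up for illustration, not any real Pig or Spark API:

```python
# A toy "embedded DSL" for data flows in a general-purpose language.
# Pig equivalent of the pipeline below:
#   A = LOAD ...; B = FILTER A BY salary > 100; C = GROUP B BY dept;

class Pipeline:
    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, pred):
        # Pig's FILTER ... BY: keep rows satisfying the predicate
        return Pipeline(r for r in self.rows if pred(r))

    def group_by(self, key):
        # Pig's GROUP ... BY: collect rows into per-key buckets
        groups = {}
        for r in self.rows:
            groups.setdefault(key(r), []).append(r)
        return groups

rows = [("alice", "eng", 120), ("bob", "eng", 90), ("carol", "ops", 150)]
grouped = Pipeline(rows).filter(lambda r: r[2] > 100).group_by(lambda r: r[1])
print({dept: len(rs) for dept, rs in grouped.items()})  # {'eng': 1, 'ops': 1}
```

The host language gives you the same declarative reading as the DSL, plus functions, tests, and libraries for free.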
Re: Pig on Spark
Are all the features available in Pig working in Spork? For example, UDFs?

Thanks.

On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: There are two benefits I get as of now: most customers want something dead simple they can use as a DSL, and some already have codebases in Pig they just want to run faster.
Re: Pig on Spark
UDF, Generate, and many, many more are not working :) Several of them do work: joins, filters, group by, etc. I am translating the ones we need, and would be happy to get help on others. Will host a JIRA to track them if you are interested.

Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.com wrote: Are all the features available in Pig working in Spork? For example, UDFs? Thanks.
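As an illustration of the kind of translation Mayur describes (joins, filters, group by already working), here is how a Pig-style GROUP ... BY plus COUNT conceptually maps onto keyed transformations. This is a sketch, not Spork's actual translation; plain Python stands in for Spark's RDD API:

```python
from collections import Counter

# Pig:
#   grouped = GROUP logs BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(logs);
# Spark (conceptually):
#   logs.map(lambda user: (user, 1)).reduceByKey(lambda a, b: a + b)

logs = ["alice", "bob", "alice", "alice"]
pairs = [(user, 1) for user in logs]   # map side: emit (key, 1)
counts = Counter()
for user, n in pairs:                  # reduceByKey, done locally here
    counts[user] += n
print(dict(counts))  # {'alice': 3, 'bob': 1}
```

The same shape (key extraction, then per-key combination) covers COUNT, SUM, and similar aggregates.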
Re: Pig on Spark
We are currently in the process of converting Pig and Java MapReduce jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was checking whether we can leverage Spork without converting to Spark jobs. Also, is there any way I can port my existing Java MR jobs to Spark? I know this thread has a different subject; let me know if I need to ask this question in a separate thread. Thanks in advance.

On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: UDF, Generate, and many, many more are not working :) Several of them do work: joins, filters, group by, etc.
Re: Pig on Spark
Right now UDFs are not working. They are at the top of the list though; you should be able to use them soon :) Are there any other Pig features you use often, apart from the usual suspects?

Existing Java MR jobs would be an easier move. Are these Cascading jobs or single map-reduce jobs? If single, you should be able to write a Scala wrapper that calls the map and reduce functions and, with some magic, leaves your core code as is. Would be interesting to see an actual example and get it to work.

Regards,
Mayur

Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj suman@gmail.com wrote: We are currently in the process of converting Pig and Java MapReduce jobs to Spark jobs, and we have written a couple of Pig UDFs as well.
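Mayur's wrapper suggestion for single-stage MR jobs can be sketched as follows. `run_local` is a hypothetical stand-in for the driver a Spark port would provide (roughly flatMap + groupByKey), and the mapper/reducer bodies represent the existing MR logic left untouched; Python is used here as a language-neutral illustration:

```python
from itertools import groupby

def run_local(records, mapper, reducer):
    """Thin driver mimicking the MR execution model: map, shuffle, reduce."""
    # map phase: each record may emit zero or more (key, value) pairs
    emitted = [kv for rec in records for kv in mapper(rec)]
    # shuffle phase: bring equal keys together
    emitted.sort(key=lambda kv: kv[0])
    out = {}
    for key, group in groupby(emitted, key=lambda kv: kv[0]):
        # reduce phase: one call per key with all its values
        out[key] = reducer(key, [v for _, v in group])
    return out

# Existing "MR" logic, unchanged: classic word count
def mapper(line):
    return [(w, 1) for w in line.split()]

def reducer(key, values):
    return sum(values)

print(run_local(["a b a", "b"], mapper, reducer))  # {'a': 2, 'b': 2}
```

The point is that only the driver changes; the mapper and reducer bodies carry over verbatim.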
Re: Pig on Spark
Hi Mayur, I wondered if you could share your findings in some way (GitHub, blog post, etc). I guess your experience will be very interesting/useful for many people.

Sent from Lenovo YogaTablet

On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

Hi Ankit, Thanx for all the work on Pig. Finally got it working. Couple of high-level bugs right now:
- Getting it working on Spark 0.9.0
- Getting UDFs working
- Getting Generate functionality working
- Exhaustive test suite for Pig on Spark

Are you maintaining a JIRA somewhere? I am currently trying to deploy it on 0.9.0. Regards, Mayur

Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.com wrote:

We will post fixes from our side at https://github.com/twitter/pig. Top on our list are:
1. Make it work with pig-trunk (execution engine interface) (with 0.8 or 0.9 Spark).
2. Support for algebraic UDFs (this mitigates the group-by OOM problems).

Would definitely love more contribution on this. Thanks, Aniket

On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Damn, I am off to NY for Structure Conf. Would it be possible to meet anytime after 28th March? I am really interested in making it stable, production quality. Regards, Mayur Rustagi

On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.com wrote: Hi Mayur, Are you going to the Pig meetup this afternoon? http://www.meetup.com/PigUser/events/160604192/ Aniket and I will be there. We would be happy to chat about Pig-on-Spark.

On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Lin, We are working on getting Pig on Spark functional with 0.8.0. Have you got it working on any Spark version? Also, what functionality works on it?
Regards,
Mayur

Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
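Aniket's note that algebraic UDF support mitigates the group-by OOM problems comes down to combiner-style partial aggregation: an algebraic aggregate like COUNT can be applied per partition before the shuffle, so one partial value per key per partition moves over the network instead of every raw record. A minimal sketch of the idea (not Spork's implementation; function names are illustrative):

```python
from collections import Counter

def partial_counts(partition):
    # "initial/intermediate" step: aggregate within one partition
    return Counter(partition)

def merge_counts(partials):
    # "final" step: merge the small per-partition partials
    total = Counter()
    for p in partials:
        total += p
    return total

partitions = [["a", "b", "a"], ["b", "b"]]
result = merge_counts(partial_counts(p) for p in partitions)
print(dict(result))  # {'a': 2, 'b': 3}
```

Because only the partials cross the partition boundary, memory at the reduce side stays proportional to the number of distinct keys, not the number of records.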
Re: Pig on Spark
Bam!!! http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 10, 2014 at 3:07 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi Mayur, I wondered if you could share your findings in some way (GitHub, blog post, etc). I guess your experience will be very interesting/useful for many people.
Re: Pig on Spark
Hi Ankit, Thanx for all the work on Pig. Finally got it working. Couple of high level bugs right now: - Getting it working on Spark 0.9.0 - Getting UDF working - Getting generate functionality working - Exhaustive test suite on Spark on Pig are you maintaining a Jira somewhere? I am currently trying to deploy it on 0.9.0. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.com wrote: We will post fixes from our side at - https://github.com/twitter/pig. Top on our list are- 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or 0.9 spark). 2. Support for algebraic udfs (this mitigates the group by oom problems). Would definitely love more contribution on this. Thanks, Aniket On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi mayur.rust...@gmail.comwrote: Dam I am off to NY for Structure Conf. Would it be possible to meet anytime after 28th March? I am really interested in making it stable production quality. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.comwrote: Hi Mayur, Are you going to the Pig meetup this afternoon? http://www.meetup.com/PigUser/events/160604192/ Aniket and I will be there. We would be happy to chat about Pig-on-Spark On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi mayur.rust...@gmail.comwrote: Hi Lin, We are working on getting Pig on spark functional with 0.8.0, have you got it working on any spark version ? Also what all functionality works on it? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.comwrote: Hi Sameer, Lin (cc'ed) could also give you some updates about Pig on Spark development on her side. 
Best, Xiangrui On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote: Hi Mayur, We are planning to upgrade our distribution MR1 MR2 (YARN) and the goal is to get SPROK set up next month. I will keep you posted. Can you please keep me informed about your progress as well. From: mayur.rust...@gmail.com Date: Mon, 10 Mar 2014 11:47:56 -0700 Subject: Re: Pig on Spark To: user@spark.apache.org Hi Sameer, Did you make any progress on this. My team is also trying it out would love to know some detail so progress. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote: Hi Aniket, Many thanks! I will check this out. Date: Thu, 6 Mar 2014 13:46:50 -0800 Subject: Re: Pig on Spark From: aniket...@gmail.com To: user@spark.apache.org; tgraves...@yahoo.com There is some work to make this work on yarn at https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up- in-progress). There are few known issues with this, I will work on fixing them soon. Known issues- 1. Limit does not work (spork-fix) 2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira) 3. Algebraic udfs dont work (spork-fix in-progress) 4. Group by rework (to avoid OOMs) 5. UDF Classloader issue (requires SPARK-1053, then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars) ~Aniket On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote: I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser- look for spork. Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. 
I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com
Re: Pig on Spark
Hi,

I have been following Aniket's spork github repository: https://github.com/aniket486/pig. I have done all the changes mentioned in the recently modified pig-spark file. I am using:

hadoop 2.0.5-alpha
spark-0.8.1-incubating
mesos 0.16.0

##PIG variables
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export SPARK_YARN_APP_JAR=/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.dump"
export SPARK_JAR=/home/ubuntu/spark/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
export SPARK_MASTER=yarn-client
export SPARK_HOME=/home/ubuntu/spark
export SPARK_JARS=/home/ubuntu/pig/contrib/piggybank/java/piggybank.jar
export PIG_CLASSPATH=${SPARK_JAR}:${SPARK_JARS}:/home/ubuntu/mesos/build/src/mesos-0.16.0.jar:/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_PIG_JAR=/home/ubuntu/pig/pig-withouthadoop.jar

This works fine in MapReduce and local mode. But while running in Spark mode I am facing the following error. The error comes after the job is submitted and run on the yarn-master. Can you please tell me how to proceed?

###error message
ERROR 2998: Unhandled internal error.
class org.apache.spark.util.InnerClosureFinder has interface org.objectweb.asm.ClassVisitor as super class

java.lang.IncompatibleClassChangeError: class org.apache.spark.util.InnerClosureFinder has interface org.objectweb.asm.ClassVisitor as super class
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    at org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:87)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:970)
    at org.apache.spark.rdd.RDD.map(RDD.scala:246)
    at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:68)
    at org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:38)
    at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:212)
    at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
    at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
    at org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:125)
    at org.apache.pig.PigServer.launchPlan(PigServer.java:1328)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1310)
    at org.apache.pig.PigServer.storeEx(PigServer.java:993)
    at org.apache.pig.PigServer.store(PigServer.java:957)
    at org.apache.pig.PigServer.openIterator(PigServer.java:870)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:729)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:370)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:609)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:622)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p3187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Pig on Spark
Hi Mayur,

We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the goal is to get Spork set up next month. I will keep you posted. Can you please keep me informed about your progress as well?

From: mayur.rust...@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org

Hi Sameer,
Did you make any progress on this? My team is also trying it out and would love to know some details of your progress.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:

Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com
To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off the schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group-by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:

I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork.

Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.

--
...:::Aniket:::... Quetzalco@tl
Re: PIG to SPARK
Thanks Mayur. I don't have a clear idea of how pipe works and wanted to understand more about it. When do we use pipe() and how does it work? Can you please share some sample code if you have it (even pseudo-code is fine)? It will really help.

Regards,
Suman Bharadwaj S

On Thu, Mar 6, 2014 at 3:46 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

The real question is why you want to run a Pig script using Spark. Are you planning to use Spark as the underlying processing engine for Pig? That's not simple. Are you planning to feed Pig data to Spark for further processing? Then you can write it to HDFS and trigger your Spark script. rdd.pipe is basically similar to Hadoop streaming, allowing you to run a script on each partition of the RDD and get the output as another RDD.

Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Mar 5, 2014 at 10:29 AM, suman bharadwaj suman@gmail.com wrote:

Hi,
How can I call a Pig script using Spark? Can I use rdd.pipe() here? And can anyone share a sample implementation of rdd.pipe(), and if you can explain how rdd.pipe() works, it would really help.

Regards,
SB
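[Editor's note: not from the thread, but a rough sketch of the semantics Mayur describes. rdd.pipe() writes each partition's records to an external command's stdin, one per line, and the command's stdout lines become the new partition. The plain-Python simulation below uses subprocess and the `tr` command; the partition data is made up for illustration, and real code would call pipe() on a SparkContext-created RDD.]

```python
import subprocess

def pipe_partition(records, command):
    """Mimic what rdd.pipe() does for one partition: feed records to the
    command's stdin (one per line) and collect its stdout lines."""
    proc = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# Hypothetical "RDD" of two partitions, piped through `tr a-z A-Z`
# (uppercase each record), analogous to rdd.pipe("tr a-z A-Z").
partitions = [["foo", "bar"], ["baz"]]
result = [pipe_partition(p, ["tr", "a-z", "A-Z"]) for p in partitions]
print(result)  # [['FOO', 'BAR'], ['BAZ']]
```

In real Spark the per-partition loop runs on the executors, so the piped command must be installed on every worker node.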
Pig on Spark
Hi everyone,

We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.
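[Editor's note: for context on what such a migration involves, a typical Pig LOAD/FILTER/STORE pipeline maps fairly directly onto RDD operations. The sketch below shows a hypothetical Pig script in comments and mimics the equivalent logic with plain Python lists; in real PySpark the list comprehensions would be sc.textFile(...).map(...).filter(...), and the script, file names, and data are invented for illustration.]

```python
# Hypothetical Pig script being translated:
#   A = LOAD 'emp.txt' USING PigStorage('\t')
#         AS (empNo:chararray, empName:chararray, salary:int);
#   B = FILTER A BY salary > 50000;
#   STORE B INTO 'high_earners';

# Stand-in for sc.textFile("emp.txt") -- made-up sample rows.
lines = ["1\tAlice\t60000", "2\tBob\t45000", "3\tCarol\t70000"]

def parse(line):
    """PigStorage('\t')-style split into a typed tuple."""
    emp_no, name, salary = line.split("\t")
    return (emp_no, name, int(salary))

# In PySpark: rdd = sc.textFile(...).map(parse).filter(lambda r: r[2] > 50000)
records = [parse(l) for l in lines]
high_earners = [r for r in records if r[2] > 50000]
print(high_earners)  # [('1', 'Alice', 60000), ('3', 'Carol', 70000)]
```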
Re: Pig on Spark
I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork.

Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.
Re: Pig on Spark
There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off the schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group-by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:

I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork.

Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark?
We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts. -- ...:::Aniket:::... Quetzalco@tl
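[Editor's note: condensing the build-and-run steps Aniket describes above. The ant flag is from his message; the paths below are placeholders, and the full variable list lives in the pig-spark file he links.]

```shell
# Build Pig against Hadoop 2.x, as described in the message above.
ant -Dhadoopversion=23

# Then point Pig at the Spark assembly and the Pig jar
# (placeholder paths -- adjust to your install).
export SPARK_JAR=/path/to/spark-assembly.jar
export SPARK_PIG_JAR=/path/to/pig-withouthadoop.jar
export PIG_CLASSPATH=${SPARK_JAR}:${SPARK_PIG_JAR}
```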
RE: Pig on Spark
Hi Aniket,

Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com
To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on YARN at https://github.com/aniket486/pig. (So, compile Pig with ant -Dhadoopversion=23.) You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find out what sort of env variables you need (sorry, I haven't been able to clean this up; in progress). There are a few known issues with this; I will work on fixing them soon.

Known issues:
1. Limit does not work (spork-fix)
2. Foreach requires turning off the schema-tuple-backend (should be a Pig JIRA)
3. Algebraic UDFs don't work (spork-fix in progress)
4. Group-by rework (to avoid OOMs)
5. UDF classloader issue (requires SPARK-1053; then you can put pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF jars)

~Aniket

On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:

I had asked a similar question on the dev mailing list a while back (Jan 22nd). See the archives: http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look for spork.

Basically Matei said: Yup, that was it, though I believe people at Twitter picked it up again recently. I'd suggest asking Dmitriy if you know him. I've seen interest in this from several other groups, and if there's enough of it, maybe we can start another open source repo to track it. The work in that repo you pointed to was done over one week and already had most of Pig's operators working. (I helped out with this prototype over Twitter's hack week.) That work also calls the Scala API directly, because it was done before we had a Java API; it should be easier with the Java one.

Tom

On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:

Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantly benefit by using Spark, but we would like to keep using the existing Pig scripts.

--
...:::Aniket:::... Quetzalco@tl
Re: PIG to SPARK
The real question is why you want to run a Pig script using Spark. Are you planning to use Spark as the underlying processing engine for Pig? That's not simple. Are you planning to feed Pig data to Spark for further processing? Then you can write it to HDFS and trigger your Spark script. rdd.pipe is basically similar to Hadoop streaming, allowing you to run a script on each partition of the RDD and get the output as another RDD.

Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi

On Wed, Mar 5, 2014 at 10:29 AM, suman bharadwaj suman@gmail.com wrote:

Hi,
How can I call a Pig script using Spark? Can I use rdd.pipe() here? And can anyone share a sample implementation of rdd.pipe(), and if you can explain how rdd.pipe() works, it would really help.

Regards,
SB