Re: PIG to Spark

2018-01-09 Thread Gourav Sengupta
I may be wrong here, but when I look at Apache Pig on GitHub, it says that
there are 8 contributors, and when I look at Apache Spark on GitHub it
says there are more than 1000 contributors. And if the above is true, I ask
myself: why not shift to Spark by learning it?

I also started with nightmarish Java map-reduce coding, and then Hive and
then Pig, and then realised that the best time spent is the time used for
actually solving data problems rather than programming problems. I know
consultants who have ended up convincing their clients that it's better to
write Java programs than use Spark SQL, and who then spent close to 2.5
years not being able to deliver anything that works, where the actual
project was just a single Spark SQL statement.
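To make that last point concrete, here is a toy sketch in plain Python, with the stdlib sqlite3 module standing in for a Spark SQL table (the table and column names are invented for illustration; on Spark the same statement would be submitted through spark.sql(...)):

```python
import sqlite3

# Toy stand-in for a Spark SQL table, using Python's stdlib sqlite3.
# Table and column names are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# The entire "project": a single declarative statement that replaces
# pages of hand-written map-reduce Java.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 15.0), ('west', 7.0)]
```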

Personally I prefer to learn and adapt and transfer existing code to a
platform that gives me the maximum business benefit with least headaches.

But once again it is just a matter of opinion.


Regards,
Gourav Sengupta

On Mon, Jan 8, 2018 at 3:25 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:

> Hi
>
> Is there a convenient way / open-source project to convert Pig scripts to
> Spark?
>
>
> Regards
> Pralabh Kumar
>


Re: PIG to Spark

2018-01-08 Thread Jeff Zhang
Pig supports the Spark engine now, so you can leverage Spark execution with
your Pig script.

I am afraid there's no solution to convert a Pig script into Spark API code.
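For context on what running a Pig script on the Spark engine looks like in practice: Pig 0.17 was the first release to ship Spark as an execution engine, and an unmodified script can be pointed at it from the command line. A sketch, assuming Pig 0.17+ with Spark available (the paths and master setting below are placeholders):

```shell
# Assumes Pig 0.17+ (the first release with the Spark engine) and a
# local Spark install; the paths below are placeholders.
export SPARK_HOME=/opt/spark
export SPARK_MASTER=local        # or a spark://... / yarn master
pig -x spark myscript.pig        # run the unchanged Pig script on Spark
```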





Pralabh Kumar <pralabhku...@gmail.com> wrote on Mon, Jan 8, 2018 at 11:25 PM:

> Hi
>
> Is there a convenient way / open-source project to convert Pig scripts to
> Spark?
>
>
> Regards
> Pralabh Kumar
>


PIG to Spark

2018-01-08 Thread Pralabh Kumar
Hi

Is there a convenient way / open-source project to convert Pig scripts to
Spark?


Regards
Pralabh Kumar


queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
>
> Hi,


As a beginner, I have the below queries on Spork (Pig on Spark).
I have cloned it: git clone https://github.com/apache/pig -b spark
1. On which version of Pig and Spark is Spork being built?
2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059
and tried to run a simple Pig script that just loads a file and dumps/stores it.
Getting these errors:

>
grunt> A = load '/tmp/words_tb.txt' using PigStorage('\t') as
(empNo:chararray,empName:chararray,salary:chararray);
grunt> Store A into
'/tmp/spork';

2015-11-25 05:35:52,502 [main] INFO
org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
script: UNKNOWN
2015-11-25 05:35:52,875 [main] WARN  org.apache.pig.data.SchemaTupleBackend
- SchemaTupleBackend has already been initialized
2015-11-25 05:35:52,883 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - Not MR
mode. RollupHIIOptimizer is disabled
2015-11-25 05:35:52,894 [main] INFO
org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter,
MergeFilter, MergeForEach, PartitionFilterOptimizer,
PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter,
SplitFilter, StreamTypeCastInserter]}
2015-11-25 05:35:52,966 [main] INFO  org.apache.pig.data.SchemaTupleBackend
- Key [pig.schematuple] was not set... will not generate code.
2015-11-25 05:35:52,983 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - add
Files Spark Job
2015-11-25 05:35:53,137 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - Added
jar pig-0.15.0-SNAPSHOT-core-h2.jar
2015-11-25 05:35:53,138 [main] INFO
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher -
Converting operator POLoad (Name: A:
Load(/tmp/words_tb.txt:PigStorage(' ')) - scope-29 Operator Key: scope-29)
2015-11-25 05:35:53,205 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
Details at logfile: /home/pig/pig_1448425672112.log


Can you please help point out what's wrong?

Appreciate your help .

Thanks,

Regards,

Divya


Re: queries on Spork (Pig on Spark)

2015-11-24 Thread Divya Gehlot
Log files content :
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
 at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
 at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:1094)
 at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:91)
 at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:61)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:666)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:585)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:534)
 at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:209)
 at
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
 at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
 at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
 at org.apache.pig.PigServer.storeEx(PigServer.java:1034)
 at org.apache.pig.PigServer.store(PigServer.java:997)
 at org.apache.pig.PigServer.openIterator(PigServer.java:910)
 at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
 at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
 at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
 at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
 at org.apache.pig.Main.run(Main.java:558)
 at org.apache.pig.Main.main(Main.java:170)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


I didn't understand the problem behind the error.

Thanks,
Regards,
Divya

On 25 November 2015 at 14:00, Jeff Zhang <zjf...@gmail.com> wrote:

> >>> Details at logfile: /home/pig/pig_1448425672112.log
>
> You need to check the log file for details
>
>
>
>
> On Wed, Nov 25, 2015 at 1:57 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> As a beginner, I have the below queries on Spork (Pig on Spark).
>> I have cloned it: git clone https://github.com/apache/pig -b spark
>> 1. On which version of Pig and Spark is Spork being built?
>> 2. I followed the steps mentioned in https://issues.apache.org/jira/browse/PIG-4059
>> and tried to run a simple Pig script that just loads the
>> file and dumps/stores it.
>> Getting these errors:
>>
>>>
>> grunt> A = load '/tmp/words_tb.txt' using PigStorage('\t') as
>> (empNo:chararray,empName:chararray,salary:chararray);
>> grunt> Store A into
>> '/tmp/spork';
>>
>> 2015-11-25 05:35:52,502 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the
>> script: UNKNOWN
>> 2015-11-25 05:35:52,875 [main] WARN
>> org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already
>> been initialized
>> 2015-11-25 05:35:52,883 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - Not MR
>> mode. RollupHIIOptimizer is disabled
>> 2015-11-25 05:35:52,894 [main] INFO
>> org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer -
>> {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator,
>> GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter,
>> MergeFilter, MergeForEach, PartitionFilterOptimizer,
>> PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter,
>> SplitFilter, StreamTypeCastInserter]}
>> 2015-11-25 05:35:52,966 [main] INFO
>> org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not
>> set... will not generate code.
>> 2015-11-25 05:35:52,983 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher - add
>> Files Spark Job
>> 2015-11-25 05:35:53,137 [main] INFO
>> 

Re: Update on Pig on Spark initiative

2014-08-28 Thread Russell Jurney
This is really exciting! Thanks so much for this work, I think you've
guaranteed Pig's continued vitality.

On Wednesday, August 27, 2014, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Awesome to hear this, Mayur! Thanks for putting this together.

 Matei

 On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote:

 Hi,
 We have migrated Pig functionality on top of Spark, passing 100% e2e for
 success cases in the Pig test suite. That means UDFs, Joins & other
 functionality are working quite nicely. We are in the process of merging
 with Apache Pig trunk (something that should happen over the next 2 weeks).
 Meanwhile, if you are interested in giving it a go, you can try it at
 https://github.com/sigmoidanalytics/spork
 This contains all the major changes but may not have all the patches
 required for 100% e2e; if you are trying it out, let me know any issues you
 face.

 A whole bunch of folks contributed to this:

 Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das
 (Sigmoid Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal
 Banga (Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket
 Mokashi  (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid
 Analytics), Mahesh Kalakoti (Sigmoid Analytics)

 Not to mention the Spark & Pig communities.

 Regards
  Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
  @mayur_rustagi https://twitter.com/mayur_rustagi



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
Hi,
We have migrated Pig functionality on top of Spark, passing 100% e2e for
success cases in the Pig test suite. That means UDFs, Joins & other
functionality are working quite nicely. We are in the process of merging
with Apache Pig trunk (something that should happen over the next 2 weeks).
Meanwhile, if you are interested in giving it a go, you can try it at
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches
required for 100% e2e; if you are trying it out, let me know any issues you
face.

A whole bunch of folks contributed to this:

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi
 (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics),
Mahesh Kalakoti (Sigmoid Analytics)

Not to mention the Spark & Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together.

Matei

On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) 
wrote:

Hi,
We have migrated Pig functionality on top of Spark, passing 100% e2e for success
cases in the Pig test suite. That means UDFs, Joins & other functionality are working
quite nicely. We are in the process of merging with Apache Pig trunk (something
that should happen over the next 2 weeks).
Meanwhile, if you are interested in giving it a go, you can try it at
https://github.com/sigmoidanalytics/spork
This contains all the major changes but may not have all the patches required
for 100% e2e; if you are trying it out, let me know any issues you face.

A whole bunch of folks contributed to this:

Julien Le Dem (Twitter),  Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid 
Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga 
(Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics),  Aniket Mokashi  
(Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh 
Kalakoti (Sigmoid Analytics)

Not to mention the Spark & Pig communities.

Regards
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Akhil Das
Hi Bertrand,

We've updated the document
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0

This is our working Github repo
https://github.com/sigmoidanalytics/spork/tree/spork-0.9

Feel free to open issues over here
https://github.com/sigmoidanalytics/spork/issues

Thanks
Best Regards


On Tue, Jul 8, 2014 at 2:33 PM, Bertrand Dechoux decho...@gmail.com wrote:

 @Mayur: I won't fight over the semantics of a fork but, at the moment, no:
 Spork does take the standard Pig as a dependency. On that, we should agree.

 As for my use of Pig, I have no limitation. I am however interested in seeing
 the rise of a 'no-SQL, high-level, non-programming language' for Spark.

 @Zhang: Could you elaborate on your reference to Twitter?


 Bertrand Dechoux


 On Tue, Jul 8, 2014 at 4:04 AM, 张包峰 pelickzh...@qq.com wrote:

 Hi guys, previously I checked out the old spork and updated it to
 Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1; see my GitHub project:
 https://github.com/pelick/flare-spork

 It is also highly experimental, and just directly maps Pig physical
 operators to Spark RDD transformations/actions. It works for simple
 requests. :)

 I am also interested in the progress of spork; is it continuing at
 Twitter in a non-open-source way?

 --
 Thanks
 Zhang Baofeng
 Blog http://blog.csdn.net/pelick | Github https://github.com/pelick
 | Weibo http://weibo.com/pelickzhang | LinkedIn
 http://www.linkedin.com/pub/zhang-baofeng/70/609/84




 -- Original Message --
 *From:* Mayur Rustagi <mayur.rust...@gmail.com>
 *Sent:* Monday, July 7, 2014, 11:55 PM
 *To:* user@spark.apache.org
 *Subject:* Re: Pig 0.13, Spark, Spork

 That version is old :).
 We are not forking Pig but cleanly separating out the Pig execution engine.
 Let me know if you are willing to give it a go.

 Also would love to know what features of Pig you are using?

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 I saw a wiki page from your company but with an old version of Spark.

 http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

 I have no reason to use it yet but I am interested in the state of the
 initiative.
 What's your point of view (personal and/or professional) about the Pig
 0.13 release?
 Is the pluggable execution engine flexible enough in order to avoid
 having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D

 As a (for now) external observer, I am glad to see competition in that
 space. It can only be good for the community in the end.

 Bertrand Dechoux


 On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Hi,
 We have fixed many major issues around Spork & are deploying it with some
 customers. Would be happy to provide a working version for you to try out.
 We are looking for more folks to try it out & submit bugs.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 Hi,

 I was wondering what was the state of the Pig+Spark initiative now
 that the execution engine of Pig is pluggable? Granted, it was done in
 order to use Tez but could it be used by Spark? I know about a
 'theoretical' project called Spork but I don't know any stable and
 maintained version of it.

 Regards

 Bertrand Dechoux








Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
Hi,

I was wondering what was the state of the Pig+Spark initiative now that the
execution engine of Pig is pluggable? Granted, it was done in order to use
Tez but could it be used by Spark? I know about a 'theoretical' project
called Spork but I don't know of any stable and maintained version of it.

Regards

Bertrand Dechoux


Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
Hi,
We have fixed many major issues around Spork & are deploying it with some
customers. Would be happy to provide a working version for you to try out.
We are looking for more folks to try it out & submit bugs.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com wrote:

 Hi,

 I was wondering what was the state of the Pig+Spark initiative now that
 the execution engine of Pig is pluggable? Granted, it was done in order to
 use Tez but could it be used by Spark? I know about a 'theoretical' project
 called Spork but I don't know any stable and maintained version of it.

 Regards

 Bertrand Dechoux



Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
That version is old :).
We are not forking Pig but cleanly separating out the Pig execution engine. Let
me know if you are willing to give it a go.

Also would love to know what features of Pig you are using?

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com wrote:

 I saw a wiki page from your company but with an old version of Spark.

 http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1

 I have no reason to use it yet but I am interested in the state of the
 initiative.
 What's your point of view (personal and/or professional) about the Pig
 0.13 release?
 Is the pluggable execution engine flexible enough in order to avoid having
 Spork as a fork of Pig? Pig + Spark + Fork = Spork :D

 As a (for now) external observer, I am glad to see competition in that
 space. It can only be good for the community in the end.

 Bertrand Dechoux


 On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com
 wrote:

 Hi,
 We have fixed many major issues around Spork & are deploying it with some
 customers. Would be happy to provide a working version for you to try out.
 We are looking for more folks to try it out & submit bugs.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com
 wrote:

 Hi,

 I was wondering what was the state of the Pig+Spark initiative now that
 the execution engine of Pig is pluggable? Granted, it was done in order to
 use Tez but could it be used by Spark? I know about a 'theoretical' project
 called Spork but I don't know any stable and maintained version of it.

 Regards

 Bertrand Dechoux






Re: Pig 0.13, Spark, Spork

2014-07-07 Thread 张包峰
Hi guys, previously I checked out the old spork and updated it to Hadoop 2.0,
Scala 2.10.3 and Spark 0.9.1; see my GitHub project:
https://github.com/pelick/flare-spork


It is also highly experimental, and just directly maps Pig physical
operators to Spark RDD transformations/actions. It works for simple requests.
:)


I am also interested in the progress of spork; is it continuing at Twitter in
a non-open-source way?


--
Thanks
Zhang Baofeng
Blog | Github | Weibo | LinkedIn




 




-- Original Message --
From: Mayur Rustagi <mayur.rust...@gmail.com>
Sent: Monday, July 7, 2014, 11:55 PM
To: user@spark.apache.org

Subject: Re: Pig 0.13, Spark, Spork



That version is old :). We are not forking Pig but cleanly separating out the
Pig execution engine. Let me know if you are willing to give it a go.


Also would love to know what features of Pig you are using?
 


Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com
 @mayur_rustagi





 

On Mon, Jul 7, 2014 at 8:46 PM, Bertrand Dechoux decho...@gmail.com wrote:
 I saw a wiki page from your company but with an old version of Spark:
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
 


I have no reason to use it yet but I am interested in the state of the 
initiative.
What's your point of view (personal and/or professional) about the Pig 0.13 
release?
Is the pluggable execution engine flexible enough in order to avoid having 
Spork as a fork of Pig? Pig + Spark + Fork = Spork :D
 

As a (for now) external observer, I am glad to see competition in that space. 
It can only be good for the community in the end.

 Bertrand Dechoux
 

On Mon, Jul 7, 2014 at 5:00 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
 Hi,
We have fixed many major issues around Spork & are deploying it with some
customers. Would be happy to provide a working version for you to try out. We
are looking for more folks to try it out & submit bugs.
 

Regards
Mayur 


Mayur Rustagi
Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com
 @mayur_rustagi





 

On Mon, Jul 7, 2014 at 8:21 PM, Bertrand Dechoux decho...@gmail.com wrote:
 Hi,

I was wondering what was the state of the Pig+Spark initiative now that the 
execution engine of Pig is pluggable? Granted, it was done in order to use Tez 
but could it be used by Spark? I know about a 'theoretical' project called 
Spork but I don't know any stable and maintained version of it.
 

Regards

Bertrand Dechoux

Re: Pig on Spark

2014-04-25 Thread suman bharadwaj
Hey Mayur,

We use HiveColumnarLoader and XMLLoader. Are these working as well?

Will try a few things regarding porting the Java MR jobs.

Regards,
Suman Bharadwaj S


On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Right now UDFs are not working. It's at the top of the list though. You
 should be able to soon :)
 Are there any other Pig features you use often, apart from the usual
 suspects?

 Existing Java MR jobs would be an easier move. Are these Cascading jobs or
 single map-reduce jobs? If single, then you should be able to write a
 Scala wrapper to call the map & reduce functions with some magic &
 let your core code be. Would be interesting to see an actual example & get
 it to work.
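The wrapper idea above, sketched in plain Python rather than Scala for brevity (the function names are hypothetical): keep the existing per-record map and reduce logic untouched and drive it from new engine code.

```python
from itertools import groupby

def legacy_map(record):
    # Existing MR map logic, unchanged: emit (key, value) per record.
    return (record.lower(), 1)

def legacy_reduce(key, values):
    # Existing MR reduce logic, unchanged: combine values for one key.
    return (key, sum(values))

def run(records):
    # The "wrapper": sort stands in for the shuffle, groupby for the
    # per-key reduce call; the legacy functions stay as-is.
    mapped = sorted(legacy_map(r) for r in records)
    return [legacy_reduce(k, [v for _, v in group])
            for k, group in groupby(mapped, key=lambda kv: kv[0])]

print(run(["Spark", "pig", "spark"]))  # [('pig', 1), ('spark', 2)]
```

The same shape carries over to Spark: the engine supplies the shuffle and grouping, while the legacy map/reduce functions are called inside its transformations.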

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj suman@gmail.com wrote:

 We are currently in the process of converting Pig and Java map-reduce
 jobs to Spark jobs. And we have written a couple of Pig UDFs as well. Hence
 I was checking if we can leverage Spork without converting to Spark jobs.

 And is there any way I can port my existing Java MR jobs to Spark?
 I know this thread has a different subject; let me know if I need to ask
 this question in a separate thread.

 Thanks in advance.


 On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 UDF,
 Generate,
 & many many more are not working :)

 Several of them work: Joins, filters, group by, etc.
 I am translating the ones we need; would be happy to get help on others.
 Will host a JIRA to track them if you are interested.


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.com wrote:

 Are all the features available in Pig working in Spork? Like, for example,
 UDFs?

 Thanks.


 On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com
  wrote:

 There are two benefits I get as of now:
 1. Most of the time a lot of customers don't want the full power but
 they want something dead simple with which they can do a DSL. They end up
 using Hive for a lot of ETL just because it's SQL & they understand it. Pig
 is close & wraps a lot of framework-level semantics away from the user &
 lets him focus on data flow.
 2. Some have codebases in Pig already & are just looking to do it
 faster. I am yet to benchmark that on Pig on Spark.

 I agree that Pig on Spark cannot solve a lot of problems, but it can solve
 some without forcing the end customer to do anything even close to coding;
 I believe there is quite some value in making Spark accessible to a larger
 group of audience.
 End of the day, to each his own :)

 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi 
 mundlap...@gmail.com wrote:

 This seems like an interesting question.

 I love Apache Pig. It is so natural and the language flows with nice
 syntax.

 While I was at Yahoo! in core Hadoop Engineering, I used Pig a
 lot for analytics and provided feedback to the Pig team to add much more
 functionality when it was at version 0.7. Lots of new functionality is
 offered now.
 End of the day, Pig is a DSL for data flows. There will always be
 gaps and enhancements. I often thought: is a DSL the right way to solve data
 flow problems? Maybe not; we need complete language constructs. We may have
 found the answer - Scala. With Scala's dynamic compilation, we can write
 much more powerful constructs than any DSL can provide.

 If I am a new organization and beginning to choose, I would go with
 Scala.

 Here is the example:

 #!/bin/sh
 exec scala "$0" "$@"
 !#
 YOUR DSL GOES HERE BUT IN SCALA!

 You have DSL-like scripting, functional style, and complete language power!
 If we can improve the first 3 lines, here you go: you have the most powerful
 DSL to solve data problems.
 solve data problems.

 -Bharath





 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi Mayur,
  We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
  the goal is to get Spork set up next month. I will keep you posted. Can
  you please keep me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this? My team is also trying it out & would
  love to know some details of your progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM

Re: Pig on Spark

2014-04-25 Thread Mark Baker
I've only had a quick look at Pig, but it seems that a declarative
layer on top of Spark couldn't be anything other than a big win, as it
allows developers to declare *what* they want, permitting the compiler
to determine how best to poke at the RDD API to implement it.

In my brief time with Spark, I've often thought that it feels very
unnatural to use imperative code to declare a pipeline.


Re: Pig on Spark

2014-04-25 Thread Eugen Cepoi
It depends; personally I have the opposite opinion.

IMO expressing pipelines in a functional language feels natural, you just
have to get used to the language (Scala).

Testing Spark jobs is easy, whereas testing a Pig script is much harder and
not natural.

If you want a more high-level language that deals with RDDs for you, you
can use Spark SQL:
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Of course you can express fewer things this way, but if you have some
complex logic I think it would make sense to write a classic Spark job that
would be more robust in the long term.


2014-04-25 15:30 GMT+02:00 Mark Baker dist...@acm.org:

 I've only had a quick look at Pig, but it seems that a declarative
 layer on top of Spark couldn't be anything other than a big win, as it
 allows developers to declare *what* they want, permitting the compiler
 to determine how best poke at the RDD API to implement it.

 In my brief time with Spark, I've often thought that it feels very
 unnatural to use imperative code to declare a pipeline.



Re: Pig on Spark

2014-04-25 Thread Bharath Mundlapudi
 I've only had a quick look at Pig, but it seems that a declarative
 layer on top of Spark couldn't be anything other than a big win, as it
 allows developers to declare *what* they want, permitting the compiler
 to determine how best poke at the RDD API to implement it.

The devil is in the details - allowing developers to declare *what* they
want - seems not practical in a declarative world, since we are bound by the
DSL constructs. The workaround, or rather hack, is to use UDFs to get full
language constructs. Some problems are hard; you will have to twist your mind
to solve them in a restrictive way. At that point we think: we wish we had
complete language power.

Having been in the Big Data world for a short time (7 years), I have seen
enough problems with Hive/Pig. All I am providing here is a thought to spark
the Spark community to think beyond declarative constructs.

I am sure there is a place for Pig and Hive.

-Bharath




On Fri, Apr 25, 2014 at 10:21 AM, Michael Armbrust mich...@databricks.com wrote:

 On Fri, Apr 25, 2014 at 6:30 AM, Mark Baker dist...@acm.org wrote:

 I've only had a quick look at Pig, but it seems that a declarative
 layer on top of Spark couldn't be anything other than a big win, as it
 allows developers to declare *what* they want, permitting the compiler
 to determine how best poke at the RDD API to implement it.


 Having Pig too would certainly be a win, but Spark SQL
 (http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html)
 is also a declarative layer on top of Spark.  Since the optimization is
 lazy, you can chain multiple SQL statements in a row and still optimize
 them holistically (similar to a Pig job).  Alpha version coming soon to a
 Spark 1.0 release near you!

 Spark SQL also lets you drop back into functional Scala when that is more
 natural for a particular task.
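The lazy-optimization point above can be seen with a toy pipeline in plain Python (the class and method names are invented for illustration; Spark SQL does this with a real query plan and its Catalyst optimizer): nothing executes until collect(), and because the whole chain is recorded first, an engine is free to inspect and rewrite it before running.

```python
class LazyPipeline:
    """Records transformations instead of running them (a toy query plan)."""
    def __init__(self, data, ops=()):
        self.data, self.ops = data, ops

    def map(self, f):
        # No work happens here: we only append to the recorded plan.
        return LazyPipeline(self.data, self.ops + (("map", f),))

    def filter(self, p):
        return LazyPipeline(self.data, self.ops + (("filter", p),))

    def collect(self):
        # Only now does anything run; at this point the full ops tuple is
        # visible, which is what lets a real engine optimize holistically.
        out = []
        for x in self.data:
            keep = True
            for kind, f in self.ops:
                if kind == "filter":
                    if not f(x):
                        keep = False
                        break
                else:  # "map"
                    x = f(x)
            if keep:
                out.append(x)
        return out

p = LazyPipeline(range(6)).filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(p.collect())  # [0, 20, 40]
```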



Re: Pig on Spark

2014-04-23 Thread lalit1303
Hi,

We got Spork working on Spark 0.9.0.
Repository available at:
https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix

Please send us your feedback.



-
Lalit Yadav
la...@sigmoidanalytics.com
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p4668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power but they
want something dead simple with which they can do a DSL. They end up using
Hive for a lot of ETL just because it's SQL & they understand it. Pig is close
& wraps a lot of framework-level semantics away from the user & lets him
focus on data flow.
2. Some have codebases in Pig already & are just looking to do it faster. I
am yet to benchmark that on Pig on Spark.

I agree that Pig on Spark cannot solve a lot of problems, but it can solve some
without forcing the end customer to do anything even close to coding; I
believe there is quite some value in making Spark accessible to a larger group
of audience.
End of the day, to each his own :)

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi mundlap...@gmail.comwrote:

 This seems like an interesting question.

 I love Apache Pig. It is so natural and the language flows with nice
 syntax.

 While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot
 for analytics and provided feedback to the Pig team on much-needed
 functionality when it was at version 0.7. Lots of new functionality is
 offered now.
 End of the day, Pig is a DSL for data flows. There will always be gaps and
 enhancements. I often wondered: is a DSL the right way to solve data flow
 problems? Maybe not; we need a complete language construct. We may have
 found the answer - Scala. With Scala's dynamic compilation, we can write
 much more powerful constructs than any DSL can provide.

 If I am a new organization and beginning to choose, I would go with Scala.

 Here is the example:

 #!/bin/sh
 exec scala "$0" "$@"
 !#
 YOUR DSL GOES HERE BUT IN SCALA!

 You have DSL-like scripting with functional and complete language power! If
 we can improve the first 3 lines, there you go: you have the most powerful
 DSL to solve data problems.

 -Bharath





 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote:
  Hi Mayur,
  We are planning to upgrade our distribution MR1 MR2 (YARN) and the
 goal is
  to get SPROK set up next month. I will keep you posted. Can you please
 keep
  me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this. My team is also trying it out would
 love
  to know some detail so progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:
 
  Hi Aniket,
  Many thanks! I will check this out.
 
  
  Date: Thu, 6 Mar 2014 13:46:50 -0800
  Subject: Re: Pig on Spark
  From: aniket...@gmail.com
  To: user@spark.apache.org; tgraves...@yahoo.com
 
 
  There is some work to make this work on yarn at
  https://github.com/aniket486/pig. (So, compile pig with ant
  -Dhadoopversion=23)
 
  You can look at https://github.com/aniket486/pig/blob/spork/pig-sparkto
  find out what sort of env variables you need (sorry, I haven't been
 able to
  clean this up- in-progress). There are few known issues with this, I
 will
  work on fixing them soon.
 
  Known issues-
  1. Limit does not work (spork-fix)
  2. Foreach requires to turn off schema-tuple-backend (should be a
 pig-jira)
  3. Algebraic udfs dont work (spork-fix in-progress)
  4. Group by rework (to avoid OOMs)
  5. UDF Classloader issue (requires SPARK-1053, then you can put
  pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
 
  ~Aniket
 
 
 
 
  On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com
 wrote:
 
  I had asked a similar question on the dev mailing list a while back (Jan
  22nd).
 
  See the archives:
  http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser-
  look for spork.
 
  Basically Matei said:
 
  Yup, that was it, though I believe people at Twitter picked it up again
  recently. I'd suggest
  asking Dmitriy if you know him. I've seen interest in this from several
  other groups, and
  if there's enough of it, maybe we can start another open source repo to
  track it. The work
  in that repo you pointed to was done over one week, and already had
 most of
  Pig's operators
  working. (I helped out with this prototype over Twitter's hack week.)
 That
  work also calls
  the Scala API directly, because it was done before we had a Java API; it
  should be easier
  with the Java one.
 
 
  Tom
 
 
 
  On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi everyone,
 
  We are using to Pig

Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
Are all the features available in Pig working in Spork, e.g. UDFs?

Thanks.



Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
UDF, Generate & many more are not working :)

Several of them do work: joins, filters, group by, etc.
I am translating the ones we need and would be happy to get help on others.
Will host a JIRA to track them if you are interested.


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi




Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
We are currently in the process of converting Pig and Java map-reduce jobs
to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was
checking if we can leverage Spork without converting to Spark jobs.

Also, is there any way I can port my existing Java MR jobs to Spark?
I know this thread has a different subject; let me know if I need to ask
this question in a separate thread.

Thanks in advance.
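For reference, the kind of hand translation being discussed — a small Pig pipeline and a rough Spark (Scala API) equivalent — could look like this sketch (illustrative file and field names, not from any real job):

```scala
// Pig version of the data flow, for comparison:
//   logs  = LOAD 'logs.tsv' AS (user:chararray, bytes:long);
//   byUsr = GROUP logs BY user;
//   sums  = FOREACH byUsr GENERATE group, SUM(logs.bytes);
//   STORE sums INTO 'out';
// A rough Spark equivalent of the same GROUP BY + SUM pipeline:
import org.apache.spark.{SparkConf, SparkContext}

object PigToSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pig-to-spark"))
    sc.textFile("logs.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1).toLong)) // (user, bytes)
      .reduceByKey(_ + _)            // SUM(bytes) per GROUP BY user
      .saveAsTextFile("out")
  }
}
```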



Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
Right now UDFs are not working. They're at the top of the list though; you
should be able to soon :)
Is there any other Pig functionality you use often, apart from the usual
suspects?

Existing Java MR jobs would be an easier move. Are these Cascading jobs or
single map-reduce jobs? If single, then you should be able to write a
Scala wrapper to call the map & reduce functions with some magic & leave
your core code as-is. Would be interesting to see an actual example & get
it to work.
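A hedged sketch of that wrapper idea, assuming the per-record logic has already been factored out of the old Mapper/Reducer into plain functions (the `WordCountLogic` object below is hypothetical; only the driver is rewritten):

```scala
// Sketch: reusing existing map/reduce logic from a Spark driver.
// WordCountLogic stands in for the core code the old Java Mapper and
// Reducer delegated to; Spark just calls the same functions.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountLogic {
  // Body of the old map(): one input line -> (word, 1) pairs.
  def mapRecord(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1))

  // Body of the old reduce(): combine counts for a key.
  def reduceValues(a: Int, b: Int): Int = a + b
}

object MrWrapper {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mr-wrapper"))
    sc.textFile(args(0))
      .flatMap(WordCountLogic.mapRecord)        // was the map phase
      .reduceByKey(WordCountLogic.reduceValues) // was the reduce phase
      .saveAsTextFile(args(1))
  }
}
```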

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi




Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur,

I wondered if you could share your findings in some way (GitHub, blog post,
etc.). I guess your experience will be very interesting/useful for many
people.

sent from Lenovo YogaTablet
On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Hi Ankit,
 Thanx for all the work on Pig.
 Finally got it working. Couple of high level bugs right now:

- Getting it working on Spark 0.9.0
- Getting UDF working
- Getting generate functionality working
- Exhaustive test suite on Spark on Pig

 are you maintaining a Jira somewhere?

 I am currently trying to deploy it on 0.9.0.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.comwrote:

 We will post fixes from our side at - https://github.com/twitter/pig.

 Top on our list are-
 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or
 0.9 spark).
 2. Support for algebraic udfs (this mitigates the group by oom problems).

 Would definitely love more contribution on this.

 Thanks,
 Aniket


 On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Damn, I am off to NY for Structure Conf. Would it be possible to meet
 anytime after 28th March?
 I am really interested in making it stable  production quality.

 Regards
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.comwrote:

 Hi Mayur,
 Are you going to the Pig meetup this afternoon?
 http://www.meetup.com/PigUser/events/160604192/
 Aniket and I will be there.
 We would be happy to chat about Pig-on-Spark



 On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi mayur.rust...@gmail.com
  wrote:

 Hi Lin,
 We are working on getting Pig on spark functional with 0.8.0, have you
 got it working on any spark version ?
 Also what all functionality works on it?
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi




Re: Pig on Spark

2014-04-10 Thread Mayur Rustagi
Bam !!!
http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi




Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit,
Thanx for all the work on Pig.
Finally got it working. Couple of high level bugs right now:

   - Getting it working on Spark 0.9.0
   - Getting UDF working
   - Getting generate functionality working
   - Exhaustive test suite for Pig on Spark

Are you maintaining a Jira somewhere?

I am currently trying to deploy it on 0.9.0.

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Fri, Mar 14, 2014 at 1:37 PM, Aniket Mokashi aniket...@gmail.com wrote:

 We will post fixes from our side at - https://github.com/twitter/pig.

 Top of our list are:
 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or
 0.9 Spark).
 2. Support for algebraic UDFs (this mitigates the group-by OOM problems).

 Would definitely love more contribution on this.

 Thanks,
 Aniket


 On Fri, Mar 14, 2014 at 12:29 PM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Damn, I am off to NY for Structure Conf. Would it be possible to meet
 anytime after 28th March?
 I am really interested in making it stable & production quality.

 Regards
 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Mar 14, 2014 at 11:53 AM, Julien Le Dem jul...@twitter.comwrote:

 Hi Mayur,
 Are you going to the Pig meetup this afternoon?
 http://www.meetup.com/PigUser/events/160604192/
 Aniket and I will be there.
 We would be happy to chat about Pig-on-Spark



 On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi 
 mayur.rust...@gmail.comwrote:

 Hi Lin,
 We are working on getting Pig on spark functional with 0.8.0, have you
 got it working on any spark version ?
 Also what all functionality works on it?
 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.comwrote:

 Hi Sameer,

 Lin (cc'ed) could also give you some updates about Pig on Spark
 development on her side.

 Best,
 Xiangrui

 On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi Mayur,
  We are planning to upgrade our distribution from MR1 to MR2 (YARN) and the
 goal is
  to get Spork set up next month. I will keep you posted. Can you
 please keep
  me informed about your progress as well.
 
  
  From: mayur.rust...@gmail.com
  Date: Mon, 10 Mar 2014 11:47:56 -0700
 
  Subject: Re: Pig on Spark
  To: user@spark.apache.org
 
 
  Hi Sameer,
  Did you make any progress on this. My team is also trying it out
 would love
  to know some detail so progress.
 
  Mayur Rustagi
  Ph: +1 (760) 203 3257
  http://www.sigmoidanalytics.com
  @mayur_rustagi
 
 
 
  On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com
 wrote:
 
  Hi Aniket,
  Many thanks! I will check this out.
 
  
  Date: Thu, 6 Mar 2014 13:46:50 -0800
  Subject: Re: Pig on Spark
  From: aniket...@gmail.com
  To: user@spark.apache.org; tgraves...@yahoo.com
 
 
  There is some work to make this work on yarn at
  https://github.com/aniket486/pig. (So, compile pig with ant
  -Dhadoopversion=23)
 
  You can look at
 https://github.com/aniket486/pig/blob/spork/pig-spark to
  find out what sort of env variables you need (sorry, I haven't been
 able to
  clean this up- in-progress). There are few known issues with this, I
 will
  work on fixing them soon.
 
  Known issues-
  1. Limit does not work (spork-fix)
  2. Foreach requires to turn off schema-tuple-backend (should be a
 pig-jira)
  3. Algebraic udfs dont work (spork-fix in-progress)
  4. Group by rework (to avoid OOMs)
  5. UDF Classloader issue (requires SPARK-1053, then you can put
  pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf
 jars)
 
  ~Aniket
 
 
 
 
  On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com
 wrote:
 
  I had asked a similar question on the dev mailing list a while back
 (Jan
  22nd).
 
  See the archives:
 
 http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser-
  look for spork.
 
  Basically Matei said:
 
  Yup, that was it, though I believe people at Twitter picked it up
 again
  recently. I'd suggest
  asking Dmitriy if you know him. I've seen interest in this from
 several
  other groups, and
  if there's enough of it, maybe we can start another open source repo
 to
  track it. The work
  in that repo you pointed to was done over one week, and already had
 most of
  Pig's operators
  working. (I helped out with this prototype over Twitter's hack
 week.) That
  work also calls
  the Scala API directly, because it was done before we had a Java
 API; it
  should be easier
  with the Java one.
 
 
  Tom
 
 
 
  On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com
 wrote:
  Hi everyone,
 
  We are using to Pig to build our data pipeline. I came across Spork
 -- Pig
  on Spark at: https://github.com

Re: Pig on Spark

2014-03-25 Thread lalit1303
Hi,

I have been following Aniket's spork github repository.
https://github.com/aniket486/pig
I have done all the changes mentioned in the recently modified pig-spark file.

I am using:
hadoop 2.0.5 alpha
spark-0.8.1-incubating
mesos 0.16.0

## PIG variables
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export SPARK_YARN_APP_JAR=/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.dump"
export SPARK_JAR=/home/ubuntu/spark/assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.0.5-alpha.jar
export SPARK_MASTER=yarn-client
export SPARK_HOME=/home/ubuntu/spark
export SPARK_JARS=/home/ubuntu/pig/contrib/piggybank/java/piggybank.jar
export PIG_CLASSPATH=${SPARK_JAR}:${SPARK_JARS}:/home/ubuntu/mesos/build/src/mesos-0.16.0.jar:/home/ubuntu/pig/pig-withouthadoop.jar
export SPARK_PIG_JAR=/home/ubuntu/pig/pig-withouthadoop.jar


This works fine in MapReduce and local mode. But while running in Spark
mode, I am facing the following error. The error comes after the job is
submitted and run on the YARN master.
Can you please tell me how to proceed?

### Error message

ERROR 2998: Unhandled internal error. class
org.apache.spark.util.InnerClosureFinder has interface
org.objectweb.asm.ClassVisitor as super class

java.lang.IncompatibleClassChangeError: class
org.apache.spark.util.InnerClosureFinder has interface
org.objectweb.asm.ClassVisitor as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:643)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at
org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:87)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
at org.apache.spark.SparkContext.clean(SparkContext.scala:970)
at org.apache.spark.rdd.RDD.map(RDD.scala:246)
at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:68)
at
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter.convert(LoadConverter.java:38)
at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:212)
at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:201)
at
org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:125)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1328)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1310)
at org.apache.pig.PigServer.storeEx(PigServer.java:993)
at org.apache.pig.PigServer.store(PigServer.java:957)
at org.apache.pig.PigServer.openIterator(PigServer.java:870)
at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:729)
at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:370)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:609)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-tp2367p3187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
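For reference, this IncompatibleClassChangeError is the classic symptom of two ASM versions colliding on the classpath: org.objectweb.asm.ClassVisitor was an interface in ASM 3.x but became an abstract class in ASM 4.x, which Spark's InnerClosureFinder extends. An older ASM bundled inside one of the jars on PIG_CLASSPATH (pig-withouthadoop.jar is a plausible suspect) would shadow the one Spark was built against. A hedged diagnostic sketch for finding which jars bundle the class — demonstrated here on synthetic jars, since the real jar paths are deployment-specific:

```python
import os
import tempfile
import zipfile

def jars_bundling(class_entry, jar_paths):
    """Return the jars whose zip listing contains the given class file entry."""
    return [jar for jar in jar_paths
            if class_entry in zipfile.ZipFile(jar).namelist()]

# Demo with two synthetic "jars" (a real run would scan the actual
# pig/spark/hadoop jars on PIG_CLASSPATH): only the first bundles ASM.
tmp = tempfile.mkdtemp()
pig_jar = os.path.join(tmp, "pig-withouthadoop.jar")
spark_jar = os.path.join(tmp, "spark-assembly.jar")
with zipfile.ZipFile(pig_jar, "w") as zf:
    zf.writestr("org/objectweb/asm/ClassVisitor.class", b"")
with zipfile.ZipFile(spark_jar, "w") as zf:
    zf.writestr("org/apache/spark/SparkContext.class", b"")

hits = jars_bundling("org/objectweb/asm/ClassVisitor.class",
                     [pig_jar, spark_jar])
print(hits)  # only the pig jar shows up
```

Once the offending jar is identified, excluding or relocating its ASM copy is the usual fix.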


RE: Pig on Spark

2014-03-10 Thread Sameer Tilak
Hi Mayur,
We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the
goal is to get Spork set up next month. I will keep you posted. Can you please
keep me informed about your progress as well?
From: mayur.rust...@gmail.com
Date: Mon, 10 Mar 2014 11:47:56 -0700
Subject: Re: Pig on Spark
To: user@spark.apache.org

Hi Sameer,
Did you make any progress on this? My team is also trying it out and would
love to know some details on progress.

Mayur Rustagi


Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi





On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com wrote:





Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com


To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on yarn at 
https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23)


You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find 
out what sort of env variables you need (sorry, I haven't been able to clean 
this up- in-progress). There are few known issues with this, I will work on 
fixing them soon.



Known issues-
1. Limit does not work (spork-fix)
2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira)
3. Algebraic udfs don't work (spork-fix in-progress)
4. Group by rework (to avoid OOMs)
5. UDF Classloader issue (requires SPARK-1053, then you can put
pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
~Aniket






On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:


I had asked a similar question on the dev mailing list a while back (Jan 22nd). 



See the archives: 
http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look 
for spork.






Basically Matei said:



Yup, that was it, though I believe people at Twitter picked it up again
recently. I'd suggest asking Dmitriy if you know him. I've seen interest in
this from several other groups, and if there's enough of it, maybe we can
start another open source repo to track it. The work in that repo you pointed
to was done over one week, and already had most of Pig's operators working.
(I helped out with this prototype over Twitter's hack week.) That work also
calls the Scala API directly, because it was done before we had a Java API;
it should be easier with the Java one.
Tom 
 
 


On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:






Hi everyone,



We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active.


Can someone please let me know the status of Spork or any other effort that 
will let us run Pig on Spark? We can significantly benefit by using Spark, but 
we would like to keep using the existing Pig scripts.   
   





  

-- 
...:::Aniket:::... Quetzalco@tl
  

  

Re: PIG to SPARK

2014-03-06 Thread suman bharadwaj
Thanks Mayur. I don't have a clear idea of how pipe works and wanted to
understand more about it. When do we use pipe(), and how does it work? Can
you please share some sample code if you have any (even pseudo-code is
fine)? It will really help.

Regards,
Suman Bharadwaj S


On Thu, Mar 6, 2014 at 3:46 AM, Mayur Rustagi mayur.rust...@gmail.comwrote:

 The real question is why you want to run a Pig script using Spark.
 Are you planning to use Spark as the underlying processing engine for Pig?
 That's not simple.
 Are you planning to feed Pig output to Spark for further processing? Then
 you can write it to HDFS and trigger your Spark script.

 rdd.pipe is basically similar to Hadoop streaming, allowing you to run a
 script on each partition of the RDD and get output as another RDD.
 Regards
 Mayur


 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Wed, Mar 5, 2014 at 10:29 AM, suman bharadwaj suman@gmail.comwrote:

 Hi,

 How can I call a Pig script using Spark? Can I use rdd.pipe() here?

 And can anyone share a sample implementation of rdd.pipe()? If you can
 explain how rdd.pipe() works, it would really help.

 Regards,
 SB





Pig on Spark

2014-03-06 Thread Sameer Tilak
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active.
Can someone please let me know the status of Spork or any other effort that
will let us run Pig on Spark? We can significantly benefit by using Spark, but
we would like to keep using the existing Pig scripts.
 

Re: Pig on Spark

2014-03-06 Thread Tom Graves
I had asked a similar question on the dev mailing list a while back (Jan 22nd). 

See the archives: 
http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look 
for spork.

Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again
recently. I'd suggest asking Dmitriy if you know him. I've seen interest in
this from several other groups, and if there's enough of it, maybe we can
start another open source repo to track it. The work in that repo you pointed
to was done over one week, and already had most of Pig's operators working.
(I helped out with this prototype over Twitter's hack week.) That work also
calls the Scala API directly, because it was done before we had a Java API;
it should be easier with the Java one.

Tom



On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:
 
 
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active.

Can someone please let me know the status of Spork or any other effort that 
will let us run Pig on Spark? We can significantly benefit by using Spark, but 
we would like to keep using the existing Pig scripts.  

Re: Pig on Spark

2014-03-06 Thread Aniket Mokashi
There is some work to make this work on yarn at
https://github.com/aniket486/pig. (So, compile pig with ant
-Dhadoopversion=23)

You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
find out what sort of env variables you need (sorry, I haven't been able to
clean this up- in-progress). There are few known issues with this, I will
work on fixing them soon.

Known issues-
1. Limit does not work (spork-fix)
2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira)
3. Algebraic udfs don't work (spork-fix in-progress)
4. Group by rework (to avoid OOMs)
5. UDF Classloader issue (requires SPARK-1053, then you can put
pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)

~Aniket
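A side note on issue 2 above: the schema-tuple backend is toggled by a Pig property. Assuming the standard property name for Pig's SchemaTuple feature (worth verifying against your Pig build), disabling it at the top of a script would look something like:

```pig
-- Sketch: disable the SchemaTuple backend before the FOREACH.
-- The property name is an assumption based on Pig's SchemaTuple feature;
-- confirm it against your Pig version before relying on it.
set pig.schematuple false;
```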




On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:

 I had asked a similar question on the dev mailing list a while back (Jan
 22nd).

 See the archives:
 http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser -
 look for spork.

 Basically Matei said:

 Yup, that was it, though I believe people at Twitter picked it up again 
 recently. I'd suggest
 asking Dmitriy if you know him. I've seen interest in this from several other 
 groups, and
 if there's enough of it, maybe we can start another open source repo to track 
 it. The work
 in that repo you pointed to was done over one week, and already had most of 
 Pig's operators
 working. (I helped out with this prototype over Twitter's hack week.) That 
 work also calls
 the Scala API directly, because it was done before we had a Java API; it 
 should be easier
 with the Java one.


 Tom



   On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com
 wrote:
   Hi everyone,

 We are using Pig to build our data pipeline. I came across Spork -- Pig
 on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still
 active.

 Can someone please let me know the status of Spork or any other effort
 that will let us run Pig on Spark? We can significantly benefit by using
 Spark, but we would like to keep using the existing Pig scripts.





-- 
...:::Aniket:::... Quetzalco@tl


RE: Pig on Spark

2014-03-06 Thread Sameer Tilak
Hi Aniket,
Many thanks! I will check this out.

Date: Thu, 6 Mar 2014 13:46:50 -0800
Subject: Re: Pig on Spark
From: aniket...@gmail.com
To: user@spark.apache.org; tgraves...@yahoo.com

There is some work to make this work on yarn at 
https://github.com/aniket486/pig. (So, compile pig with ant -Dhadoopversion=23)
You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to find 
out what sort of env variables you need (sorry, I haven't been able to clean 
this up- in-progress). There are few known issues with this, I will work on 
fixing them soon.

Known issues-
1. Limit does not work (spork-fix)
2. Foreach requires to turn off schema-tuple-backend (should be a pig-jira)
3. Algebraic udfs don't work (spork-fix in-progress)
4. Group by rework (to avoid OOMs)
5. UDF Classloader issue (requires SPARK-1053, then you can put
pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
~Aniket




On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves tgraves...@yahoo.com wrote:


I had asked a similar question on the dev mailing list a while back (Jan 22nd). 

See the archives: 
http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser - look 
for spork.


Basically Matei said:

Yup, that was it, though I believe people at Twitter picked it up again
recently. I'd suggest asking Dmitriy if you know him. I've seen interest in
this from several other groups, and if there's enough of it, maybe we can
start another open source repo to track it. The work in that repo you pointed
to was done over one week, and already had most of Pig's operators working.
(I helped out with this prototype over Twitter's hack week.) That work also
calls the Scala API directly, because it was done before we had a Java API;
it should be easier with the Java one.
Tom 
 
 
On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:




Hi everyone,

We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active.
Can someone please let me know the status of Spork or any other effort that 
will let us run Pig on Spark? We can significantly benefit by using Spark, but 
we would like to keep using the existing Pig scripts.   
   



  

-- 
...:::Aniket:::... Quetzalco@tl
  

Re: PIG to SPARK

2014-03-05 Thread Mayur Rustagi
The real question is why you want to run a Pig script using Spark.
Are you planning to use Spark as the underlying processing engine for Pig?
That's not simple.
Are you planning to feed Pig output to Spark for further processing? Then you
can write it to HDFS and trigger your Spark script.

rdd.pipe is basically similar to Hadoop streaming, allowing you to run a
script on each partition of the RDD and get output as another RDD.
Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Mar 5, 2014 at 10:29 AM, suman bharadwaj suman@gmail.comwrote:

 Hi,

 How can I call a Pig script using Spark? Can I use rdd.pipe() here?

 And can anyone share a sample implementation of rdd.pipe()? If you can
 explain how rdd.pipe() works, it would really help.

 Regards,
 SB
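To illustrate the rdd.pipe semantics described in this thread — each partition's elements are fed, one per line, to an external command's stdin, and the command's stdout lines become elements of the resulting RDD — here is a minimal sketch in plain Python. Partitions are simulated as lists so it runs without a cluster; with a real SparkContext the equivalent call would be along the lines of sc.parallelize(data).pipe(cmd).

```python
import subprocess

def pipe_partition(partition, cmd):
    """Mimic RDD.pipe on one partition: feed elements to cmd's stdin
    (one per line) and return cmd's stdout lines as the new elements."""
    proc = subprocess.run(
        cmd, shell=True, text=True, capture_output=True,
        input="".join(str(x) + "\n" for x in partition),
    )
    return proc.stdout.splitlines()

# Two "partitions" of an RDD of words; pipe each through a shell command
# (here: tr, to upper-case), much as Hadoop streaming would.
partitions = [["spark", "pig"], ["hadoop"]]
result = [pipe_partition(p, "tr 'a-z' 'A-Z'") for p in partitions]
print(result)  # → [['SPARK', 'PIG'], ['HADOOP']]
```

Note that each partition gets its own process, mirroring Hadoop streaming's mapper-per-split model; the piped command itself could be any executable script, but wiring a full Pig run into it is a different exercise from using Spark as Pig's execution engine.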