[jira] [Comment Edited] (BIGTOP-1366) Updated, Richer Model for Generating Data for BigPetStore

jay vyas (JIRA) Sun, 09 Nov 2014 19:07:22 -0800

    [ 
https://issues.apache.org/jira/browse/BIGTOP-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204270#comment-14204270
 ]


jay vyas edited comment on BIGTOP-1366 at 11/10/14 3:06 AM:
------------------------------------------------------------

Hi guys.
*TL;DR: I reviewed it on Spark 0.9 and created a driver bash script for 
submitting Spark jobs to 0.9, since spark-submit isn't available there. We 
might need a couple of minor mods to the Spark driver to match the 0.9 
SparkContext API. I did this testing in pure Bigtop 0.9 VMs. Details below:*

Okay, I've cobbled together a "spark submit" type script based on some 
templates I found online for Bigtop. This will be the way we submit jobs for 
*Spark 0.9.x*. When we upgrade to Spark 1.x we can use RJ's exact README 
directions above.

{noformat}
source /etc/spark/conf/spark-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71.x86_64/

# system jars:
CLASSPATH=$CLASSPATH:$SPARK_HOME/assembly/lib/*

# app jar:
CLASSPATH=$CLASSPATH:/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.1.jar:/bigtop-home/*jar:/usr/lib/spark/*:/usr/lib/spark/lib/*:/usr/lib/spark/assembly/lib/*

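# Spark settings passed as JVM system properties.
CONFIG_OPTS="-Dspark.master=local -Dspark.jars=target/sparkwordcount-0.0.1-SNAPSHOT.jar"

# Launch the main class directly with java.
# $CONFIG_OPTS is left unquoted on purpose so it splits into separate -D flags.
$JAVA_HOME/bin/java -cp "$CLASSPATH" $CONFIG_OPTS org.apache.spark.examples.SparkPi local 2 2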
{noformat}

result: 
{noformat}
[vagrant@bigtop1 ~]$ ./submit.sh
Reading zipcode data
Read 30891 zipcode entries
Reading name data
Read 86987 first names and 47819 last names
Reading product data
Read 4 product categories
Generating stores...
Done.
Generating customers...
Done.
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.SparkContext.<init>(Lorg/apache/spark/SparkConf;)V
        at com.github.rnowling.bps.datagenerator.spark.SparkDriver$.main(Driver.scala:45)
        at com.github.rnowling.bps.datagenerator.spark.SparkDriver.main(Driver.scala)
{noformat}

So we may need to refactor the way *SparkContext* is instantiated for the 
*0.9* API. Otherwise it looks to work quite well: the Spark driver launches 
and gives *great* error messages for a missing *resources/* dir and so on, 
which I really like. I did this in Bigtop VMs, and just copied the resources/* 
into {{bigtop-home}}.
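
A quick way to confirm which constructors the installed Spark actually offers might be to dump the class with javap and look for the one-argument SparkContext(SparkConf) signature that the stack trace reports as missing. The snippet below is only a sketch: the helper name has_conf_only_ctor and the SPARK_JAR variable are made up for illustration, and it assumes javap from the JDK is on the PATH and that SPARK_JAR points at whichever jar provides org.apache.spark.SparkContext on the VM.

```shell
# Hypothetical helper (not part of Spark or Bigtop): reads `javap` output on
# stdin and succeeds only if the one-argument SparkContext(SparkConf)
# constructor is declared.
has_conf_only_ctor() {
  grep -qF 'org.apache.spark.SparkContext(org.apache.spark.SparkConf);'
}

# Usage sketch. SPARK_JAR is an assumption: set it to the jar that contains
# org.apache.spark.SparkContext on your machine. Nothing runs if it is unset.
if [ -n "${SPARK_JAR:-}" ]; then
  if javap -classpath "$SPARK_JAR" org.apache.spark.SparkContext | has_conf_only_ctor; then
    echo "SparkContext(SparkConf) is available"
  else
    echo "SparkContext(SparkConf) is missing from this jar"
  fi
fi
```

If the check reports the constructor as missing, that would be consistent with the NoSuchMethodError above, and the driver would presumably need to be built against 0.9 or switched to a constructor that jar does provide.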


> Updated, Richer Model for Generating Data for BigPetStore 
> ----------------------------------------------------------
>
>                 Key: BIGTOP-1366
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1366
>             Project: Bigtop
>          Issue Type: Improvement
>          Components: blueprints
>    Affects Versions: backlog
>            Reporter: RJ Nowling
>            Assignee: RJ Nowling
>            Priority: Minor
>   Original Estimate: 8,736h
>  Remaining Estimate: 8,736h
>
> BigPetStore uses synthetic data as the basis for its workflow.  BPS's current 
> model for generating customer data is sufficient for basic testing of the 
> Hadoop ecosystem, **but the model is very basic and lacks sufficient 
> complexity for embedding interesting patterns into the data**.  
> As a result, **more complex, scalable testing such as testing clustering 
> algorithms in Mahout on non-trivial data or multidimensional data with 
> factors influencing it** is not currently possible.
> Efforts are currently underway to incrementally improve the current model 
> (see BIGTOP-1271 and BIGTOP-1272).  
> To create a model that can incorporate **realistic, non-hierarchical 
> patterns** and input data to generate rich customer/transaction data with 
> interesting correlations will require a re-imagining of the current model and 
> its framework.
> To support the improvements to the model in BigPetStore, I have been working 
> on an **alternative ab initio model, developed from scratch**. Since the 
> development of a new model involves substantial R&D work with more 
> specialized tools (mathematical and plotting libraries), I'm doing the 
> current work outside of BPS using the IPython Notebook environment.  Due to 
> the long time frame, the model will be developed on a separate timeline to 
> prevent slowing the development of BPS.  
> Once the model has stabilized, I will begin incorporating the model into BPS 
> itself.  One option is to implement the model using Scala for clean 
> integration with **Spark**, which is likely to play an increasingly important 
> role in the Hadoop ecosystem, and thus will be an important part of 
> BigPetStore as a test/blueprint app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
