[jira] [Updated] (GIRAPH-717) HiveJythonRunner with support for pure Jython value types.

Nitay Joffe (JIRA) Wed, 17 Jul 2013 09:29:26 -0700

     [ 
https://issues.apache.org/jira/browse/GIRAPH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nitay Joffe updated GIRAPH-717:
-------------------------------

    Description: 
This adds support for pure Jython jobs. Currently this runner is hooked up to 
work with Hive. I'll make it more generic later.

Running a Jython job is simply:

HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> \
  org.apache.giraph.hive.jython.HiveJythonRunner jython1.py [jython2.py] ...

You can pass in any number of scripts. They will be parsed in order and sent to 
all the workers using DistributedCache.

There are examples and tests in the diff. Here is one example:
launcher: https://gist.github.com/nitay/a62e0a5d369a5e701fa3
worker: https://gist.github.com/nitay/7834fd2b059527e65a36

There are a few pieces to a Jython job; I'll go over each part here.

HiveJythonRunner calls a function named "prepare(job)" defined in the Jython 
scripts. This is the entry point for configuring your job.
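
As a rough illustration of this entry point, a launcher script might look like 
the sketch below. FakeJob and the attribute names on it (workers, 
computation_name, vertex_id_type) are assumptions made up for this example so 
that it runs on its own; the launcher gist linked above shows the real 
configuration API.

```python
# Sketch only: FakeJob stands in for the job object that HiveJythonRunner
# passes to prepare(). Its attribute names are hypothetical.
class FakeJob(object):
    def __init__(self):
        self.workers = None
        self.computation_name = None
        self.vertex_id_type = None


def prepare(job):
    """Entry point: the runner calls this once to configure the job."""
    job.workers = 10
    job.computation_name = "MyComputation"  # hypothetical class name
    job.vertex_id_type = "IntWritable"      # a Java type, named as a string here


# Simulate what the runner does: build a job and hand it to prepare().
job = FakeJob()
prepare(job)
print(job.workers, job.computation_name, job.vertex_id_type)
```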

In this function you set up everything: your graph types (those IVEMM 
writables) and the Hive vertex/edge inputs and output. Each graph type is one 
of the following:
1) A Java type. For example, the user can simply specify IntWritable.
2) A Jython type that implements Writable. In the example above the message 
value implements Writable.
3) A pure Jython type. The Java code will wrap these objects in a Writable 
wrapper that serializes Jython values using pickle (Python's object 
serialization module, as provided by Jython).
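
To make option 3 concrete, here is a minimal sketch of the pickle round trip 
that such a wrapper relies on. The helper names to_bytes and from_bytes are 
made up for this example; the actual wrapper lives in the Java code and is not 
shown here.

```python
import pickle

# A "pure" value: plain Python data with no Writable methods of its own.
value = {"rank": 0.15, "neighbors": [2, 3, 5]}


def to_bytes(obj):
    # Roughly what a Writable wrapper could do when asked to write its value.
    return pickle.dumps(obj, 2)


def from_bytes(data):
    # Roughly what the wrapper could do when reading the value back on a worker.
    return pickle.loads(data)


restored = from_bytes(to_bytes(value))
print(restored == value)
```

The round trip yields an equal but distinct object, which is what lets an 
arbitrary Jython value travel between workers without implementing Writable.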

Your computation must implement JythonComputation. Note that this does not 
actually implement Computation; it is a separate class so that we can wrap all 
the types passed in with a wrapper that implements Writable. The methods are 
named the same, so the user does not notice any difference.

For Hive usage: if your value type is a primitive, e.g. IntWritable or 
LongWritable, you need not do anything. The Java code will automatically 
read/write the Hive table specified and convert between Hive types and the 
primitive Writable. The vertex_id type in the example works like this.
If your value is a custom Jython type, you must create classes which implement 
JythonHiveReader/JythonHiveWriter (or JythonHiveIO, which is both). These 
objects read/write Jython types from/to Hive. There are wrappers in the Java 
code which take the HiveIO data normally used in giraph-hive and turn it into 
Jython types. This means, for example, that getMap() will return a Jython 
dictionary instead of a Java Map.
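
A sketch of what a custom reader could look like is below. FakeHiveColumn, 
EdgeWeightsReader and the method name read_from_hive are all hypothetical 
stand-ins so the example runs on its own; only getMap() returning a dictionary 
comes from the description above. The worker gist linked earlier shows the 
real interfaces.

```python
class FakeHiveColumn(object):
    """Stand-in for the wrapped Hive data handed to a reader."""

    def __init__(self, data):
        self._data = data

    def getMap(self):
        # Mirrors the described behavior: a Python dict, not a Java Map.
        return dict(self._data)


class EdgeWeightsReader(object):
    """Hypothetical reader; a real one would implement JythonHiveReader."""

    def read_from_hive(self, column):  # method name is an assumption
        weights = column.getMap()
        # Because weights is a plain dict, normal Python idioms apply.
        return {"total": sum(weights.values()), "count": len(weights)}


reader = EdgeWeightsReader()
value = reader.read_from_hive(FakeHiveColumn({"a": 1.5, "b": 2.5}))
print(value)
```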

There is also a PageRankBenchmark (from a previous diff) implemented in Jython. 
Here's a run for comparison / sanity check:

PageRankBenchmark with 10 workers, 100M vertices, 10B edges, 10 compute threads
trunk:
  https://gist.github.com/nitay/3170fa3b575d4d2e22a9
  total time: 302466
with this diff:
  https://gist.github.com/nitay/a52b6d1d64e50ab9829e
  total time: 306517
in jython:
  https://gist.github.com/nitay/3f2e758b2933c3521727
  total time: 434730

So we see that existing code paths are essentially unaffected (this diff is 
about 1.3% slower than trunk in this run; is there something else I should 
test?) and that Jython has around 40% overhead.
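
For reference, the relative overheads can be recomputed from the three totals 
above with a few lines of Python. The helper name overhead_pct is made up for 
this example; the totals are copied from the runs listed above, in whatever 
unit the benchmark reports.

```python
# Totals copied from the three benchmark runs above.
trunk = 302466
with_diff = 306517
jython = 434730


def overhead_pct(measured, baseline):
    """Percentage by which measured exceeds baseline."""
    return 100.0 * (measured - baseline) / baseline


diff_vs_trunk = overhead_pct(with_diff, trunk)
jython_vs_diff = overhead_pct(jython, with_diff)
jython_vs_trunk = overhead_pct(jython, trunk)

# Roughly 1.3, 41.8 and 43.7 percent respectively.
print("%.1f%% %.1f%% %.1f%%" % (diff_vs_trunk, jython_vs_diff, jython_vs_trunk))
```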

ReviewBoard: https://reviews.apache.org/r/12543/ (Sorry it's a big one, hard to 
split up :/)

  was:
This adds support for pure Jython jobs. Currently this runner is hooked up to 
work with Hive. I'll make it more generic later.

Running a Jython job is simply:

HIVE_HOME=<x>
HADOOP_HOME=<y>
$HIVE_HOME/bin/hive --service jar <giraph-hive-jar> 
org.apache.giraph.hive.jython.HiveJythonRunner [jython1.py] [jython2.py]

You can pass in any number of scripts. They will be parsed in order and sent to 
all the workers using DistributedCache.

There are examples and tests in the diff. Here is one example:
launcher: https://gist.github.com/nitay/a62e0a5d369a5e701fa3
worker: https://gist.github.com/nitay/7834fd2b059527e65a36

There are a few pieces to a Jython job, I'll go over each part here.

The launcher defines the graph types (those IVEMM writables) and sets up the 
Hive vertex/edge inputs and output. Each graph type is one of the following:
1) A Java type. For example the user can specify simply IntWritable
2) A Jython type that implements Writable. In the example above the message 
value implements Writable.
3) A pure Jython type. The Java code will wrap these objects in a Writable 
wrapper that serializes Jython values using Pickle (jython IO framework).

For Hive usage - if your value type is a primitive e.g. IntWritable or 
LongWritable, then you need not do anything. The Java code will automatically 
read/write the Hive table specified and convert between Hive types and the 
primitive Writable. The vertex_id type in the example works like this.
If your value is a custom Jython type, you must create classes which implement 
JythonHiveReader/JythonHiveWriter (or JythonHiveIO which is both). These 
objects read/write Jython types from Hive. There are wrappers in the Java code 
which take HiveIO data normally used in giraph-hive and turns them into Jython 
types. This means, for example, that getMap() will return a Jython dictionary 
instead of a Java Map.

There is also a PageRankBenchmark (from previous diff) implemented in Jython. 
Here's a run for comparison / sanity check:

PageRankBenchmark with 10 workers, 100M vertices, 10B edges, 10 compute threads
trunk:
  https://gist.github.com/nitay/3170fa3b575d4d2e22a9
  total time: 302466
with this diff:
  https://gist.github.com/nitay/a52b6d1d64e50ab9829e
  total time: 306517
in jython:
  https://gist.github.com/nitay/3f2e758b2933c3521727
  total time: 434730

So we see that existing things are not affected (is there something else I 
should test?) and that Jython has around 40% overhead.

ReviewBoard: https://reviews.apache.org/r/12543/ (Sorry it's a big one, hard to 
split up :/)

    
> HiveJythonRunner with support for pure Jython value types.
> ----------------------------------------------------------
>
>                 Key: GIRAPH-717
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-717
>             Project: Giraph
>          Issue Type: Bug
>            Reporter: Nitay Joffe
>            Assignee: Nitay Joffe
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
