Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Xuelin Cao
Hi,

 Correct me if I'm wrong. It looks like the current version of
Spark-SQL uses a *tuple-at-a-time* model: each physical
operator produces one tuple at a time by recursively calling child.execute().

 There are papers that illustrate the benefits of a vectorized query
engine, and Hive-Stinger also embraces this style.

 So, the question is: will Spark-SQL support vectorized query
execution someday?

 Thanks
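For readers unfamiliar with the term, the tuple-at-a-time model described above (often called the Volcano iterator model) can be sketched in a few lines of toy code. This is an illustration only, not Spark-SQL's implementation:

```python
# Toy sketch of the tuple-at-a-time (Volcano) model: each operator pulls
# one row at a time from its child, paying a per-row call per operator.
def scan(rows):
    for row in rows:
        yield row

def filter_op(child, predicate):
    for row in child:
        if predicate(row):        # one call per row, per operator
            yield row

def project(child, columns):
    for row in child:
        yield tuple(row[c] for c in columns)

table = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
plan = project(filter_op(scan(table), lambda r: r[0] % 2 == 0), [1])
print(list(plan))                 # [('b',), ('d',)]
```

A vectorized engine would instead pass batches of column values between operators, amortizing those per-row calls.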


GraphX ShortestPaths backwards?

2015-01-19 Thread Michael Malak
GraphX ShortestPaths seems to be following edges backwards instead of forwards:

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
  sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))

lib.ShortestPaths.run(g,Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()),
(3,Map(3 -> 0)), (2,Map()))

lib.ShortestPaths.run(g,Array(1)).vertices.collect

res2: Array[(org.apache.spark.graphx.VertexId,
org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)),
(3,Map(1 -> 2)), (2,Map(1 -> 1)))

If my assessment is correct, then I believe the following changes
will make it run forward:

Change one occurrence of src to dst in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64

Change three occurrences of dst to src in
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65
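The reported behaviour can be reproduced without GraphX. Below is a toy sketch (plain Python, not the GraphX implementation) of the two traversal directions for the graph 1 -> 2 -> 3 with landmark 3:

```python
# Toy illustration: distances from landmark 3 in the graph 1 -> 2 -> 3.
# "forward" distance to a landmark is found by BFS over reversed edges;
# BFS over the edges as given reproduces the backward behaviour reported.
edges = [(1, 2), (2, 3)]

def distances_to_landmark(edges, landmark, forward=True):
    adj = {}
    for s, d in edges:
        if forward:
            adj.setdefault(d, []).append(s)   # walk edges in reverse
        else:
            adj.setdefault(s, []).append(d)
    dist = {landmark: 0}
    frontier = [landmark]
    while frontier:
        nxt = []
        for v in frontier:
            for u in adj.get(v, []):
                if u not in dist:
                    dist[u] = dist[v] + 1
                    nxt.append(u)
        frontier = nxt
    return dist

print(distances_to_landmark(edges, 3, forward=True))   # {3: 0, 2: 1, 1: 2}
print(distances_to_landmark(edges, 3, forward=False))  # {3: 0}
```

As a possible workaround until a fix lands, running ShortestPaths on g.reverse should yield the forward distances.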

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Join the developer community of spark

2015-01-19 Thread Alessandro Baretta
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Enjoy!

Alex

On Mon, Jan 19, 2015 at 6:44 PM, Jeff Wang jingjingwang...@gmail.com
wrote:

 Hi:

 I would like to contribute to the code of spark. Can I join the community?

 Thanks,

 Jeff



Is there any way to support multiple users executing SQL on thrift server?

2015-01-19 Thread Yi Tian
Is there any way to support multiple users executing SQL on one thrift 
server?


I think there are some problems in Spark 1.2.0. For example:

1. Start thrift server with user A
2. Connect to thrift server via beeline with user B
3. Execute “insert into table dest select … from table src”

Then we find these items on HDFS:

|drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary
drwxr-xr-x   - B supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/_temporary
drwxr-xr-x   - A supergroup  0 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00
-rw-r--r--   3 A supergroup   2671 2015-01-16 16:42 
/tmp/hadoop/hive_2015-01-16_16-42-48_923_1860943684064616152-3/-ext-1/_temporary/0/task_201501161642_0022_m_00/part-0
|

You can see that all the temporary paths created on the driver side (thrift
server side) are owned by user B (which is what we expected).


But all the output data created on the executor side is owned by user A
(which is NOT what we expected).
The wrong owner on the output data causes an
|org.apache.hadoop.security.AccessControlException| when the driver
side moves the output data into the |dest| table.


Does anyone know how to resolve this problem?


Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track
https://issues.apache.org/jira/browse/SPARK-5309



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10189.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Reynold Xin
It will probably eventually make its way into part of the query engine, one
way or another. Note that there is, in general, a lot of other lower-hanging
fruit to pick before you have to do vectorization.

As far as I know, Hive doesn't really have vectorization in the SIMD sense:
the vectorization in Hive simply processes everything in small batches, in
order to avoid the virtual function call overhead, hoping the JVM can
unroll some of the loops. There is no SIMD involved.

Something that is pretty useful, which isn't exactly from vectorization but
comes from similar lines of research, is being able to push predicates down
into the columnar compression encoding. For example, one can turn string
comparisons into integer comparisons. These will probably give much larger
performance improvements in common queries.
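The string-to-integer comparison idea can be illustrated with a toy dictionary-encoded column (plain Python with hypothetical data; this is not Spark code):

```python
# Sketch of pushing a string-equality predicate into a dictionary-encoded
# column: translate the predicate once, then compare integers per row.
dictionary = ["apple", "banana", "cherry"]   # distinct column values
encoded = [0, 2, 1, 0, 2, 2, 1]              # column stored as dictionary codes

def filter_eq(column_codes, dictionary, wanted):
    # Translate the predicate once: find the code for the wanted string.
    try:
        code = dictionary.index(wanted)
    except ValueError:
        return []                            # value absent: nothing matches
    # Per-row work is now an integer comparison, not a string comparison.
    return [i for i, c in enumerate(column_codes) if c == code]

print(filter_eq(encoded, dictionary, "cherry"))   # [1, 4, 5]
```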


On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

 Hi,

  Correct me if I'm wrong. It looks like the current version of
 Spark-SQL uses a *tuple-at-a-time* model: each physical
 operator produces one tuple at a time by recursively calling child.execute().

  There are papers that illustrate the benefits of a vectorized query
 engine, and Hive-Stinger also embraces this style.

  So, the question is: will Spark-SQL support vectorized query
 execution someday?

  Thanks



Re: Memory config issues

2015-01-19 Thread Sean Owen
On Mon, Jan 19, 2015 at 6:29 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Its the executor memory (spark.executor.memory) which you can set while
 creating the spark context. By default it uses 0.6% of the executor memory

(Uses 0.6 or 60%)
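To spell out the arithmetic being corrected: the fraction in question is spark.storage.memoryFraction, whose Spark 1.x default of 0.6 means 60% of executor memory is available for caching, not 0.6%. A toy illustration (the 4 GB figure is hypothetical):

```python
# spark.storage.memoryFraction defaults to 0.6, i.e. 60% -- not 0.6%.
executor_memory_mb = 4096        # hypothetical spark.executor.memory
storage_fraction = 0.6           # default spark.storage.memoryFraction (Spark 1.x)
cache_mb = executor_memory_mb * storage_fraction
print(cache_mb)                  # 2457.6 MB usable for cached RDD blocks
```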




Re: Spark client reconnect to driver in yarn-cluster deployment mode

2015-01-19 Thread Romi Kuntsman
"In yarn-client mode it only controls the environment of the executor
launcher."

So either you use yarn-client mode, and then your app keeps running and
controls the process; or you use yarn-cluster mode, and then you send a jar
to YARN, and that jar should have code to report the result back to you.
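One way the jar can report the result back, sketched as a toy (the paths are hypothetical, and this is not a Spark API; a real job would typically write to HDFS, a database, or a message queue, with the client polling that location):

```python
# Toy sketch of the "report back" pattern for yarn-cluster mode: the driver
# writes its result to an agreed location; the client polls for it.
import json, os, tempfile, time

result_path = os.path.join(tempfile.gettempdir(), "job-result.json")  # stand-in for an HDFS path

def driver_side(result):
    # Final step of the main() shipped in the jar.
    with open(result_path, "w") as f:
        json.dump(result, f)

def client_side(timeout_s=5.0, poll_s=0.1):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if os.path.exists(result_path):
            with open(result_path) as f:
                return json.load(f)
        time.sleep(poll_s)
    raise TimeoutError("driver did not report a result")

driver_side({"count": 42})
print(client_side())              # {'count': 42}
```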

*Romi Kuntsman*, *Big Data Engineer*
 http://www.totango.com

On Thu, Jan 15, 2015 at 1:52 PM, preeze etan...@gmail.com wrote:

 From the official spark documentation
 (http://spark.apache.org/docs/1.2.0/running-on-yarn.html):

 In yarn-cluster mode, the Spark driver runs inside an application master
 process which is managed by YARN on the cluster, and the client can go away
 after initiating the application.

 Is there any designed way that the client connects back to the driver
 (still
 running in YARN) for collecting results at a later stage?



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-client-reconnect-to-driver-in-yarn-cluster-deployment-mode-tp10122.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.





Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary->String
conversion. Query times are reduced significantly, and the variation in
timings is much smaller thanks to the reduction in garbage.

Set of sample queries selecting various columns, applying some filtering and
then aggregating

Spark 1.2.0
Query 1 mean time 8353.3 millis, std deviation 480.91511147441025 millis
Query 2 mean time 8677.6 millis, std deviation 3193.345518417949 millis
Query 3 mean time 11302.5 millis, std deviation 2989.9406998950476 millis
Query 4 mean time 10537.0 millis, std deviation 5166.024024549462 millis
Query 5 mean time 9559.9 millis, std deviation 4141.487667493409 millis
Query 6 mean time 12638.1 millis, std deviation 3639.4505522430477 millis


Spark 1.2.0 - cache last Binary->String conversion
Query 1 mean time 5118.9 millis, std deviation 549.6670608448152 millis
Query 2 mean time 3761.3 millis, std deviation 202.57785883183013 millis
Query 3 mean time 7358.8 millis, std deviation 242.58918176850162 millis
Query 4 mean time 4173.5 millis, std deviation 179.802515122688 millis
Query 5 mean time 3857.0 millis, std deviation 140.71957930579526 millis
Query 6 mean time 7512.0 millis, std deviation 198.32633040858022 millis
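A sketch of the caching idea being measured, assuming the patch caches the last Binary seen (this toy uses Python byte-string equality as a stand-in for comparing Binary objects; it is not the actual Spark/Parquet code):

```python
# Sketch: cache the last raw -> String conversion so runs of equal
# dictionary-encoded values decode once instead of once per row.
class CachedStringDecoder:
    def __init__(self):
        self.last_raw = None
        self.last_str = None
        self.conversions = 0          # counted for illustration only

    def decode(self, raw):
        if raw != self.last_raw:      # real code can compare Binary identity
            self.conversions += 1
            self.last_raw = raw
            self.last_str = raw.decode("utf-8")
        return self.last_str

decoder = CachedStringDecoder()
column = [b"London", b"London", b"London", b"Paris", b"London"]
decoded = [decoder.decode(v) for v in column]
print(decoded, decoder.conversions)   # 3 conversions instead of 5
```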




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10193.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: RDD order guarantees

2015-01-19 Thread Ewan Higgs

Hi Reynold.
I'll take a look.

SPARK-5300 is open for this issue.
-Ewan

On 19/01/15 08:39, Reynold Xin wrote:

Hi Ewan,

Not sure if there is a JIRA ticket (there are too many that I lose track).

I chatted briefly with Aaron on this. The way we can solve it is to 
create a new FileSystem implementation that overrides the listStatus 
method, and then in Hadoop Conf set the fs.file.impl to that.


Shouldn't be too hard. Would you be interested in working on it?
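The essence of the proposed override, as a miniature stand-in (plain Python rather than the Hadoop FileSystem API; the file names are illustrative):

```python
# Miniature stand-in for the proposed listStatus override: sort partition
# files by name so part-r-00000, part-r-00001, ... come back in save order
# regardless of the local filesystem's native listing order.
def list_partitions(listing):
    # HDFS happens to return names sorted; local filesystems often don't.
    data_files = [f for f in listing if f.startswith("part-")]
    return sorted(data_files)

unsorted = ["part-r-00002", "_SUCCESS", "part-r-00000", "part-r-00001"]
print(list_partitions(unsorted))   # ['part-r-00000', 'part-r-00001', 'part-r-00002']
```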




On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs ewan.hi...@ugent.be wrote:


Yes, I am running on a local file system.

Is there a bug open for this? Mingyu Kim reported the problem last
April:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

-Ewan


On 01/16/2015 07:41 PM, Reynold Xin wrote:

You are running on a local file system, right? HDFS orders the
files by name, but local file systems often don't. I think
that's the reason for the difference.

We might be able to do a sort and order the partitions when we
create an RDD to make this universal, though.

On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs ewan.hi...@ugent.be wrote:

Hi all,
Quick one: when reading files, are the orders of partitions
guaranteed to be preserved? I am finding some weird behaviour
where I run sortByKeys() on an RDD (which has 16 byte keys)
and write it to disk. If I open a python shell and run the
following:

for part in range(29):
    print map(ord,
        open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
             'r').read(16))

Then each partition is in order based on the first value of
each partition.

I can also call TeraValidate.validate from TeraSort and it is
happy with the results. It seems to be on loading the file
that the reordering happens. If this is expected, is there a
way to ask Spark nicely to give me the RDD in the order it
was saved?

This is based on trying to fix my TeraValidate code on this
branch:
https://github.com/ehiggs/spark/tree/terasort

Thanks,
Ewan










Re: Semantics of LGTM

2015-01-19 Thread Prashant Sharma
Patrick's original proposal LGTM :). Until now, though, I had been under the
impression that LGTM put special emphasis on the TM part. That said, I will be
okay/happy with (and responsible for) the patch if it goes in.

Prashant Sharma



On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote:

 Maybe just to avoid LGTM as a single token when it is not actually
 according to Patrick's definition, but anybody can still leave comments
 like:

 "The direction of the PR looks good to me." or "+1 on the direction"

 "The build part looks good to me"

 ...


 On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  +1 to Patrick's proposal of strong LGTM semantics.  On past projects, I've
  heard the semantics of LGTM expressed as "I've looked at this thoroughly
  and take as much ownership as if I wrote the patch myself."  My
  understanding is that this is the level of review we expect for all
 patches
  that ultimately go into Spark, so it's important to have a way to
 concisely
  describe when this has been done.
 
  Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
  cases I've seen, if someone else says "I looked at this very quickly and
  didn't see any glaring problems," it doesn't add any value for subsequent
  reviewers (someone still needs to take a thorough look).
 
  -Kay
 
  On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:
 
   Yeah, the ASF +1 has become partly overloaded to mean both I would
 like
   to see this feature and this patch should be committed, although, at
   least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
 vote)
   should unambiguously mean the latter unless qualified in some other
 way.
  
   I don't have any opinion on the specific characters, but I agree with
   Aaron that it would be nice to have some sort of abbreviation for both
  the
   strong and weak forms of approval.
  
   -Sandy
  
On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   
I think the ASF +1 is *slightly* different than Google's LGTM,
 because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind it's
technical contents. For instance, I've seen people pile on +1's to
 try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.
   
There is definitely much overlap though in the meaning, though, and
it's largely because Spark had it's own culture around reviews before
it was donated to the ASF, so there is a mix of two styles.
   
Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google,
 and
some open source projects such as Impala) to indicate technical
sign-off.
   
- Patrick
   
On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson ilike...@gmail.com
 
   wrote:
I think I've seen something like +2 = strong LGTM and +1 = weak
  LGTM;
someone else should review before. It's nice to have a shortcut
 which
   isn't
a sentence when talking about weaker forms of LGTM.
   
On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
   
I think clarifying these semantics is definitely worthwhile. Maybe
  this
complicates the process with additional terminology, but the way
 I've
   used
these has been:
   
+1 - I think this is safe to merge and, barring objections from
  others,
would merge it immediately.
   
LGTM - I have no concerns about this patch, but I don't necessarily
   feel
qualified to make a final call about it.  The TM part acknowledges
  the
judgment as a little more subjective.
   
I think having some concise way to express both of these is useful.
   
-Sandy
   
On Jan 17, 2015, at 5:40 PM, Patrick Wendell pwend...@gmail.com
   wrote:
   
Hey All,
   
Just wanted to ping about a minor issue - but one that ends up
  having
consequence given Spark's volume of reviews and commits. As much
 as
possible, I think that we should try and gear towards Google
 Style
LGTM on reviews. What I mean by this is that LGTM has the
 following
semantics:
   
I know this code well, or I've looked at it close enough to feel
confident it should be merged. If there are issues/bugs with this
  code
later on, I feel confident I can help with them.
   
Here is an alternative semantic:
   
Based on what I know about this part of the code, I don't see any
show-stopper problems with this patch.
   
The issue with the latter is that it ultimately erodes the
significance of LGTM, since subsequent reviewers need to reason
  about
what the person meant by saying LGTM. In contrast, having strong
semantics around LGTM can help 

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies

Looking at Parquet code - it looks like hooks are already in place to
support this.

In particular PrimitiveConverter has methods hasDictionarySupport and
addValueFromDictionary for this purpose. These are not used by
CatalystPrimitiveConverter.

I think that it would be pretty straightforward to add this. Has anyone
considered this? Shall I put a pull request together for it?

Mick



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10195.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Reynold Xin
Definitely go for a pull request!


On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies michael.belldav...@gmail.com
wrote:


 Looking at Parquet code - it looks like hooks are already in place to
 support this.

 In particular PrimitiveConverter has methods hasDictionarySupport and
 addValueFromDictionary for this purpose. These are not used by
 CatalystPrimitiveConverter.

 I think that it would be pretty straightforward to add this. Has anyone
 considered this? Shall I get a pull request  together for it.

 Mick



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10195.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.





Re: GraphX vertex partition/location strategy

2015-01-19 Thread Ankur Dave
No - the vertices are hash-partitioned onto workers independently of the
edges. It would be nice for each vertex to be on the worker with the most
adjacent edges, but we haven't done this yet since it would add a lot of
complexity to avoid load imbalance while reducing the overall communication
by a small factor.

We refer to the number of partitions containing adjacent edges for a
particular vertex as the vertex's replication factor. I think the typical
replication factor for power-law graphs with 100-200 partitions is 10-15,
and placing the vertex at the ideal location would only reduce the
replication factor by 1.

Ankur http://www.ankurdave.com/

On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak 
michaelma...@yahoo.com.invalid wrote:

 Does GraphX make an effort to co-locate vertices onto the same workers as
 the majority (or even some) of its edges?
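The replication factor Ankur describes can be computed directly from an edge partitioning; here is a toy sketch with a hypothetical 3-partition layout:

```python
# Toy computation of the "replication factor": for each vertex, the number
# of edge partitions containing at least one of its adjacent edges.
edge_partitions = [
    [(1, 2), (1, 3)],     # partition 0
    [(1, 4), (2, 3)],     # partition 1
    [(4, 5)],             # partition 2
]

def replication_factor(edge_partitions):
    parts = {}
    for pid, edges in enumerate(edge_partitions):
        for s, d in edges:
            parts.setdefault(s, set()).add(pid)
            parts.setdefault(d, set()).add(pid)
    return {v: len(ps) for v, ps in parts.items()}

print(replication_factor(edge_partitions))
# e.g. vertex 1 has edges in partitions 0 and 1, so its factor is 2
```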



Re: GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
But wouldn't the gain be greater under something similar to EdgePartition1D 
(but perhaps better load-balanced based on number of edges for each vertex) and 
an algorithm that primarily follows edges in the forward direction?
  From: Ankur Dave ankurd...@gmail.com
 To: Michael Malak michaelma...@yahoo.com 
Cc: dev@spark.apache.org dev@spark.apache.org 
 Sent: Monday, January 19, 2015 2:08 PM
 Subject: Re: GraphX vertex partition/location strategy
   
No - the vertices are hash-partitioned onto workers independently of the edges. 
It would be nice for each vertex to be on the worker with the most adjacent 
edges, but we haven't done this yet since it would add a lot of complexity to 
avoid load imbalance while reducing the overall communication by a small factor.
We refer to the number of partitions containing adjacent edges for a particular 
vertex as the vertex's replication factor. I think the typical replication 
factor for power-law graphs with 100-200 partitions is 10-15, and placing the 
vertex at the ideal location would only reduce the replication factor by 1.

Ankur


On Mon, Jan 19, 2015 at 12:20 PM, Michael Malak 
michaelma...@yahoo.com.invalid wrote:

Does GraphX make an effort to co-locate vertices onto the same workers as the 
majority (or even some) of its edges?



   

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
The wiki does not seem to be operational ATM, but I will do this when
it is back up.

On Mon, Jan 19, 2015 at 12:00 PM, Patrick Wendell pwend...@gmail.com wrote:
 Okay - so given all this I was going to put the following on the wiki
 tentatively:

 ## Reviewing Code
 Community code review is Spark's fundamental quality assurance
 process. When reviewing a patch, your goal should be to help
 streamline the committing process by giving committers confidence this
 patch has been verified by an additional party. It's encouraged to
 (politely) submit technical feedback to the author to identify areas
 for improvement or potential bugs.

 If you feel a patch is ready for inclusion in Spark, indicate this to
 committers with a comment: "I think this patch looks good." Spark uses
 the LGTM convention for indicating the highest level of technical
 sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
 strong statement; it should be interpreted as the following: "I've
 looked at this thoroughly and take as much ownership as if I wrote the
 patch myself." If you comment LGTM you will be expected to help with
 bugs or follow-up issues on the patch. Judicious use of LGTMs is a
 great way to gain credibility as a reviewer with the broader
 community.

 It's also welcome for reviewers to argue against the inclusion of a
 feature or patch. Simply indicate this in the comments.

 - Patrick

 On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma scrapco...@gmail.com wrote:
 Patrick's original proposal LGTM :).  However until now, I have been in the
 impression of LGTM with special emphasis on TM part. That said, I will be
 okay/happy(or Responsible ) for the patch, if it goes in.

 Prashant Sharma



 On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote:

 Maybe just to avoid LGTM as a single token when it is not actually
 according to Patrick's definition, but anybody can still leave comments
 like:

 The direction of the PR looks good to me. or +1 on the direction

 The build part looks good to me

 ...


 On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
  I've
  heard the semantics of LGTM expressed as I've looked at this
  thoroughly
  and take as much ownership as if I wrote the patch myself.  My
  understanding is that this is the level of review we expect for all
  patches
  that ultimately go into Spark, so it's important to have a way to
  concisely
  describe when this has been done.
 
  Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
  cases I've seen, if someone else says I looked at this very quickly and
  didn't see any glaring problems, it doesn't add any value for
  subsequent
  reviewers (someone still needs to take a thorough look).
 
  -Kay
 
  On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:
 
   Yeah, the ASF +1 has become partly overloaded to mean both I would
   like
   to see this feature and this patch should be committed, although,
   at
   least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
   vote)
   should unambiguously mean the latter unless qualified in some other
   way.
  
   I don't have any opinion on the specific characters, but I agree with
   Aaron that it would be nice to have some sort of abbreviation for both
  the
   strong and weak forms of approval.
  
   -Sandy
  
On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   
I think the ASF +1 is *slightly* different than Google's LGTM,
because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind it's
technical contents. For instance, I've seen people pile on +1's to
try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.
   
There is definitely much overlap though in the meaning, though, and
it's largely because Spark had it's own culture around reviews
before
it was donated to the ASF, so there is a mix of two styles.
   
Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google,
and
some open source projects such as Impala) to indicate technical
sign-off.
   
- Patrick
   
On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson
ilike...@gmail.com
   wrote:
I think I've seen something like +2 = strong LGTM and +1 = weak
  LGTM;
someone else should review before. It's nice to have a shortcut
which
   isn't
a sentence when talking about weaker forms of LGTM.
   
On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
   
I think clarifying these semantics is definitely worthwhile. Maybe
  this

Re: Semantics of LGTM

2015-01-19 Thread Patrick Wendell
Okay - so given all this I was going to put the following on the wiki
tentatively:

## Reviewing Code
Community code review is Spark's fundamental quality assurance
process. When reviewing a patch, your goal should be to help
streamline the committing process by giving committers confidence this
patch has been verified by an additional party. It's encouraged to
(politely) submit technical feedback to the author to identify areas
for improvement or potential bugs.

If you feel a patch is ready for inclusion in Spark, indicate this to
committers with a comment: "I think this patch looks good." Spark uses
the LGTM convention for indicating the highest level of technical
sign-off on a patch: simply comment with the word "LGTM". An LGTM is a
strong statement; it should be interpreted as the following: "I've
looked at this thoroughly and take as much ownership as if I wrote the
patch myself." If you comment LGTM you will be expected to help with
bugs or follow-up issues on the patch. Judicious use of LGTMs is a
great way to gain credibility as a reviewer with the broader
community.

It's also welcome for reviewers to argue against the inclusion of a
feature or patch. Simply indicate this in the comments.

- Patrick

On Mon, Jan 19, 2015 at 2:40 AM, Prashant Sharma scrapco...@gmail.com wrote:
 Patrick's original proposal LGTM :).  However until now, I have been in the
 impression of LGTM with special emphasis on TM part. That said, I will be
 okay/happy(or Responsible ) for the patch, if it goes in.

 Prashant Sharma



 On Sun, Jan 18, 2015 at 2:33 PM, Reynold Xin r...@databricks.com wrote:

 Maybe just to avoid LGTM as a single token when it is not actually
 according to Patrick's definition, but anybody can still leave comments
 like:

 The direction of the PR looks good to me. or +1 on the direction

 The build part looks good to me

 ...


 On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  +1 to Patrick's proposal of strong LGTM semantics.  On past projects,
  I've
  heard the semantics of LGTM expressed as I've looked at this
  thoroughly
  and take as much ownership as if I wrote the patch myself.  My
  understanding is that this is the level of review we expect for all
  patches
  that ultimately go into Spark, so it's important to have a way to
  concisely
  describe when this has been done.
 
  Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
  cases I've seen, if someone else says I looked at this very quickly and
  didn't see any glaring problems, it doesn't add any value for
  subsequent
  reviewers (someone still needs to take a thorough look).
 
  -Kay
 
  On Sat, Jan 17, 2015 at 8:04 PM, sandy.r...@cloudera.com wrote:
 
   Yeah, the ASF +1 has become partly overloaded to mean both I would
   like
   to see this feature and this patch should be committed, although,
   at
   least in Hadoop, using +1 on JIRA (as opposed to, say, in a release
   vote)
   should unambiguously mean the latter unless qualified in some other
   way.
  
   I don't have any opinion on the specific characters, but I agree with
   Aaron that it would be nice to have some sort of abbreviation for both
  the
   strong and weak forms of approval.
  
   -Sandy
  
On Jan 17, 2015, at 7:25 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   
I think the ASF +1 is *slightly* different than Google's LGTM,
because
it might convey wanting the patch/feature to be merged but not
necessarily saying you did a thorough review and stand behind it's
technical contents. For instance, I've seen people pile on +1's to
try
and indicate support for a feature or patch in some projects, even
though they didn't do a thorough technical review. This +1 is
definitely a useful mechanism.
   
There is definitely much overlap though in the meaning, though, and
it's largely because Spark had it's own culture around reviews
before
it was donated to the ASF, so there is a mix of two styles.
   
Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
proposed originally (unlike the one Sandy proposed, e.g.). This is
what I've seen every project using the LGTM convention do (Google,
and
some open source projects such as Impala) to indicate technical
sign-off.
   
- Patrick
   
On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson
ilike...@gmail.com
   wrote:
I think I've seen something like +2 = strong LGTM and +1 = weak
  LGTM;
someone else should review before. It's nice to have a shortcut
which
   isn't
a sentence when talking about weaker forms of LGTM.
   
On Sat, Jan 17, 2015 at 6:59 PM, sandy.r...@cloudera.com wrote:
   
I think clarifying these semantics is definitely worthwhile. Maybe
  this
complicates the process with additional terminology, but the way
I've
   used
these has been:
   
+1 - I think this is safe to merge and, barring objections from
  others,

GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
Does GraphX make an effort to co-locate vertices onto the same workers as the 
majority (or even some) of its edges?
