Hi Spark dev list,

Thank you all so much for your input; we really appreciate the suggestions. After 
some discussion within the team, we decided to stay under the org.apache 
namespace for now, and to attach comments explaining what we did and why.

As the Spark dev list kindly pointed out, this is a known issue documented in 
the JIRA ticket [SPARK-19498] [0]. We will follow the ticket to see whether any 
newly suggested practices should be adopted, and make the corresponding fixes.

Best,
Shouheng

[0] https://issues.apache.org/jira/browse/SPARK-19498

From: Tim Hunter [mailto:timhun...@databricks.com]
Sent: Friday, February 24, 2017 9:08 AM
To: Joseph Bradley <jos...@databricks.com>
Cc: Steve Loughran <ste...@hortonworks.com>; Shouheng Yi 
<sho...@microsoft.com.invalid>; Apache Spark Dev <dev@spark.apache.org>; Markus 
Weimer <mwei...@microsoft.com>; Rogan Carr <roc...@microsoft.com>; Pei Jiang 
<pej...@microsoft.com>; Miruna Oprescu <mopre...@microsoft.com>
Subject: Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Regarding logging, Graphframes makes a simple wrapper this way:

https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala
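
The idea is roughly the following. This is a minimal sketch of such a wrapper, 
not the actual Graphframes code; the package name is hypothetical, and it 
delegates to SLF4J instead of depending on Spark's private[spark] Logging trait:

```scala
// Hypothetical package for your own project; keeping the trait in your
// own namespace means nothing depends on org.apache.spark.internal.Logging.
package com.example.logging

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  // Lazily resolve a logger named after the concrete class mixing this in.
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(this.getClass.getName.stripSuffix("$"))

  // By-name parameters avoid building the message when the level is off.
  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String, e: Throwable): Unit =
    if (log.isErrorEnabled) log.error(msg, e)
}
```

Any class in your project can then mix in this trait and call logInfo/logWarning 
without touching Spark internals.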

Regarding the UDTs, they have been hidden while they are reworked for Datasets; 
the reasons are detailed here [1]. Can you describe your use case in more 
detail? You may be better off copy/pasting the UDT code outside of Spark, 
depending on your use case.

[1] 
https://issues.apache.org/jira/browse/SPARK-14155

On Thu, Feb 23, 2017 at 3:42 PM, Joseph Bradley 
<jos...@databricks.com> wrote:
+1 for Nick's comment about discussing APIs which need to be made public in 
https://issues.apache.org/jira/browse/SPARK-19498 !

On Thu, Feb 23, 2017 at 2:36 AM, Steve Loughran 
<ste...@hortonworks.com> wrote:

On 22 Feb 2017, at 20:51, Shouheng Yi 
<sho...@microsoft.com.INVALID> wrote:

Hi Spark developers,

Currently my team at Microsoft is extending Spark's machine learning 
functionality to include new learners and transformers. We would like users to 
use these within Spark pipelines so that they can mix and match with existing 
Spark learners/transformers and have an overall native Spark experience. We 
cannot accomplish this using a non-"org.apache" namespace with the current 
implementation, and we don't want to release code inside the Apache namespace 
because it's confusing and there could be naming-rights issues.

This isn't actually something the ASF has a strong stance against; it's more 
left to the projects themselves. After all, the source is licensed by the ASF, 
and the license doesn't say you can't.

Indeed, there's a bit of org.apache.hive in the Spark codebase where the Hive 
team kept stuff package-private, though that's really a sign that things could 
be improved there.

Where it is problematic is that stack traces end up blaming the wrong group; 
nobody likes getting a bug report for a bug that doesn't actually exist in 
their codebase, not least because you have to waste time just working that out.

You also have to expect absolutely no stability guarantees, so you'd better set 
up your nightly build to work against trunk.

Apache Bahir does put some stuff into org.apache.spark.stream, but they've sort 
of inherited that right when they picked up the code from Spark; new stuff is 
going into org.apache.bahir.


We need to extend several classes from Spark which happen to be declared 
"private[spark]". For example, one of our classes extends VectorUDT [0], which 
is declared as private[spark] class VectorUDT. This unfortunately puts us in a 
strange position that forces us to work under the namespace org.apache.spark.

To be specific, the private classes/traits we currently need in order to create 
new Spark learners and transformers are HasInputCol, VectorUDT, and Logging. We 
will expand this list as we develop more.
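
To illustrate the constraint: because these traits are package-qualified 
private, a transformer that mixes one in only compiles if it is declared inside 
the org.apache.spark package tree. The sketch below is hypothetical (the 
sub-package and class name are made up), targeting the Spark 2.x ML API:

```scala
// Must live under org.apache.spark.* to see the private[spark]/private[ml]
// shared param traits; the "custom" sub-package name is illustrative only.
package org.apache.spark.ml.custom

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.param.shared.HasInputCol
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class MyTransformer(override val uid: String)
    extends Transformer with HasInputCol {  // compiles only in this package tree

  def this() = this(Identifiable.randomUID("myTransformer"))

  def setInputCol(value: String): this.type = set(inputCol, value)

  // Placeholder: a real transformer would compute something from $(inputCol).
  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()

  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MyTransformer = defaultCopy(extra)
}
```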

I do think it's a shame that Logging went from public to private.

One thing that could be done there is to copy the logging code into Bahir, 
under an org.apache.bahir package, for yourself and others to use. That'd be 
beneficial to me too.

For the ML stuff, that might be a place to work too, if you are going to 
open-source the code.




Is there a way to avoid this namespace issue? What do other people/companies do 
in this scenario? Thank you for your help!

I've hit this problem in the past. Scala code tends to force your hand here 
precisely because of that (very nice) private feature. While it offers a 
project the ability to guarantee that implementation details aren't picked up 
where they weren't intended to be, in OSS development all of that 
implementation is visible and tempting to use for lower-level integration.

What I tend to do is keep my own code in its own package and build as thin a 
bridge as possible over to it from the [private] scope. It's also important to 
name things obviously, say org.apache.spark.microsoft, so stack traces in bug 
reports can be dealt with more easily.



[0]: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala

Best,
Shouheng




--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com
