Re: A major addition to Pig. Working with spatial data

Ahmed Eldawy Thu, 02 May 2013 12:00:37 -0700

Thanks Alan for your interest. It's too bad that an open source licensing
issue is holding me back from doing some open source work. I understand the
issue and your workarounds make sense. However, as I mentioned in the
beginning, I don't want to have my own branch of Pig because it makes my
extension less portable. I'll think of another way to do it. I'll ask Vivid
Solutions if they can dual-license their code, although I think the answer
will be no. I'll also think of a way to ship my extension as a set of jar
files without the need to change the core of Pig. This way, it can be
easily ported to newer versions of Pig.
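
As a rough sketch of the jar-only route (an illustration under stated
assumptions, not code from the project): geometries could travel as WKT text
in an ordinary chararray field and be parsed inside each function, so no new
core data type is required. The class and method names below are
hypothetical, and the regex-based parsing is a stdlib-only stand-in for a
real geometry library. It computes bounding boxes only, which is a cheap
pre-filter rather than an exact Intersects test.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical stand-in for a jar-only spatial function. Geometries arrive as
 * WKT strings (a plain chararray in Pig), so no new core data type is needed.
 * Only 2D coordinates in plain decimal notation are handled.
 */
public class WktEnvelope {
    // Matches each "x y" coordinate pair inside a WKT string.
    private static final Pattern PAIR =
        Pattern.compile("(-?\\d+(?:\\.\\d+)?)\\s+(-?\\d+(?:\\.\\d+)?)");

    /** Returns {minX, minY, maxX, maxY} over all coordinates in the WKT text. */
    public static double[] envelope(String wkt) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        Matcher m = PAIR.matcher(wkt);
        boolean found = false;
        while (m.find()) {
            double x = Double.parseDouble(m.group(1));
            double y = Double.parseDouble(m.group(2));
            minX = Math.min(minX, x);
            minY = Math.min(minY, y);
            maxX = Math.max(maxX, x);
            maxY = Math.max(maxY, y);
            found = true;
        }
        if (!found) {
            throw new IllegalArgumentException("No coordinates in WKT: " + wkt);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    /** Bounding-box overlap test: a cheap pre-filter, not an exact Intersects. */
    public static boolean envelopesIntersect(String wktA, String wktB) {
        double[] a = envelope(wktA);
        double[] b = envelope(wktB);
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }
}
```

In an actual extension, logic like this would presumably sit inside a class
extending Pig's EvalFunc, packaged in a jar that users load with REGISTER and
alias with DEFINE, with the parsing delegated to a geometry library whose
license permits it.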


Thanks
Ahmed

Best regards,
Ahmed Eldawy


On Thu, May 2, 2013 at 12:33 PM, Alan Gates <[email protected]> wrote:

> I know this is frustrating, but the different licenses do have different
> requirements that make it so that Apache can't ship GPL code.  A legal
> explanation is at <http://www.apache.org/licenses/GPL-compatibility.html>.
> For additional info on the LGPL-specific questions see
> http://www.apache.org/legal/3party.html
>
> As far as pulling it in via ivy, the issue isn't so much where the code
> lives as what code we are requiring to make Pig work.  If something
> that is [L]GPL is required for Pig it violates Apache rules as outlined
> above.  It also would be a show stopper for a lot of companies that
> redistribute Pig and that are allergic to GPL software.
>
> So, as I said before, if you wanted to continue with that library and they
> are not willing to relicense it then it would have to be bolted on after
> Apache Pig is built.  Nothing stops you from doing this by downloading
> Apache Pig, adding this library and your code, and redistributing, though
> it wouldn't then be open to all Pig users.
>
> Alan.
>
> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
>
> > Thanks for your response. I was never good at differentiating all those
> > open source licenses. I mean, what is the point of making open source
> > licenses if they block me from using a library in an open source project?
> > Anyway, I'm not going into debate here. Just one question: if we use JTS
> > as a library (jar file) without adding the code in Pig, is it still a
> > violation? We'll use ivy, for example, to download the jar file when
> > compiling.
> > On May 1, 2013 7:50 PM, "Alan Gates" <[email protected]> wrote:
> >
> >> Passing on the technical details for a moment, I see a licensing issue.
> >> JTS is licensed under LGPL.  Apache projects cannot contain or ship
> >> [L]GPL.  Apache does not meet the requirements of GPL and thus we cannot
> >> repackage their code. If you wanted to go forward using that class this
> >> would have to be packaged as an add-on that was downloaded separately
> >> and not from Apache.  Another option is to work with the JTS community
> >> and see if they are willing to dual license their code under BSD or
> >> Apache license so that Pig could include it.  If neither of those is an
> >> option you would need to come up with a new class to contain your
> >> spatial data.
> >>
> >> Alan.
> >>
> >> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> >>
> >>> Hi all,
> >>> First, sorry for the long email. I wanted to put all my thoughts here
> >>> and get your feedback.
> >>> I'm proposing a major addition to Pig that will greatly increase its
> >>> functionality and user base. It is simply to add spatial support to the
> >>> language and the framework. I've already started working on that but I
> >>> don't want it to be just another branch. I want it, eventually, to be
> >>> merged with the trunk of Apache Pig. So, I'm sending this email mainly
> >>> to reach out to the main contributors of Pig to see the feasibility of
> >>> this. This addition is a part of a big project we have been working on
> >>> at the University of Minnesota; the project is called Spatial Hadoop
> >>> [http://spatialhadoop.cs.umn.edu]. It's about building a MapReduce
> >>> framework (Hadoop) that is capable of maintaining and analyzing spatial
> >>> data efficiently. I'm the main guy behind that project and since we
> >>> released its first version, we have received very encouraging responses
> >>> from different groups in the research and industrial community. I'm
> >>> sure the addition we want to make to Pig Latin will be widely accepted
> >>> by the people in the spatial community.
> >>> I'm proposing a plan here while we're still in the early phases of this
> >>> task to be able to discuss it with the main contributors and see its
> >>> feasibility. First of all, I think that we need to change the core of
> >>> Pig to be able to support spatial data. Providing a set of UDFs only is
> >>> not enough. The main reason is that Pig Latin does not provide a way to
> >>> create a new data type which is needed for spatial data. Once we have
> >>> the spatial data types we need, the functionality can be expanded using
> >>> more UDFs.
> >>>
> >>> Here's the plan as I see it.
> >>> 1- Introduce a new primitive data type Geometry which represents all
> >>> spatial data types. In the underlying system, this will map to
> >>> com.vividsolutions.jts.geom.Geometry. This is a class from Java
> >>> Topology Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a
> >>> stable and efficient open source Java library for spatial data types
> >>> and algorithms. It is very popular in the spatial community and a C++
> >>> port of it (GEOS) is used in PostGIS [http://postgis.net/] (a spatial
> >>> extension for PostgreSQL). JTS also conforms to the Simple Features
> >>> specification of the Open Geospatial Consortium (OGC)
> >>> [http://www.opengeospatial.org/], which is an open standard for spatial
> >>> data types. The Geometry data type is read from and written to text
> >>> files using the Well Known Text (WKT) format. There is also a way to
> >>> convert it to/from binary so that it can work with binary files and
> >>> streams.
> >>> 2- Add functions that manipulate spatial data types. These will be
> >>> added as UDFs and we will not need to mess with the internals of Pig.
> >>> Most probably, there will be one new class for each operation (e.g.,
> >>> union or intersection). I think it will be good to put these new
> >>> operations inside the core of Pig so that users can use them without
> >>> having to write the fully qualified class name. Also, since there is no
> >>> way to implicitly cast a spatial data type to a non-spatial data type,
> >>> there will not be any conflicts in existing operations or new
> >>> operations. All new operations, and only the new operations, will be
> >>> working on spatial data types. Here is an initial list of operations
> >>> that can be added. All those operations are already implemented in JTS
> >>> and the UDFs added to Pig will be just wrappers around them.
> >>> **Predicates (used for spatial filtering)
> >>> Equals
> >>> Disjoint
> >>> Intersects
> >>> Touches
> >>> Crosses
> >>> Within
> >>> Contains
> >>> Overlaps
> >>>
> >>> **Operations
> >>> Envelope
> >>> Area
> >>> Length
> >>> Buffer
> >>> ConvexHull
> >>> Intersection
> >>> Union
> >>> Difference
> >>> SymDifference
> >>>
> >>> **Aggregate functions
> >>> Accum
> >>> ConvexHull
> >>> Union
> >>>
> >>> 3- The third step is to implement spatial indexes (e.g., Grid or
> >>> R-tree). A Pig loader and Pig output classes will be created for those
> >>> indexes. Note that currently we have SpatialOutputFormat and
> >>> SpatialInputFormat for those indexes inside the Spatial Hadoop project,
> >>> but we need to tweak them to work with Pig.
> >>>
> >>> 4- (Advanced) Implement more sophisticated algorithms for spatial
> >>> operations that utilize the indexes. For example, we can have a
> >>> specific algorithm for spatial range query or spatial join. Again, we
> >>> already have algorithms built for different operations implemented in
> >>> Spatial Hadoop as MapReduce programs, but they will need to be modified
> >>> to work in the Pig environment and to interoperate with other
> >>> operations.
> >>>
> >>> This is my whole plan for the spatial extension to Pig. I've already
> >>> started with the first step but as I mentioned earlier, I don't want to
> >>> do the work for our project and then have it forgotten. I want to
> >>> contribute to Pig and do my research at the same time. If you think the
> >>> plan is plausible, I'll open JIRA issues for the above tasks and start
> >>> submitting patches for them. I'll conform to the standards of the
> >>> project, such as adding tests and commenting the code well.
> >>> Sorry for the long email and hope to hear back from you.
> >>>
> >>>
> >>> Best regards,
> >>> Ahmed Eldawy
> >>
> >>
>
>
