Re: A major addition to Pig. Working with spatial data

Jonathan Coveney Mon, 06 May 2013 07:54:17 -0700

You can give them all the same label or tag and filter on that later on.


2013/5/6 Ahmed Eldawy <aseld...@gmail.com>

> Thanks all for taking the time to respond. Daniel, I didn't know that Solr
> uses JTS. This is a good finding and we can definitely ask them to see if
> there is a workaround we can do. Jonathan, I thought of the same idea of
> serializing/deserializing a bytearray each time a UDF is called. The
> deserialization part is good for letting Pig auto detect spatial types if
> not set explicitly in the schema. What is the best way to start this? I
> want to add an initial set of JIRA issues and start working on them but I
> also need to keep the work grouped in some sense just for organization.
>
> Thanks
> Ahmed
>
> Best regards,
> Ahmed Eldawy
>
>
> On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <jcove...@gmail.com>
> wrote:
>
> > I agree that this is cool, and if other projects are using JTS it is worth
> > talking to them to see how. I also agree that licensing is very frustrating.
> >
> > In the short term, however, while it is annoying to have to manage the
> > serialization and deserialization yourself, you can have the geometry type
> > be passed around as a bytearray type. Your UDFs will have to know this and
> > treat it accordingly, but if you did this then all of the tools could be in
> > an external project on GitHub instead of a branch in Pig. Then, if we can
> > get the licensing done, we could add the Geometry type to Pig. Adding
> > types, honestly, is kind of tedious but not super difficult, so once the
> > rest is done, that shouldn't be too difficult.
> >
> >
> > 2013/5/4 Russell Jurney <russell.jur...@gmail.com>
> >
> > > If a way could be found, this would be an awesome addition to Pig.
> > >
> > > Russell Jurney http://datasyndrome.com
> > >
> > > On May 3, 2013, at 4:09 PM, Daniel Dai <da...@hortonworks.com> wrote:
> > >
> > > > I am not sure how other Apache projects are dealing with it. It seems
> > > > Solr also has some connector to JTS?
> > > >
> > > > Thanks,
> > > > Daniel
> > > >
> > > >
> > > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy <aseld...@gmail.com>
> > > wrote:
> > > >
> > > >> Thanks Alan for your interest. It's too bad that an open source licensing
> > > >> issue is holding me back from doing some open source work. I understand the
> > > >> issue and your workarounds make sense. However, as I mentioned in the
> > > >> beginning, I don't want to have my own branch of Pig because it makes my
> > > >> extension less portable. I'll think of another way to do it. I'll ask Vivid
> > > >> Solutions if they can dual license their code, although I think the answer
> > > >> will be no. I'll also think of a way to ship my extension as a set of jar
> > > >> files without the need to change the core of Pig. This way, it can be
> > > >> easily ported to newer versions of Pig.
> > > >>
> > > >> Thanks
> > > >> Ahmed
> > > >>
> > > >> Best regards,
> > > >> Ahmed Eldawy
> > > >>
> > > >>
> > > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <ga...@hortonworks.com>
> > > wrote:
> > > >>
> > > >>> I know this is frustrating, but the different licenses do have different
> > > >>> requirements that make it so that Apache can't ship GPL code.  A legal
> > > >>> explanation is at http://www.apache.org/licenses/GPL-compatibility.html
> > > >>> For additional info on the LGPL specific questions see
> > > >>> http://www.apache.org/legal/3party.html
> > > >>>
> > > >>> As far as pulling it in via ivy, the issue isn't so much where the code
> > > >>> lives as much as what code we are requiring to make Pig work.  If something
> > > >>> that is [L]GPL is required for Pig it violates Apache rules as outlined
> > > >>> above.  It also would be a show stopper for a lot of companies that
> > > >>> redistribute Pig and that are allergic to GPL software.
> > > >>>
> > > >>> So, as I said before, if you wanted to continue with that library and they
> > > >>> are not willing to relicense it then it would have to be bolted on after
> > > >>> Apache Pig is built.  Nothing stops you from doing this by downloading
> > > >>> Apache Pig, adding this library and your code, and redistributing, though
> > > >>> it wouldn't then be open to all Pig users.
> > > >>>
> > > >>> Alan.
> > > >>>
> > > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > > >>>
> > > >>>> Thanks for your response. I was never good at differentiating all those
> > > >>>> open source licenses. I mean, what is the point of making open source
> > > >>>> licenses if they block me from using a library in an open source project?
> > > >>>> Anyway, I'm not going into a debate here. Just one question: if we use JTS
> > > >>>> as a library (jar file) without adding the code in Pig, is it still a
> > > >>>> violation? We'll use ivy, for example, to download the jar file when
> > > >>>> compiling.
> > > >>>> On May 1, 2013 7:50 PM, "Alan Gates" <ga...@hortonworks.com>
> wrote:
> > > >>>>
> > > >>>>> Passing on the technical details for a moment, I see a licensing issue.
> > > >>>>> JTS is licensed under LGPL.  Apache projects cannot contain or ship
> > > >>>>> [L]GPL.  Apache does not meet the requirements of GPL and thus we cannot
> > > >>>>> repackage their code. If you wanted to go forward using that class, this
> > > >>>>> would have to be packaged as an add-on that was downloaded separately and
> > > >>>>> not from Apache.  Another option is to work with the JTS community and see
> > > >>>>> if they are willing to dual license their code under BSD or Apache license
> > > >>>>> so that Pig could include it.  If neither of those is an option, you would
> > > >>>>> need to come up with a new class to contain your spatial data.
> > > >>>>>
> > > >>>>> Alan.
> > > >>>>>
> > > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > > >>>>>
> > > >>>>>> Hi all,
> > > >>>>>> First, sorry for the long email. I wanted to put all my thoughts here and
> > > >>>>>> get your feedback.
> > > >>>>>> I'm proposing a major addition to Pig that will greatly increase its
> > > >>>>>> functionality and user base. It is simply to add spatial support to the
> > > >>>>>> language and the framework. I've already started working on that but I
> > > >>>>>> don't want it to be just another branch. I want it, eventually, to be
> > > >>>>>> merged with the trunk of Apache Pig. So, I'm sending this email mainly to
> > > >>>>>> reach out to the main contributors of Pig to see the feasibility of this.
> > > >>>>>> This addition is a part of a big project we have been working on at the
> > > >>>>>> University of Minnesota; the project is called Spatial Hadoop
> > > >>>>>> (http://spatialhadoop.cs.umn.edu). It's about building a MapReduce
> > > >>>>>> framework (Hadoop) that is capable of maintaining and analyzing spatial
> > > >>>>>> data efficiently. I'm the main guy behind that project and since we
> > > >>>>>> released its first version, we have received very encouraging responses
> > > >>>>>> from different groups in the research and industrial community. I'm sure
> > > >>>>>> the addition we want to make to Pig Latin will be widely accepted by the
> > > >>>>>> people in the spatial community.
> > > >>>>>> I'm proposing a plan here while we're still in the early phases of this
> > > >>>>>> task to be able to discuss it with the main contributors and see its
> > > >>>>>> feasibility. First of all, I think that we need to change the core of Pig
> > > >>>>>> to be able to support spatial data. Providing a set of UDFs only is not
> > > >>>>>> enough. The main reason is that Pig Latin does not provide a way to create
> > > >>>>>> a new data type, which is needed for spatial data. Once we have the spatial
> > > >>>>>> data types we need, the functionality can be expanded using more UDFs.
> > > >>>>>>
> > > >>>>>> Here's the plan as I see it.
> > > >>>>>> 1- Introduce a new primitive data type Geometry which represents all
> > > >>>>>> spatial data types. In the underlying system, this will map to
> > > >>>>>> com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology
> > > >>>>>> Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
> > > >>>>>> efficient open source Java library for spatial data types and algorithms.
> > > >>>>>> It is very popular in the spatial community and a C++ port of it is used in
> > > >>>>>> PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
> > > >>>>>> conforms with Open Geospatial Consortium (OGC)
> > > >>>>>> [http://www.opengeospatial.org/] which is an open standard for the spatial
> > > >>>>>> data types. The Geometry data type is read from and written to text files
> > > >>>>>> using the Well Known Text (WKT) format. There is also a way to convert it
> > > >>>>>> to/from binary so that it can work with binary files and streams.
> > > >>>>>> 2- Add functions that manipulate spatial data types. These will be added as
> > > >>>>>> UDFs and we will not need to mess with the internals of Pig. Most probably,
> > > >>>>>> there will be one new class for each operation (e.g., union or
> > > >>>>>> intersection). I think it will be good to put these new operations inside
> > > >>>>>> the core of Pig so that users can use them without having to write the
> > > >>>>>> fully qualified class name. Also, since there is no way to implicitly cast
> > > >>>>>> a spatial data type to a non-spatial data type, there will not be any
> > > >>>>>> conflicts in existing operations or new operations. All new operations,
> > > >>>>>> and only the new operations, will be working on spatial data types. Here is
> > > >>>>>> an initial list of operations that can be added. All those operations are
> > > >>>>>> already implemented in JTS and the UDFs added to Pig will be just wrappers
> > > >>>>>> around them.
> > > >>>>>> **Predicates (used for spatial filtering)
> > > >>>>>> Equals
> > > >>>>>> Disjoint
> > > >>>>>> Intersects
> > > >>>>>> Touches
> > > >>>>>> Crosses
> > > >>>>>> Within
> > > >>>>>> Contains
> > > >>>>>> Overlaps
> > > >>>>>>
> > > >>>>>> **Operations
> > > >>>>>> Envelope
> > > >>>>>> Area
> > > >>>>>> Length
> > > >>>>>> Buffer
> > > >>>>>> ConvexHull
> > > >>>>>> Intersection
> > > >>>>>> Union
> > > >>>>>> Difference
> > > >>>>>> SymDifference
> > > >>>>>>
> > > >>>>>> **Aggregate functions
> > > >>>>>> Accum
> > > >>>>>> ConvexHull
> > > >>>>>> Union
> > > >>>>>>
> > > >>>>>> 3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A
> > > >>>>>> Pig loader and Pig output classes will be created for those indexes. Note
> > > >>>>>> that currently we have SpatialOutputFormat and SpatialInputFormat for those
> > > >>>>>> indexes inside the Spatial Hadoop project, but we need to tweak them to
> > > >>>>>> work with Pig.
> > > >>>>>>
> > > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for spatial
> > > >>>>>> operations that utilize the indexes. For example, we can have a specific
> > > >>>>>> algorithm for spatial range query or spatial join. Again, we already have
> > > >>>>>> algorithms built for different operations implemented in Spatial Hadoop as
> > > >>>>>> MapReduce programs, but they will need to be modified to work in the Pig
> > > >>>>>> environment and to work together with other operations.
> > > >>>>>>
> > > >>>>>> This is my whole plan for the spatial extension to Pig. I've already
> > > >>>>>> started with the first step but, as I mentioned earlier, I don't want to do
> > > >>>>>> the work for our project and then have the work get forgotten. I want to
> > > >>>>>> contribute to Pig and do my research at the same time. If you think the
> > > >>>>>> plan is plausible, I'll open JIRA issues for the above tasks and start
> > > >>>>>> shipping patches to do the stuff. I'll conform with the standards of the
> > > >>>>>> project such as adding tests and commenting the code well.
> > > >>>>>> Sorry for the long email and hope to hear back from you.
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> Best regards,
> > > >>>>>> Ahmed Eldawy
> > > >>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > >
> >
>
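
The short-term approach discussed in the thread (pass geometries around as a plain bytearray and let each UDF serialize and deserialize them) can be illustrated with a minimal sketch. This is not code from the thread and it uses neither JTS nor Pig: it hand-encodes a single 2D point in the OGC Well Known Binary (WKB) layout using only the Python standard library. The function names point_to_wkb, wkb_to_point and within_box are hypothetical, chosen for illustration; a real extension would presumably delegate the encoding and the predicates to a geometry library.

```python
import struct

WKB_POINT = 1        # OGC WKB geometry type code for a 2D Point
LITTLE_ENDIAN = 1    # WKB byte-order flag: 1 = little endian, 0 = big endian


def point_to_wkb(x, y):
    """Serialize a 2D point as little-endian WKB (21 bytes).

    Layout: 1-byte byte-order flag, 4-byte geometry type, two 8-byte doubles.
    """
    return struct.pack("<BIdd", LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(data):
    """Deserialize a WKB point, accepting either byte order."""
    if len(data) != 21:
        raise ValueError("expected 21 bytes for a WKB point")
    if data[0] not in (0, 1):
        raise ValueError("invalid WKB byte-order flag")
    order = "<" if data[0] == LITTLE_ENDIAN else ">"
    geom_type, x, y = struct.unpack(order + "Idd", data[1:])
    if geom_type != WKB_POINT:
        raise ValueError("not a WKB point")
    return x, y


def within_box(data, xmin, ymin, xmax, ymax):
    """Hypothetical UDF-style predicate.

    It receives the opaque bytes, deserializes them on every call,
    and only then evaluates the spatial test.
    """
    x, y = wkb_to_point(data)
    return xmin <= x <= xmax and ymin <= y <= ymax


if __name__ == "__main__":
    blob = point_to_wkb(1.5, -2.0)         # what would travel as a bytearray
    print(len(blob), wkb_to_point(blob))   # size in bytes and the decoded point
    print(within_box(blob, 0, -5, 2, 0))   # point-in-rectangle filter
```

The design cost is the one named in the thread: every function pays for a deserialization on each call, and nothing in the schema says that the bytes are a geometry, so each function has to validate its input itself.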
