Thanks all for taking the time to respond.

Daniel, I didn't know that Solr uses JTS. This is a good finding and we can definitely ask them whether there is a workaround we can use.

Jonathan, I thought of the same idea of serializing/deserializing a bytearray each time a UDF is called. The deserialization part is also useful for letting Pig auto-detect spatial types when they are not set explicitly in the schema.

What is the best way to start this? I want to add an initial set of JIRA issues and start working on them, but I also need to keep the work grouped in some way, just for organization.
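To make the bytearray idea concrete, here is a minimal sketch of the round trip a UDF would perform on every call. It is only an illustration under stated assumptions: it handles a single 2D point encoded as OGC Well-Known Binary (WKB), it uses plain Python rather than JTS or the Pig UDF API, and the function names (`point_to_wkb`, `point_from_wkb`, `try_parse_point`) are hypothetical. The last function shows how a failed decode could serve as the "is this value spatial?" check mentioned above.

```python
import struct

WKB_POINT = 1        # OGC geometry type code for Point
WKB_POINT_SIZE = 21  # 1 byte order + 4 type + 8 x + 8 y


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Leading byte 1 marks little-endian byte order in WKB.
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def point_from_wkb(data):
    """Decode a WKB point; raise ValueError if the bytes are not one."""
    if len(data) != WKB_POINT_SIZE:
        raise ValueError("not a WKB point: wrong length")
    if data[0] not in (0, 1):
        raise ValueError("not a WKB point: bad byte-order flag")
    order = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(order + "Idd", data[1:])
    if geom_type != WKB_POINT:
        raise ValueError("not a WKB point: unexpected geometry type")
    return x, y


def try_parse_point(data):
    """Return (x, y) if the bytes decode as a WKB point, else None.

    A stand-in for auto-detecting spatial values in an untyped bytearray.
    """
    try:
        return point_from_wkb(data)
    except ValueError:
        return None


if __name__ == "__main__":
    raw = point_to_wkb(1.5, -2.0)   # what would travel between operators
    print(len(raw), point_from_wkb(raw))
    print(try_parse_point(b"not a geometry"))
```

A real implementation would delegate to the library's own WKB reader and writer and cover every geometry type; the point here is only that each UDF pays one decode on input and one encode on output.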
Thanks
Ahmed

Best regards,
Ahmed Eldawy


On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> I agree that this is cool, and if other projects are using JTS it is worth
> talking to them to see how. I also agree that licensing is very frustrating.
>
> In the short term, however, while it is annoying to have to manage the
> serialization and deserialization yourself, you can have the geometry type
> be passed around as a bytearray type. Your UDFs will have to know this and
> treat it accordingly, but if you did this then all of the tools could be in
> an external project on GitHub instead of a branch in Pig. Then, if we can
> get the licensing done, we could add the Geometry type to Pig. Adding
> types, honestly, is kind of tedious but not super difficult, so once the
> rest is done, that shouldn't be too difficult.
>
>
> 2013/5/4 Russell Jurney <russell.jur...@gmail.com>
>
> > If a way could be found, this would be an awesome addition to Pig.
> >
> > Russell Jurney http://datasyndrome.com
> >
> > On May 3, 2013, at 4:09 PM, Daniel Dai <da...@hortonworks.com> wrote:
> >
> > > I am not sure how other Apache projects are dealing with it. Seems Solr
> > > also has some connector to JTS?
> > >
> > > Thanks,
> > > Daniel
> > >
> > >
> > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy <aseld...@gmail.com> wrote:
> > >
> > >> Thanks Alan for your interest. It's too bad that an open source licensing
> > >> issue is holding me back from doing some open source work. I understand the
> > >> issue and your workarounds make sense. However, as I mentioned in the
> > >> beginning, I don't want to have my own branch of Pig because it makes my
> > >> extension less portable. I'll think of another way to do it. I'll ask Vivid
> > >> Solutions if they can dual license their code, although I think the answer
> > >> will be no.
> > >> I'll also think of a way to ship my extension as a set of jar
> > >> files without the need to change the core of Pig. This way, it can be
> > >> easily ported to newer versions of Pig.
> > >>
> > >> Thanks
> > >> Ahmed
> > >>
> > >> Best regards,
> > >> Ahmed Eldawy
> > >>
> > >>
> > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <ga...@hortonworks.com> wrote:
> > >>
> > >>> I know this is frustrating, but the different licenses do have different
> > >>> requirements that make it so that Apache can't ship GPL code. A legal
> > >>> explanation is at http://www.apache.org/licenses/GPL-compatibility.html
> > >>> For additional info on the LGPL specific questions see
> > >>> http://www.apache.org/legal/3party.html
> > >>>
> > >>> As far as pulling it in via ivy, the issue isn't so much where the code
> > >>> lives as much as what code we are requiring to make Pig work. If something
> > >>> that is [L]GPL is required for Pig it violates Apache rules as outlined
> > >>> above. It also would be a show stopper for a lot of companies that
> > >>> redistribute Pig and that are allergic to GPL software.
> > >>>
> > >>> So, as I said before, if you wanted to continue with that library and they
> > >>> are not willing to relicense it then it would have to be bolted on after
> > >>> Apache Pig is built. Nothing stops you from doing this by downloading
> > >>> Apache Pig, adding this library and your code, and redistributing, though
> > >>> it wouldn't then be open to all Pig users.
> > >>>
> > >>> Alan.
> > >>>
> > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > >>>
> > >>>> Thanks for your response. I was never good at differentiating all those
> > >>>> open source licenses. I mean what is the point making open source licenses
> > >>>> if it blocks me from using a library in an open source project. Any way,
> > >>>> I'm not going into debate here.
> > >>>> Just one question: if we use JTS as a
> > >>>> library (jar file) without adding the code in Pig, is it still a violation?
> > >>>> We'll use ivy, for example, to download the jar file when compiling.
> > >>>>
> > >>>> On May 1, 2013 7:50 PM, "Alan Gates" <ga...@hortonworks.com> wrote:
> > >>>>
> > >>>>> Passing on the technical details for a moment, I see a licensing issue.
> > >>>>> JTS is licensed under LGPL. Apache projects cannot contain or ship
> > >>>>> [L]GPL. Apache does not meet the requirements of GPL and thus we cannot
> > >>>>> repackage their code. If you wanted to go forward using that class this
> > >>>>> would have to be packaged as an add on that was downloaded separately and
> > >>>>> not from Apache. Another option is to work with the JTS community and see
> > >>>>> if they are willing to dual license their code under BSD or Apache license
> > >>>>> so that Pig could include it. If neither of those is an option you would
> > >>>>> need to come up with a new class to contain your spatial data.
> > >>>>>
> > >>>>> Alan.
> > >>>>>
> > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>> First, sorry for the long email. I wanted to put all my thoughts here and
> > >>>>>> get your feedback.
> > >>>>>> I'm proposing a major addition to Pig that will greatly increase its
> > >>>>>> functionality and user base. It is simply to add spatial support to the
> > >>>>>> language and the framework. I've already started working on that but I
> > >>>>>> don't want it to be just another branch. I want it, eventually, to be
> > >>>>>> merged with the trunk of Apache Pig. So, I'm sending this email mainly to
> > >>>>>> reach out to the main contributors of Pig to see the feasibility of this.
> > >>>>>> This addition is a part of a big project we have been working on in
> > >>>>>> University of Minnesota; the project is called Spatial Hadoop.
> > >>>>>> http://spatialhadoop.cs.umn.edu. It's about building a MapReduce framework
> > >>>>>> (Hadoop) that is capable of maintaining and analyzing spatial data
> > >>>>>> efficiently. I'm the main guy behind that project and since we released its
> > >>>>>> first version, we received very encouraging responses from different groups
> > >>>>>> in the research and industrial community. I'm sure the addition we want to
> > >>>>>> make to Pig Latin will be widely accepted by the people in the spatial
> > >>>>>> community.
> > >>>>>> I'm proposing a plan here while we're still in the early phases of this
> > >>>>>> task to be able to discuss it with the main contributors and see its
> > >>>>>> feasibility. First of all, I think that we need to change the core of Pig
> > >>>>>> to be able to support spatial data. Providing a set of UDFs only is not
> > >>>>>> enough. The main reason is that Pig Latin does not provide a way to create
> > >>>>>> a new data type which is needed for spatial data. Once we have the spatial
> > >>>>>> data types we need, the functionality can be expanded using more UDFs.
> > >>>>>>
> > >>>>>> Here's the plan as I see it.
> > >>>>>> 1- Introduce a new primitive data type Geometry which represents all
> > >>>>>> spatial data types. In the underlying system, this will map to
> > >>>>>> com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology
> > >>>>>> Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
> > >>>>>> efficient open source Java library for spatial data types and algorithms.
> > >>>>>> It is very popular in the spatial community and a C++ port of it is used in
> > >>>>>> PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
> > >>>>>> conforms with the Open Geospatial Consortium (OGC)
> > >>>>>> [http://www.opengeospatial.org/] specifications, which are open standards
> > >>>>>> for spatial data types. The Geometry data type is read from and written to
> > >>>>>> text files using the Well Known Text (WKT) format. There is also a way to
> > >>>>>> convert it to/from binary so that it can work with binary files and streams.
> > >>>>>> 2- Add functions that manipulate spatial data types. These will be added as
> > >>>>>> UDFs and we will not need to mess with the internals of Pig. Most probably,
> > >>>>>> there will be one new class for each operation (e.g., union or
> > >>>>>> intersection). I think it will be good to put these new operations inside
> > >>>>>> the core of Pig so that users can use them without having to write the fully
> > >>>>>> qualified class name. Also, since there is no way to implicitly cast a
> > >>>>>> spatial data type to a non-spatial data type, there will not be any
> > >>>>>> conflicts in existing operations or new operations. All new operations, and
> > >>>>>> only the new operations, will be working on spatial data types. Here is an
> > >>>>>> initial list of operations that can be added. All those operations are
> > >>>>>> already implemented in JTS and the UDFs added to Pig will be just wrappers
> > >>>>>> around them.
> > >>>>>> **Predicates (used for spatial filtering)
> > >>>>>> Equals
> > >>>>>> Disjoint
> > >>>>>> Intersects
> > >>>>>> Touches
> > >>>>>> Crosses
> > >>>>>> Within
> > >>>>>> Contains
> > >>>>>> Overlaps
> > >>>>>>
> > >>>>>> **Operations
> > >>>>>> Envelope
> > >>>>>> Area
> > >>>>>> Length
> > >>>>>> Buffer
> > >>>>>> ConvexHull
> > >>>>>> Intersection
> > >>>>>> Union
> > >>>>>> Difference
> > >>>>>> SymDifference
> > >>>>>>
> > >>>>>> **Aggregate functions
> > >>>>>> Accum
> > >>>>>> ConvexHull
> > >>>>>> Union
> > >>>>>>
> > >>>>>> 3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A
> > >>>>>> Pig loader and Pig output classes will be created for those indexes. Note
> > >>>>>> that currently we have SpatialOutputFormat and SpatialInputFormat for those
> > >>>>>> indexes inside the Spatial Hadoop project, but we need to tweak them to
> > >>>>>> work with Pig.
> > >>>>>>
> > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for spatial
> > >>>>>> operations that utilize the indexes. For example, we can have a specific
> > >>>>>> algorithm for spatial range query or spatial join. Again, we already have
> > >>>>>> algorithms built for different operations implemented in Spatial Hadoop as
> > >>>>>> MapReduce programs, but they will need to be modified to work in Pig
> > >>>>>> environment and get to work with other operations.
> > >>>>>>
> > >>>>>> This is my whole plan for the spatial extension to Pig. I've already
> > >>>>>> started with the first step but as I mentioned earlier, I don't want to do
> > >>>>>> the work for our project and then the work gets forgotten. I want to
> > >>>>>> contribute to Pig and do my research at the same time. If you think the
> > >>>>>> plan is plausible, I'll open JIRA issues for the above tasks and start
> > >>>>>> shipping patches to do the stuff.
> > >>>>>> I'll conform with the standards of the
> > >>>>>> project such as adding tests and well commenting the code.
> > >>>>>> Sorry for the long email and hope to hear back from you.
> > >>>>>>
> > >>>>>>
> > >>>>>> Best regards,
> > >>>>>> Ahmed Eldawy
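
Steps 3 and 4 of the quoted plan (a grid index, then a range query that uses it) can be illustrated with a small standalone sketch. This is not Spatial Hadoop or JTS code: geometries are reduced to their envelopes `(xmin, ymin, xmax, ymax)`, the envelope test stands in for an exact Intersects predicate, and every name here (`intersects`, `cells`, `build_grid`, `range_query`) is hypothetical.

```python
from collections import defaultdict


def intersects(a, b):
    """True if two envelopes (xmin, ymin, xmax, ymax) share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cells(env, cell_size):
    """Yield every uniform-grid cell (col, row) that an envelope overlaps."""
    xmin, ymin, xmax, ymax = env
    for col in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
        for row in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
            yield (col, row)


def build_grid(envelopes, cell_size):
    """Map each grid cell to the indexes of the envelopes overlapping it."""
    grid = defaultdict(list)
    for i, env in enumerate(envelopes):
        for cell in cells(env, cell_size):
            grid[cell].append(i)
    return grid


def range_query(grid, envelopes, query, cell_size):
    """Return sorted indexes of envelopes that intersect the query envelope.

    Only cells overlapped by the query are inspected; the set removes
    duplicates for envelopes that span several cells.
    """
    candidates = set()
    for cell in cells(query, cell_size):
        candidates.update(grid.get(cell, ()))
    return sorted(i for i in candidates if intersects(envelopes[i], query))


if __name__ == "__main__":
    data = [(0, 0, 1, 1), (5, 5, 6, 6), (0.5, 0.5, 5.5, 5.5)]
    grid = build_grid(data, cell_size=2)
    print(range_query(grid, data, (0, 0, 1, 1), cell_size=2))
```

The design choice being sketched is filter-then-refine: the grid cheaply narrows the candidates, and the predicate is evaluated only on those. In a MapReduce setting each cell would roughly correspond to a partition, which is an assumption of this sketch rather than something the thread specifies.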