Maybe also look at newer Big Data tech like AsterixDB (in the Incubator), and at search tech like Lucene and/or Solr?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Marc Le Bihan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, November 10, 2015 at 1:44 PM
To: "[email protected]" <[email protected]>
Cc: Mark Giaconia <[email protected]>
Subject: Re: Long-term thoughts about big-data queries in SIS

>Hello,
>
>Here is a status update on the development of the SQL driver:
>
> 1) The driver currently works for SELECT statements over DBase III
>files, provided the WHERE clause used (if any) is limited to simple
>conditions.
>
> 2) I am currently exercising it with real DBase III files from
>various places in order to challenge it.
>
> 3) Parsing of statements is now my main difficulty, and this subject
>started a debate a few months ago: if I continue clause by clause
>(attempting to detect a GROUP BY, a HAVING, a LIKE... "manually"), it
>will be long and difficult.
> If I use a parser like ANTLR, it will be powerful and complete,
>but that API is known to be really hard to handle and to get working
>perfectly. I have used it four times, and I still approach it with
>caution each time. But I think it is the only solution.
>
> 4) The UPDATE statement could come quickly.
>
> 5) For DELETE, I have to see whether a logical delete can be done to
>avoid rewriting the whole file, and for INSERT a new entry will have
>to be set.
>It's not easy, because if an index file comes with the DBase III file,
>I have to update it too.
> I also have to find a way for the Shapefile reader to keep
>following the content of the DBase file. If I delete a record in the
>DBase file, the associated entry in the Shapefile should no longer be
>valid, for example.
> And removals or insertions in the Shapefile would require some
>changes in its index files too.
>
> 6) The interfaces and the abstract classes used to help implement
>the DBase III connection, statement, result set and metadata will be
>helpful for developing another driver for another kind of database, if
>needed. But these abstract classes might still change:
> I expect many discoveries before the end.
>
> 7) What is the right goal after this first part of the work (the
>CRUD operations): being able to handle transactions, or being able to
>implement JPA interfaces? Both are valuable goals.
>
>Regards,
>
>Marc Le Bihan
>
>
>-----Original Message-----
>From: Adam Estrada
>Sent: Tuesday, November 10, 2015 7:20 PM
>To: [email protected]
>Cc: Mark Giaconia
>Subject: Re: Long-term thoughts about big-data queries in SIS
>
>Martin,
>
>This is extremely cool and much needed in the geospatial community! My
>company, DigitalGlobe, has done a lot with this and has open-sourced
>many of the packages, which can be found on GitHub today. Rasdaman [1]
>and PostGIS Raster are other open-source examples of how to do this in
>relational databases. We have also done a lot of research on how to
>store pixels and query for them in HBase/Hadoop and Elasticsearch.
>There are many options for this one!
>
>Adam
>
>[1] http://rasdaman.org/
>
>On Tue, Nov 10, 2015 at 6:09 AM, Martin Desruisseaux
><[email protected]> wrote:
>> Hello all,
>>
>> At the Apache Big Data Conference in Budapest, I attended some
>> meetings about exploiting geospatial big data using the SQL language.
>> I thought that we could make some long-term plans that could impact
>> the SIS-180 ("Place a crude JDBC driver over DBase files") work [1].
>> This email is not a request for any change now; it is just a proposal
>> about some possible long-term plans.
>>
>> In one or two years, Apache SIS will hopefully have some DataStore
>> implementations ready for production use. But we have a strong
>> request for the capability to use DataStores with big-data
>> technologies like Hadoop. This request increases the challenge of
>> writing a SQL driver, since a sophisticated SQL driver would need to
>> be able to restructure query plans according to the available
>> clusters.
>>
>> I had a discussion with people from the Apache Drill project
>> (https://drill.apache.org/), which already provides SQL parsing
>> capabilities in various big-data environments. In my understanding,
>> instead of writing our own SQL parser in SIS, we could take the
>> following approach:
>>
>> 1. Complete the org.apache.sis.storage.DataStore API (it is
>> currently very minimalist).
>> 2. Have the Shapefile store extend the abstract SIS DataStore.
>> 3. In a separate module, write a "SIS DataStore to Drill DataStore"
>> adapter. It should work for any SIS DataStore, not only the
>> Shapefile one.
>>
>> In my understanding, once we have a Drill DataStore implementation
>> (I do not know yet what the exact name is in the Drill API), we
>> should automatically get big-data-ready SQL for any SIS DataStore.
>> If for any reason the Drill DataStore is considered not suitable, we
>> could fall back on Apache Calcite (http://calcite.apache.org/),
>> which is the SQL parser used under the hood by Drill. Another
>> project that may be worth exploring is Magellan: Geospatial
>> Analytics on Spark [2].
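[Editor's note: the adapter idea in step 3 could be sketched very roughly as below. All names here (SimpleStore, StoreToSqlAdapter) are invented for illustration; they are not the real SIS DataStore or Drill APIs, only a sketch of the shape such an adapter might take: the store exposes columns and rows, and the adapter presents them the way a SQL engine's scan operator would consume them.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a "store to SQL engine" adapter.
// SimpleStore stands in for a SIS DataStore; StoreToSqlAdapter stands in
// for the Drill/Calcite-facing side. Neither name is a real API.
public class StoreToSqlAdapter {

    /** Minimal stand-in for a DataStore: named columns plus rows. */
    interface SimpleStore {
        List<String> columns();
        Iterator<Object[]> rows();
    }

    private final SimpleStore store;

    public StoreToSqlAdapter(SimpleStore store) {
        this.store = store;
    }

    /** Projects one named column, as a SQL engine's scan operator might. */
    public List<Object> scanColumn(String column) {
        int index = store.columns().indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        List<Object> values = new ArrayList<>();
        for (Iterator<Object[]> it = store.rows(); it.hasNext();) {
            values.add(it.next()[index]);
        }
        return values;
    }

    public static void main(String[] args) {
        // A toy store with two columns, standing in for a Shapefile's DBF part.
        SimpleStore store = new SimpleStore() {
            public List<String> columns() { return Arrays.asList("NAME", "POP"); }
            public Iterator<Object[]> rows() {
                return Arrays.<Object[]>asList(
                        new Object[] {"Paris", 2_148_000},
                        new Object[] {"Lyon",    516_000}).iterator();
            }
        };
        StoreToSqlAdapter adapter = new StoreToSqlAdapter(store);
        System.out.println(adapter.scanColumn("NAME"));  // [Paris, Lyon]
    }
}
```

The point of the sketch is the one made in the email: the adapter only depends on the abstract store interface, so it works for any store implementation, not only the Shapefile one.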
>>
>> My proposal could be summarized as follows: maybe in 2016 or 2017,
>> we could consider putting the SIS SQL support in its own module and
>> allow it to run not only for the Shapefile store but for any SIS
>> DataStore, if possible using a technology like Drill that is
>> designed for big-data environments.
>>
>> Any thoughts?
>>
>> Martin
>>
>>
>> [1] https://issues.apache.org/jira/browse/SIS-180
>> [2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
>>
>>
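[Editor's note: the "logical delete" Marc mentions in point 5 is directly supported by the dBASE III format: every record starts with a one-byte deletion flag, 0x20 (space) for an active record and 0x2A ('*') for a deleted one, so DELETE can flip that byte in place instead of rewriting the file. The sketch below demonstrates the flag on an in-memory record area; the record layout is a toy example, not the actual SIS driver code.]

```java
import java.nio.charset.StandardCharsets;

// Logical delete in a dBASE III record area: records are fixed-size and
// each begins with a deletion-flag byte (' ' = active, '*' = deleted).
public class DbfLogicalDelete {

    /** Marks the record at recordIndex as deleted by flipping its flag byte. */
    static void deleteRecord(byte[] records, int recordLength, int recordIndex) {
        records[recordIndex * recordLength] = (byte) '*';
    }

    static boolean isDeleted(byte[] records, int recordLength, int recordIndex) {
        return records[recordIndex * recordLength] == (byte) '*';
    }

    public static void main(String[] args) {
        // Two 6-byte records: one flag byte followed by 5 bytes of data.
        byte[] records = " Paris Lyon ".getBytes(StandardCharsets.US_ASCII);
        int recordLength = 6;

        deleteRecord(records, recordLength, 1);  // logically delete "Lyon"

        System.out.println(isDeleted(records, recordLength, 0));  // false
        System.out.println(isDeleted(records, recordLength, 1));  // true
    }
}
```

As the thread notes, this only solves half the problem: any .shp/.shx entries and index files associated with the record still have to be kept consistent separately.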

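[Editor's note: the "clause by clause" approach Marc weighs against ANTLR in point 3 can be illustrated for the one case the driver already handles: a WHERE clause limited to a single simple condition. The sketch below is illustrative only, with invented names and a record reduced to a field-to-number map; it is not the actual SIS driver code.]

```java
import java.util.Map;

// Hand-rolled evaluation of a simple WHERE condition of the form
// "<field> <operator> <number>", e.g. "POP > 1000000".
public class SimpleWhere {

    /** Evaluates one simple condition against one record. */
    static boolean matches(String condition, Map<String, Double> record) {
        String[] parts = condition.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a simple condition: " + condition);
        }
        double field = record.get(parts[0]);
        double value = Double.parseDouble(parts[2]);
        switch (parts[1]) {
            case "=":  return field == value;
            case "<>": return field != value;
            case "<":  return field <  value;
            case ">":  return field >  value;
            case "<=": return field <= value;
            case ">=": return field >= value;
            default: throw new IllegalArgumentException("Unsupported operator: " + parts[1]);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> record = Map.of("POP", 2_148_000.0);
        System.out.println(matches("POP > 1000000", record));  // true
        System.out.println(matches("POP <= 500000", record));  // false
    }
}
```

This illustrates Marc's trade-off: each additional clause (GROUP BY, HAVING, LIKE, boolean combinations...) would need its own hand-written detection and evaluation logic, which is why a generated parser like ANTLR, despite its learning curve, looks like the only scalable option.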