Maybe also look at newer Big Data tech like AsterixDB (in the Incubator), and at search tech like Lucene and/or Solr?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Marc Le Bihan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, November 10, 2015 at 1:44 PM
To: "[email protected]" <[email protected]>
Cc: Mark Giaconia <[email protected]>
Subject: Re: Long-term thoughts about big-data queries in SIS

>Hello,
>
>Here is a status update on the development of the SQL driver:
>
> 1) The driver currently works for SELECT statements over DBase III
>files, provided the WHERE clause used (if any) is limited to simple
>conditions.
>
> 2) I am currently exercising it with real DBase III files from
>various places in order to challenge it.
>
> 3) Parsing of statements is now my main difficulty, and this subject
>started a debate a few months ago: if I continue clause by clause
>(attempting to detect a GROUP BY, a HAVING, a LIKE... "manually"), it
>will be long and difficult.
> If I use a parser like ANTLR, it will be powerful and complete,
>but that API is known to be really hard to handle and to get working
>perfectly. I have used it four times, and I still approach it with
>caution each time. But I think it is the only solution.
>
> 4) The UPDATE statement could come quickly.
>
> 5) For DELETE, I have to see whether a logical delete can be done to
>avoid rewriting the whole file, and for INSERT a new entry will have
>to be set.
>It's not easy, because if an index file comes with the DBase III file,
>I have to update it too.
> I also have to find a way for the Shapefile reader to keep
>following the content of the DBase file. If I delete a record in the
>DBase file, the associated entry in the Shapefile should no longer be
>valid, for example.
> And removals or insertions in the Shapefile would require some
>changes in its index files too.
>
> 6) The interfaces and the abstract classes used to help implement
>the DBase III connection, statement, result set and metadata will be
>helpful for developing another driver for another kind of database, if
>needed. But these abstract classes might still change:
> I expect many discoveries before the end.
>
> 7) What is the right goal after this first part of the work (the
>CRUD operations): being able to handle transactions, or being able to
>implement JPA interfaces? Both are valuable goals.
>
>Regards,
>
>Marc Le Bihan
>
>
>-----Original Message-----
>From: Adam Estrada
>Sent: Tuesday, November 10, 2015 7:20 PM
>To: [email protected]
>Cc: Mark Giaconia
>Subject: Re: Long-term thoughts about big-data queries in SIS
>
>Martin,
>
>This is extremely cool and much needed in the geospatial community! My
>company, DigitalGlobe, has done a lot with this and has open-sourced
>many of the packages, which can be found on GitHub today. Rasdaman [1]
>and PostGIS Raster are other open-source examples of how to do this in
>relational databases. We have also done a lot of research on how to
>store pixels and query for them in HBase/Hadoop and Elasticsearch.
>There are many options for this one!
>
>Adam
>
>[1] http://rasdaman.org/
>
>On Tue, Nov 10, 2015 at 6:09 AM, Martin Desruisseaux
><[email protected]> wrote:
>> Hello all,
>>
>> At the Apache Big Data Conference in Budapest, I attended some
>> meetings about exploiting geospatial big data using the SQL language.
>> I thought that we could make some long-term plans that could impact
>> the SIS-180 ("Place a crude JDBC driver over DBase files") work [1].
>> This email is not a request for any change now; it is just a proposal
>> about some possible long-term plans.
>>
>> In one or two years, Apache SIS will hopefully have some DataStore
>> implementations ready for production use. But we have a strong
>> request for the capability to use DataStores with big-data
>> technologies like Hadoop. This request increases the challenge of
>> writing a SQL driver, since a sophisticated SQL driver would need to
>> be able to restructure query plans according to the available
>> clusters.
>>
>> I had a discussion with people from the Apache Drill project
>> (https://drill.apache.org/), which already provides SQL parsing
>> capabilities in various big-data environments. In my understanding,
>> instead of writing our own SQL parser in SIS, we could take the
>> following approach:
>>
>> 1. Complete the org.apache.sis.storage.DataStore API (it is
>> currently very minimalist).
>> 2. Have the Shapefile store extend the abstract SIS DataStore.
>> 3. In a separate module, write a "SIS DataStore to Drill DataStore"
>> adapter. It should work for any SIS DataStore, not only the
>> Shapefile one.
>>
>> In my understanding, once we have a Drill DataStore implementation
>> (I do not know yet what the exact name is in the Drill API), we
>> should automatically get big-data-ready SQL for any SIS DataStore.
>> If for any reason the Drill DataStore is considered not suitable, we
>> could fall back on Apache Calcite (http://calcite.apache.org/),
>> which is the SQL parser used under the hood by Drill. Another
>> project that may be worth exploring is Magellan: Geospatial
>> Analytics on Spark [2].
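[Editor's note: the adapter idea in step 3 could be sketched very roughly as below. All names here (SimpleStore, StoreToSqlAdapter) are invented for illustration; they are not the real SIS DataStore or Drill APIs, only a sketch of the shape such an adapter might take: the store exposes columns and rows, and the adapter presents them the way a SQL engine's scan operator would consume them.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a "store to SQL engine" adapter.
// SimpleStore stands in for a SIS DataStore; StoreToSqlAdapter stands in
// for the Drill/Calcite-facing side. Neither name is a real API.
public class StoreToSqlAdapter {

    /** Minimal stand-in for a DataStore: named columns plus rows. */
    interface SimpleStore {
        List<String> columns();
        Iterator<Object[]> rows();
    }

    private final SimpleStore store;

    public StoreToSqlAdapter(SimpleStore store) {
        this.store = store;
    }

    /** Projects one named column, as a SQL engine's scan operator might. */
    public List<Object> scanColumn(String column) {
        int index = store.columns().indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        List<Object> values = new ArrayList<>();
        for (Iterator<Object[]> it = store.rows(); it.hasNext();) {
            values.add(it.next()[index]);
        }
        return values;
    }

    public static void main(String[] args) {
        // A toy store with two columns, standing in for a Shapefile's DBF part.
        SimpleStore store = new SimpleStore() {
            public List<String> columns() { return Arrays.asList("NAME", "POP"); }
            public Iterator<Object[]> rows() {
                return Arrays.<Object[]>asList(
                        new Object[] {"Paris", 2_148_000},
                        new Object[] {"Lyon",    516_000}).iterator();
            }
        };
        StoreToSqlAdapter adapter = new StoreToSqlAdapter(store);
        System.out.println(adapter.scanColumn("NAME"));  // [Paris, Lyon]
    }
}
```

The point of the sketch is the one made in the email: the adapter only depends on the abstract store interface, so it works for any store implementation, not only the Shapefile one.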
>>
>> My proposal could be summarized as follows: maybe in 2016 or 2017,
>> we could consider putting the SIS SQL support in its own module and
>> allow it to run not only for the Shapefile store but for any SIS
>> DataStore, if possible using a technology like Drill that is
>> designed for big-data environments.
>>
>> Any thoughts?
>>
>> Martin
>>
>>
>> [1] https://issues.apache.org/jira/browse/SIS-180
>> [2] https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
>>
>>
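[Editor's note: the "logical delete" Marc mentions in point 5 is directly supported by the dBASE III format: every record starts with a one-byte deletion flag, 0x20 (space) for an active record and 0x2A ('*') for a deleted one, so DELETE can flip that byte in place instead of rewriting the file. The sketch below demonstrates the flag on an in-memory record area; the record layout is a toy example, not the actual SIS driver code.]

```java
import java.nio.charset.StandardCharsets;

// Logical delete in a dBASE III record area: records are fixed-size and
// each begins with a deletion-flag byte (' ' = active, '*' = deleted).
public class DbfLogicalDelete {

    /** Marks the record at recordIndex as deleted by flipping its flag byte. */
    static void deleteRecord(byte[] records, int recordLength, int recordIndex) {
        records[recordIndex * recordLength] = (byte) '*';
    }

    static boolean isDeleted(byte[] records, int recordLength, int recordIndex) {
        return records[recordIndex * recordLength] == (byte) '*';
    }

    public static void main(String[] args) {
        // Two 6-byte records: one flag byte followed by 5 bytes of data.
        byte[] records = " Paris Lyon ".getBytes(StandardCharsets.US_ASCII);
        int recordLength = 6;

        deleteRecord(records, recordLength, 1);  // logically delete "Lyon"

        System.out.println(isDeleted(records, recordLength, 0));  // false
        System.out.println(isDeleted(records, recordLength, 1));  // true
    }
}
```

As the thread notes, this only solves half the problem: any .shp/.shx entries and index files associated with the record still have to be kept consistent separately.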

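[Editor's note: the "clause by clause" approach Marc weighs against ANTLR in point 3 can be illustrated for the one case the driver already handles: a WHERE clause limited to a single simple condition. The sketch below is illustrative only, with invented names and a record reduced to a field-to-number map; it is not the actual SIS driver code.]

```java
import java.util.Map;

// Hand-rolled evaluation of a simple WHERE condition of the form
// "<field> <operator> <number>", e.g. "POP > 1000000".
public class SimpleWhere {

    /** Evaluates one simple condition against one record. */
    static boolean matches(String condition, Map<String, Double> record) {
        String[] parts = condition.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a simple condition: " + condition);
        }
        double field = record.get(parts[0]);
        double value = Double.parseDouble(parts[2]);
        switch (parts[1]) {
            case "=":  return field == value;
            case "<>": return field != value;
            case "<":  return field <  value;
            case ">":  return field >  value;
            case "<=": return field <= value;
            case ">=": return field >= value;
            default: throw new IllegalArgumentException("Unsupported operator: " + parts[1]);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> record = Map.of("POP", 2_148_000.0);
        System.out.println(matches("POP > 1000000", record));  // true
        System.out.println(matches("POP <= 500000", record));  // false
    }
}
```

This illustrates Marc's trade-off: each additional clause (GROUP BY, HAVING, LIKE, boolean combinations...) would need its own hand-written detection and evaluation logic, which is why a generated parser like ANTLR, despite its learning curve, looks like the only scalable option.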