On Thursday 20 March 2014 05:02:33, Zhang, Shuai wrote:
Hi All,
I think I need a mentor to work with me and help me add MongoDB support
to GDAL. Below is the proposal I wrote; hopefully you find it worth a
trial.
This is something I could potentially mentor, but there are already 2 students
interested in other subjects. I'm not sure how many will eventually get
selected by the GSoC program, but I won't be able to mentor 3 people, for
sure!
Thanks,
shuai
Title: OGR Driver for MongoDB
Short description:
MongoDB, a document database that provides high performance, high
availability, and easy scalability, can be a good platform for storing
extremely large spatial datasets and for supporting high-performance
geo-computation and real-time spatial analysis at large scale. This
project aims to develop an OGR driver for MongoDB so that applications
and software based on GDAL, such as QGIS, GeoServer, MapServer, and so on,
can read and write the spatial data in it, and thus let the open source
GIS ecosystem be powered by this advanced NoSQL database.
Describe your idea
1. Introduction
MongoDB, a document database that provides high performance, high
availability, and easy scalability, can be a good platform for storing
extremely large spatial datasets and for supporting high-performance
geo-computation and real-time spatial analysis at large scale. Yet the
GIS field has so far paid little attention to making the most of its
strengths. This project aims to develop an OGR driver for MongoDB so that
applications and software based on GDAL can read and write the spatial
data in it, and thus let the open source GIS ecosystem be powered by this
advanced NoSQL database.
2. Background
Since we are living in the era of big data, today's tools and equipment
for capturing spatial data, at both the mega-scale and the milli-scale,
produce data at a daunting rate. The magnitude of this data volume is well
beyond the capability of any mainstream geographic information system.
Yet we in the GIS field have no off-the-shelf solution for managing such
massive spatial data. Relational spatial databases have been in charge
for decades, but now the situation seems a little different.
A shift in computing patterns can be seen throughout the IT industry in
recent years, and GIS is no exception. In particular, data analytics may
not be achievable within a reasonable amount of time without resorting to
high-performance computing strategies. However, relational spatial
databases are often too slow to support these high-performance computing
scenarios, and often lack the flexible scalability needed to handle a
growing amount of work.
Fortunately, there are several groups trying to address the problem, and
MongoDB is an apparent leader in this direction. MongoDB, which has native
support for maintaining geospatial data using a document-oriented model,
ranks fifth in the DB-Engines Ranking of database management systems by
popularity and is the highest-rated non-relational system. Since version
2.4 (released on March 19, 2013), MongoDB has supported a subset of
GeoJSON geometries, including basic shapes such as points, linestrings
and polygons.
Good to know. Last time I looked, MongoDB only supported point
geometries.
In addition, quite a number of
partners working in big data, NoSQL, cloud, mobile and high-performance
computing have joined the MongoDB ecosystem. Foursquare is a featured one
of them; it benefits from MongoDB's support for geospatial indexing, which
allows it to easily query large location-based datasets.
3. The idea
MongoDB uses GeoJSON to store spatial data, and GDAL already supports
access to features encoded in the GeoJSON format, so that code can be
reused.
As far as I remember, the interface with MongoDB is (was?) a kind of binary
JSON format. Has this changed?
This project will implement a MongoDB driver following
the OGR format driver interfaces, with subclasses of OGRSFDriver,
OGRDataSource and OGRLayer that are registered with the
OGRSFDriverRegistrar at runtime, so that GDAL can use MongoDB as a
datasource to access large-scale spatial data.
4. Project plan (detailed timeline: how do you plan to spend your summer?)
The first item on the list is to design the internal structure of the
MongoDB spatial database. The OGR data model has Datasource, Layer and
Feature, so accordingly every database in MongoDB is regarded as a
Datasource, the Collections within it are treated as Layers, and every
Document as a Feature.
Yes, sounds a bit similar to what was done with CouchDB
PostGIS and other spatial databases often
use system tables to maintain the metadata, but since MongoDB is
schema-free, metadata such as the spatial reference can be stored within
the particular Layer, in this case a Collection.
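As a rough illustration of this mapping, the sketch below models a database as a dictionary of collections and keeps the layer metadata in one reserved document per collection. The `_ogr_metadata` identifier and the `srs` and `geometry_field` keys are hypothetical names chosen for the example; they are not a MongoDB or OGR convention, and plain Python dictionaries stand in for real MongoDB documents.

```python
# Hypothetical id of the per-collection metadata document (an assumption,
# not something defined by MongoDB or OGR).
METADATA_ID = "_ogr_metadata"


def split_collection(documents):
    """Separate the metadata document from the feature documents of one layer."""
    metadata = {}
    features = []
    for doc in documents:
        if doc.get("_id") == METADATA_ID:
            metadata = doc
        else:
            features.append(doc)
    return metadata, features


# database -> Datasource, collection -> Layer, document -> Feature
database = {
    "roads": [
        {"_id": METADATA_ID, "srs": "EPSG:4326", "geometry_field": "geometry"},
        {"_id": 1, "name": "A1",
         "geometry": {"type": "LineString", "coordinates": [[0, 0], [1, 1]]}},
        {"_id": 2, "name": "B7",
         "geometry": {"type": "LineString", "coordinates": [[1, 1], [2, 0]]}},
    ]
}

for layer_name, documents in database.items():
    metadata, features = split_collection(documents)
    print(layer_name, metadata.get("srs"), len(features))
```

Keeping the metadata inside the collection avoids a separate system table, at the cost of having to skip that document whenever features are read.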
The most important part of a data format driver is to define how the
driver reads and writes the data format, especially the Open and
Create methods of the Datasource class. As MongoDB organizes its spatial
data in the GeoJSON model, the GeoJSON driver already supported by current
GDAL can be reused to encode or decode the GeoJSON fetched from the
MongoDB database. Therefore, there would be four files in total to
implement: ogr_mongo.h, ogrmongodriver.cpp, ogrmongodatasource.cpp, and
ogrmongolayer.cpp.
The write part should be no problem: a NoSQL database can receive documents
with a fixed structure.
The read part will need to explore all the documents/features to retrieve
their structure and build an OGR FeatureDefinition. This is done in the CouchDB
driver.
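A minimal sketch of that read-side scan, assuming documents arrive as plain Python dictionaries: every document is visited, and each attribute is assigned a type that is widened when documents disagree. The function names, the `geometry` key and the three type labels are illustrative choices, not the CouchDB driver's actual logic.

```python
def type_name(value):
    """Map a Python value to a simplified field type label."""
    if isinstance(value, bool):
        return "Integer"  # assumption: booleans are stored as integers
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Real"
    return "String"


def infer_fields(documents, geometry_key="geometry"):
    """Scan every document and return {field name: widened type label}."""
    fields = {}
    for doc in documents:
        for key, value in doc.items():
            # Skip the id, the geometry, and nulls (which carry no type).
            if key in ("_id", geometry_key) or value is None:
                continue
            seen = type_name(value)
            previous = fields.get(key)
            if previous is None or previous == seen:
                fields[key] = seen
            elif {previous, seen} == {"Integer", "Real"}:
                fields[key] = "Real"    # widen Integer to Real
            else:
                fields[key] = "String"  # fall back to String on any conflict
    return fields


documents = [
    {"_id": 1, "name": "A1", "lanes": 2,
     "geometry": {"type": "Point", "coordinates": [0, 0]}},
    {"_id": 2, "name": "B7", "lanes": 2.5, "toll": "yes"},
]
print(infer_fields(documents))
```

A full scan is expensive on large collections, so a real driver might sample a limited number of documents or cache the result; that trade-off is left open here.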
Test Plan
[1] After the MongoDB driver is compiled into the OGR framework, the
utility ogr2ogr can be used as the test program to import and export
spatial data between shapefiles and MongoDB.
[2] Conduct a parallel transformation process to measure how fast the
MongoDB driver is compared with the file system and PostGIS.
Time Line
May 19 - June 8 (Coding - Phase 1 - 3 weeks)
Prepare the development environment and bring GDAL and the MongoDB C++
driver together. Implement OGRMongoDriver, OGRMongoDataSource and
OGRMongoLayer according to the interfaces defined by OGRSFDriver,
OGRDataSource and OGRLayer.
June 9 - June 23 (Coding - Phase 2 - 2 weeks)
Build the MongoDB driver into the OGR framework, first supporting the
exchange of small spatial datasets with MongoDB, while fixing bugs.
June 24 - July 13 (Coding - Phase 3 - 3 weeks)
Pass the query string (a JSON-style document) for both spatial and
attribute data into MongoDB to select features as requested. Compile all
the code, conduct several tests, fix bugs and make it faster.
July 14 - July 27 (Testing - Phase 1 - 2 weeks)
Transfer large-scale spatial data to and from MongoDB using ogr2ogr to
assess the driver's efficiency. Improve its efficiency and fix bugs.
July 28 - August 10 (Testing - Phase 2 - 2 weeks)
Conduct a parallel transformation experiment to measure how fast the
MongoDB driver is compared with the file system and PostGIS, and fix bugs.
August 11 - August 18 (pencils down)
Write code documentation, including doxygen comments and
techbase/userbase articles.
You could mention adding support for spatial filtering.
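For illustration, a spatial filter could be translated into a MongoDB query document along these lines. The sketch assumes the geometry is stored under a field named `geometry` and uses the `$geoIntersects` operator with a GeoJSON polygon, which MongoDB 2.4 provides; the helper names are made up for the example, and only the query document is built, without contacting a server.

```python
import json


def bbox_to_intersects_query(field, min_x, min_y, max_x, max_y):
    """Build a query selecting documents whose geometry intersects a rectangle."""
    # GeoJSON rings are closed: the first and last positions are identical.
    ring = [
        [min_x, min_y],
        [max_x, min_y],
        [max_x, max_y],
        [min_x, max_y],
        [min_x, min_y],
    ]
    polygon = {"type": "Polygon", "coordinates": [ring]}
    return {field: {"$geoIntersects": {"$geometry": polygon}}}


def add_attribute_filter(spatial_query, attributes):
    """Combine the spatial condition with simple equality conditions."""
    query = dict(spatial_query)
    query.update(attributes)
    return query


query = add_attribute_filter(
    bbox_to_intersects_query("geometry", 118.0, 31.0, 119.0, 32.0),
    {"name": "A1"},
)
print(json.dumps(query, indent=2))
```

An intersection test is used because it is closer to how an OGR rectangle filter selects features than a strict containment test would be.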
5. Future ideas / How can your idea be expanded?
MongoDB is also an ideal platform for storing massive geo-raster data, so
the next job would be to write a MongoDB driver for raster datasets.
Hmm, I'm not sure if MongoDB is aimed at this... You would probably have to
tile the raster to avoid sending/retrieving huge blobs at once.
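To make the tiling idea concrete, here is a small sketch that splits a raster of a given size into fixed-size windows, each of which could be stored as its own document. The 256-pixel tile size and the function name are arbitrary assumptions; reading pixels and writing to MongoDB are left out.

```python
def tile_windows(width, height, tile_size=256):
    """Yield (x_offset, y_offset, tile_width, tile_height) covering the raster."""
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            # Tiles on the right and bottom edges may be smaller than tile_size.
            yield (x, y, min(tile_size, width - x), min(tile_size, height - y))


windows = list(tile_windows(600, 300))
print(len(windows), windows[-1])
```

Each window is small enough to be fetched on its own, so a reader that needs only part of the raster does not have to retrieve the whole blob.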
Explain how your SoC task would benefit the OSGeo member project and more
generally the OSGeo Foundation as a whole:
MongoDB can be a distributed and parallel NoSQL spatial database with
high performance, high availability, and easy scalability, and is thus
extremely suitable for large-scale data-intensive computing. By
implementing the MongoDB driver in the OGR framework, the whole OSGeo
ecosystem based on GDAL/OGR will benefit from it and be powered by
MongoDB.
Please provide details of general computing experience: (operating systems
you use on a day-to-day basis, languages you could write a program in,
hardware, networking experience, etc.)
During my college time, I mainly used .NET languages such as C# and
VB.NET to build GIS software running on the Windows platform. Since then,
and since starting my PhD program, most of my work has been done in
standard C++ in a Linux environment.
Please provide details of previous GIS experience:
I have been a GIS student ever since I entered college. Right now I'm a
Ph.D. candidate in Cartography and Geographic Information System, School of
Geographic and Oceanographic Sciences, Nanjing University, China, and a
visiting scholar at Geography & GIScience and NCSA (The National Center
for Supercomputing Applications), UIUC, IL, USA.
Please provide details of any previous involvement with GIS programming and
other software programming:
[1] Climate Information Management System of Shanxi Province: Outstanding
Award in ESRI Chinese College Student Software Development Contest, 2009.
[2] Forest Fire Simulation Model based on Geographic Cellular Automata:
Third Prize in ESRI Chinese College Student Software Development Contest,
2009.
[3] High Performance Geospatial Computing System: HiGIS (2011-2013),
supported by the National High Technology Research and Development Program
of China (863 project), in progress.
[4] NoSQL Expression of Massive Geospatial Information in the Era of Big
Data (2013-2015), supported by the Scientific Research Foundation of
Graduate School of Nanjing University, in progress.
Please tell us why you are interested in GIS and open source software:
They are powerful and beautiful treasures of humankind, and I want to be
part of it.
Please tell us why you are interested in working for OSGeo and the software
project you have selected:
It's part of my research, since I am trying to harness MongoDB to support
high-performance geo-computing.
Please tell us why you are interested in your specific coding project:
I have spent a lot of time in the past three years learning how GDAL works
and how to use it in high-performance computing applications. So I
believe a new GDAL with MongoDB support will greatly benefit my current
research.
Would your application contribute to your ongoing studies/ degree? If so,
how?
Yes. A MongoDB cluster is a good way to handle large quantities of
spatial data, and if OGR provides a MongoDB driver, many of the tools we
have developed on top of GDAL can be reused and, powered by MongoDB, run
much faster.
Please explain how you intend to continue being an active member of your
project and/or OSGeo AFTER the summer is over:
I'll try my best to keep following this thread to make the MongoDB driver
stable and efficient.
Do you understand this is a serious commitment, equivalent to a full-time
paid summer internship or summer job? Yes, I understand. I’ll give my
best.
Do you have any known time conflicts during the official coding period?
(May 19 to August 19) No, I don't.