[GitHub] lewismc closed pull request #2: SDAP-63 Submit MUDROD documentation to SDAP Website

GitBox Mon, 23 Apr 2018 10:39:11 -0700

lewismc closed pull request #2: SDAP-63 Submit MUDROD documentation to SDAP 
Website
URL: https://github.com/apache/incubator-sdap-website/pull/2


This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/blog.html b/blog.html
index 50d35a9..7b13316 100644
--- a/blog.html
+++ b/blog.html
@@ -56,6 +56,127 @@ <h1>Blog</h1>
 
 
 
+<a 
href="/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html"><h2>An 
introduction to MUDROD vocabulary similarity calculation algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>Big geospatial data have been produced, archived and made available online, 
but finding the right data for scientific research and decision-support 
applications remains a significant challenge. A long-standing problem in data 
discovery is how to locate, assimilate and utilize the semantic context for a 
given query. Most past research in the geospatial domain attempts to solve this 
problem through two approaches: 1) building a domain-specific ontology 
manually; 2) discovering semantic relationships from dataset metadata 
automatically using machine learning techniques. The former contains rich 
expert knowledge, but it is static, costly, and labour intensive, while the 
latter is automatic but prone to noise.</p>
+
+<p>An emerging trend in information science is to take advantage of 
large-scale user search history, which is dynamic but contains user- and 
crawler-generated noise. To leverage the benefits of all three approaches while 
avoiding their weaknesses, this article proposes a novel approach that 1) 
discovers vocabulary semantic relationships from user clickstream; 2) refines 
the similarity calculation methods from the existing ontology; 3) integrates 
the results of ontology, metadata, user search history and clickstream analysis 
to better determine the semantic relationships.</p>
+
+<center>
+       <img src="/images/vocabulary.png" />
+       Figure 1. System workflow and architecture
+</center>
+
+<p>The system starts by pre-processing raw web logs, metadata, and ontology 
(Figure 1). After the pre-processing step, search history and clickstream data 
are extracted from raw logs, selected properties are extracted from metadata, 
and ocean-related triples are extracted from the SWEET ontology. These four 
types of processed data are then passed to their corresponding processors as 
discussed in the last section. Once all the processors finish their jobs, the 
results of the different methods are integrated to produce a final list of the 
most related terms.</p>
+
+
+
+
+<a href="/weekly/update/2018/04/23/recommendation-algorithms.html"><h2>An 
introduction to MUDROD recommendation algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>With the recent advances in remote sensing satellites and other sensors, 
geographic datasets have been growing faster than ever. In response, a number 
of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) 
have been developed to archive those datasets and make them available online. 
However, finding the right data for scientific research and application 
development is still a challenge due to the lack of data relevancy 
information.</p>
+
+<p>Recommendation has become extremely common in recent years and is used in 
a variety of areas to help users quickly find useful information. We propose a 
recommendation system to improve geographic data discovery by mining and 
utilizing metadata and usage logs. Metadata abstracts are processed with 
natural language processing methods to find semantic relationships between 
metadata. Metadata variables are used to calculate spatial and temporal 
similarity between metadata. In addition, portal logs are analysed to introduce 
user preference.</p>
+
+<center>
+       <img src="/images/recommendation.png" />
+       Figure 1. Recommendation workflow
+</center>
+
+<p>The system starts by pre-processing raw web logs and metadata (Figure 1). 
After the pre-processing step, sessions are reconstructed from raw web logs and 
then used to calculate session-based metadata similarity. Metadata are 
harvested from PO.DAAC web service APIs. Metadata variable values are then 
converted to a unified unit to calculate metadata content similarity. All of 
these similarities are calculated offline and stored in Elasticsearch. Once a 
user views a metadata record, the system finds the top-k related metadata 
records with hybrid recommendation. The hybrid recommendation module integrates 
results from content-based and session-based recommendation methods and ranks 
the final recommendation list in descending order of similarity.</p>
+
+
+
+
+<a href="/weekly/update/2018/04/23/ranking-algorithms.html"><h2>An 
introduction to MUDROD ranking algorithm</h2></a>
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+<p>When a user types some keywords into a search engine, there are typically 
hundreds, or even thousands, of datasets related to the given query. Although a 
high level of recall can be useful in some cases, the user is only interested 
in a much smaller subset. Current search engines in most geospatial data 
portals tend to induce end users to focus on one single data 
characteristic/feature dimension (e.g., spatial resolution), which often 
results in a less than optimal user experience (Ghose, Ipeirotis, and Li 
2012).</p>
+
+<p>To overcome this fundamental ranking problem, we 1) identify a number of 
ranking features of geospatial data to represent users’ multidimensional 
preferences by considering semantics, user behaviour, spatial similarity, and 
static dataset metadata attributes; 2) apply a machine learning method to 
automatically learn, from a training set, a function capable of ranking 
geospatial data according to the ranking features.</p>
+
+<p>Within the ranking process, each query is associated with a set of 
datasets, and each dataset is represented as a feature vector. The eleven 
features listed below are identified by considering user behaviour, query-text 
match, and common geospatial metadata attributes.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-dependent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Lucene relevance score</td>
+    </tr>
+    <tr>
+      <td>Semantic popularity</td>
+    </tr>
+    <tr>
+      <td>Spatial Similarity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-independent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Release date</td>
+    </tr>
+    <tr>
+      <td>Processing level</td>
+    </tr>
+    <tr>
+      <td>Version number</td>
+    </tr>
+    <tr>
+      <td>Spatial resolution</td>
+    </tr>
+    <tr>
+      <td>Temporal resolution</td>
+    </tr>
+    <tr>
+      <td>All-time popularity</td>
+    </tr>
+    <tr>
+      <td>Monthly popularity</td>
+    </tr>
+    <tr>
+      <td>User popularity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<p>RankSVM, a well-recognized learning-to-rank approach, is selected to learn 
feature weights to rank search results. In RankSVM (Joachims 2002), ranking is 
transformed into a pairwise classification task in which a classifier is 
trained to predict the ranking order of data pairs.</p>
+
+<center>
+       <img src="/images/ranking.png" />
+       Figure 1. System workflow and architecture
+</center>
+
+<p>The proposed architecture consists of six components: a semantic knowledge 
base, a geocoding service, a search index, a feature extractor, a learning 
algorithm, and a ranking model (Figure 1). When a user submits a query, it is 
converted into a semantic query and a geographical bounding box by the 
semantic knowledge base and geocoding service. The search index then returns 
the top K results for the semantic query combined with the bounding box. After 
that, the feature extractor extracts the ranking features for each of the 
search results, including the semantic click score. Once all the features are 
prepared, the top K results are passed to a pre-trained ranking model, which 
re-ranks them. As the index in this architecture can be any Lucene-based 
software, the architecture keeps the data portal loosely coupled and avoids the 
cost of replacing the existing system.</p>
+
+<p>References:</p>
+<ul>
+  <li>
+    <p>Ghose, Anindya, Panagiotis G Ipeirotis, and Beibei Li. 2012. “Designing 
ranking systems for hotels on travel search engines by mining user-generated 
and crowdsourced content.”  Marketing Science 31 (3):493-520.</p>
+  </li>
+  <li>
+    <p>Joachims, Thorsten. 2002. Optimizing search engines using clickthrough 
data. Paper presented at the Proceedings of the eighth ACM SIGKDD international 
conference on Knowledge discovery and data mining.</p>
+  </li>
+</ul>
+
+
+
+
       <!-- footer -->
       <nav class="navbar navbar-default">
         <div class="navbar-header">
diff --git a/images/architecture.jpg b/images/architecture.jpg
new file mode 100644
index 0000000..6d14285
Binary files /dev/null and b/images/architecture.jpg differ
diff --git a/images/cover.jpg b/images/cover.jpg
new file mode 100644
index 0000000..aa39689
Binary files /dev/null and b/images/cover.jpg differ
diff --git a/images/ranking.png b/images/ranking.png
new file mode 100644
index 0000000..17dd504
Binary files /dev/null and b/images/ranking.png differ
diff --git a/images/recommendation.png b/images/recommendation.png
new file mode 100644
index 0000000..2c19d2e
Binary files /dev/null and b/images/recommendation.png differ
diff --git a/images/vocabulary.png b/images/vocabulary.png
new file mode 100644
index 0000000..9f7dda8
Binary files /dev/null and b/images/vocabulary.png differ
diff --git a/source/_posts/2018-04-23-ranking-algorithms.markdown 
b/source/_posts/2018-04-23-ranking-algorithms.markdown
new file mode 100644
index 0000000..7d0ff25
--- /dev/null
+++ b/source/_posts/2018-04-23-ranking-algorithms.markdown
@@ -0,0 +1,46 @@
+---
+layout: post
+title:  "An introduction to MUDROD ranking algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+When a user types some keywords into a search engine, there are typically 
hundreds, or even thousands, of datasets related to the given query. Although a 
high level of recall can be useful in some cases, the user is only interested 
in a much smaller subset. Current search engines in most geospatial data 
portals tend to induce end users to focus on one single data 
characteristic/feature dimension (e.g., spatial resolution), which often 
results in a less than optimal user experience (Ghose, Ipeirotis, and Li 2012). 
+
+To overcome this fundamental ranking problem, we 1) identify a number of 
ranking features of geospatial data to represent users’ multidimensional 
preferences by considering semantics, user behaviour, spatial similarity, and 
static dataset metadata attributes; 2) apply a machine learning method to 
automatically learn, from a training set, a function capable of ranking 
geospatial data according to the ranking features.
+
+Within the ranking process, each query is associated with a set of datasets, 
and each dataset is represented as a feature vector. The eleven features listed 
below are identified by considering user behaviour, query-text match, and 
common geospatial metadata attributes.
+
+  | Query-dependent features        | 
+    | --------   | 
+    | Lucene relevance score        | 
+    | Semantic popularity        |
+    | Spatial Similarity        | 
+  |         |
+  
+  | Query-independent features        | 
+       | --------   | 
+       | Release date        | 
+    | Processing level        | 
+    | Version number        | 
+    | Spatial resolution        | 
+    | Temporal resolution        |
+    | All-time popularity        | 
+    | Monthly popularity        | 
+    | User popularity        | 
+  |         |
+       
+       
+RankSVM, a well-recognized learning-to-rank approach, is selected to learn 
feature weights to rank search results. In RankSVM (Joachims 2002), ranking is 
transformed into a pairwise classification task in which a classifier is 
trained to predict the ranking order of data pairs.
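
The pairwise idea can be illustrated with a minimal sketch. It is not the MUDROD implementation: the function names, the two toy features, and the hinge-loss subgradient training loop (standing in for a real SVM solver) are all assumptions made for illustration.

```python
import random


def pairwise_transform(features, relevance):
    """Turn one query's results into (difference vector, label) pairs."""
    pairs = []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if relevance[i] == relevance[j]:
                continue  # ties carry no ordering information
            diff = [a - b for a, b in zip(features[i], features[j])]
            label = 1 if relevance[i] > relevance[j] else -1
            pairs.append((diff, label))
    return pairs


def train_linear_ranker(pairs, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Hinge-loss subgradient descent on difference vectors.

    This is a simple stand-in for an SVM solver, not Joachims' algorithm.
    """
    rng = random.Random(seed)
    dim = len(pairs[0][0])
    w = [0.0] * dim
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)
        for diff, label in pairs:
            margin = label * sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # Regularization always applies; the data term only
                # applies when the pair violates the margin.
                grad = reg * w[k] - (label * diff[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
    return w


def rank(features, w):
    """Return result indices sorted by descending learned score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: [text relevance score, popularity] per dataset.
features = [[0.9, 0.8], [0.5, 0.4], [0.1, 0.2]]
relevance = [2, 1, 0]  # higher means the dataset should rank earlier
w = train_linear_ranker(pairwise_transform(features, relevance))
order = rank(features, w)
```

The learned weight vector plays the role of the feature weights mentioned above: scoring a dataset is a dot product, so re-ranking a result list stays cheap at query time.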
+
+<center>
+       <img src="/images/ranking.png">
+       Figure 1. System workflow and architecture
+</center>
+
+The proposed architecture consists of six components: a semantic knowledge 
base, a geocoding service, a search index, a feature extractor, a learning 
algorithm, and a ranking model (Figure 1). When a user submits a query, it is 
converted into a semantic query and a geographical bounding box by the 
semantic knowledge base and geocoding service. The search index then returns 
the top K results for the semantic query combined with the bounding box. After 
that, the feature extractor extracts the ranking features for each of the 
search results, including the semantic click score. Once all the features are 
prepared, the top K results are passed to a pre-trained ranking model, which 
re-ranks them. As the index in this architecture can be any Lucene-based 
software, the architecture keeps the data portal loosely coupled and avoids the 
cost of replacing the existing system.
+
+References:
+* Ghose, Anindya, Panagiotis G Ipeirotis, and Beibei Li. 2012. "Designing 
ranking systems for hotels on travel search engines by mining user-generated 
and crowdsourced content."  Marketing Science 31 (3):493-520.
+
+* Joachims, Thorsten. 2002. Optimizing search engines using clickthrough data. 
Paper presented at the Proceedings of the eighth ACM SIGKDD international 
conference on Knowledge discovery and data mining. 
diff --git a/source/_posts/2018-04-23-recommendation-algorithms.markdown 
b/source/_posts/2018-04-23-recommendation-algorithms.markdown
new file mode 100644
index 0000000..da0713c
--- /dev/null
+++ b/source/_posts/2018-04-23-recommendation-algorithms.markdown
@@ -0,0 +1,18 @@
+---
+layout: post
+title:  "An introduction to MUDROD recommendation algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+With the recent advances in remote sensing satellites and other sensors, 
geographic datasets have been growing faster than ever. In response, a number 
of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) 
have been developed to archive those datasets and make them available online. 
However, finding the right data for scientific research and application 
development is still a challenge due to the lack of data relevancy information. 
+
+Recommendation has become extremely common in recent years and is used in a 
variety of areas to help users quickly find useful information. We propose a 
recommendation system to improve geographic data discovery by mining and 
utilizing metadata and usage logs. Metadata abstracts are processed with 
natural language processing methods to find semantic relationships between 
metadata. Metadata variables are used to calculate spatial and temporal 
similarity between metadata. In addition, portal logs are analysed to introduce 
user preference. 
+
+<center>
+       <img src="/images/recommendation.png">
+       Figure 1. Recommendation workflow
+</center>
+
+
+The system starts by pre-processing raw web logs and metadata (Figure 1). 
After the pre-processing step, sessions are reconstructed from raw web logs and 
then used to calculate session-based metadata similarity. Metadata are 
harvested from PO.DAAC web service APIs. Metadata variable values are then 
converted to a unified unit to calculate metadata content similarity. All of 
these similarities are calculated offline and stored in Elasticsearch. Once a 
user views a metadata record, the system finds the top-k related metadata 
records with hybrid recommendation. The hybrid recommendation module integrates 
results from content-based and session-based recommendation methods and ranks 
the final recommendation list in descending order of similarity.
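
The hybrid step can be sketched as follows. This is only an illustration under stated assumptions: the function name, the equal weighting of the two methods, and the dataset identifiers are hypothetical, and plain dictionaries stand in for the precomputed similarities that the post says are stored in Elasticsearch.

```python
def hybrid_recommend(content_sim, session_sim, k=3, content_weight=0.5):
    """Blend two similarity maps for one viewed dataset; return the top-k.

    content_sim and session_sim map a candidate dataset id to its
    similarity with the dataset currently being viewed. The linear blend
    and its weight are assumptions, not the MUDROD formula.
    """
    candidates = set(content_sim) | set(session_sim)
    blended = {
        c: content_weight * content_sim.get(c, 0.0)
        + (1.0 - content_weight) * session_sim.get(c, 0.0)
        for c in candidates
    }
    # Descending similarity; ties are broken by id for a stable order.
    ranked = sorted(blended.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical precomputed similarities for one viewed dataset.
content_sim = {"dataset_a": 0.9, "dataset_b": 0.4, "dataset_c": 0.2}
session_sim = {"dataset_b": 0.8, "dataset_c": 0.1, "dataset_d": 0.6}
top = hybrid_recommend(content_sim, session_sim, k=3)
```

A candidate that appears in only one of the two maps is treated as having zero similarity in the other, which is one simple way to let session-based evidence surface datasets that content similarity alone would miss.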
diff --git a/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown 
b/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown
new file mode 100644
index 0000000..edd7d36
--- /dev/null
+++ b/source/_posts/2018-04-23-vocabulary-similarity-algorithm.markdown
@@ -0,0 +1,18 @@
+---
+layout: post
+title:  "An introduction to MUDROD vocabulary similarity calculation algorithm"
+categories: weekly update
+author: Lewis John McGibbney
+---
+
+Big geospatial data have been produced, archived and made available online, 
but finding the right data for scientific research and decision-support 
applications remains a significant challenge. A long-standing problem in data 
discovery is how to locate, assimilate and utilize the semantic context for a 
given query. Most past research in the geospatial domain attempts to solve this 
problem through two approaches: 1) building a domain-specific ontology 
manually; 2) discovering semantic relationships from dataset metadata 
automatically using machine learning techniques. The former contains rich 
expert knowledge, but it is static, costly, and labour intensive, while the 
latter is automatic but prone to noise. 
+
+An emerging trend in information science is to take advantage of large-scale 
user search history, which is dynamic but contains user- and crawler-generated 
noise. To leverage the benefits of all three approaches while avoiding their 
weaknesses, this article proposes a novel approach that 1) discovers vocabulary 
semantic relationships from user clickstream; 2) refines the similarity 
calculation methods from the existing ontology; 3) integrates the results of 
ontology, metadata, user search history and clickstream analysis to better 
determine the semantic relationships. 
+
+<center>
+       <img src="/images/vocabulary.png">
+       Figure 1. System workflow and architecture
+</center>
+
+
+The system starts by pre-processing raw web logs, metadata, and ontology 
(Figure 1). After the pre-processing step, search history and clickstream data 
are extracted from raw logs, selected properties are extracted from metadata, 
and ocean-related triples are extracted from the SWEET ontology. These four 
types of processed data are then passed to their corresponding processors as 
discussed in the last section. Once all the processors finish their jobs, the 
results of the different methods are integrated to produce a final list of the 
most related terms.
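
The final integration step might look like the sketch below. The post does not give the integration formula, so the weighted average used here is an assumption, as are the function name, the source names, the equal weights, and the example terms and scores.

```python
def integrate_similarities(per_source, weights, top_n=3):
    """Merge per-source term similarities into one ranked list.

    per_source maps a source name to {related term: similarity to the
    query term}. A term missing from a source contributes zero for that
    source. The weighted average is an assumed integration rule.
    """
    totals = {}
    for source, sims in per_source.items():
        for term, sim in sims.items():
            totals[term] = totals.get(term, 0.0) + weights[source] * sim
    weight_sum = sum(weights[source] for source in per_source)
    merged = {term: total / weight_sum for term, total in totals.items()}
    # Most related terms first; ties are broken alphabetically.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical similarities to the query term "ocean wind" from each processor.
per_source = {
    "ontology": {"sea wind": 0.9, "wind stress": 0.5},
    "metadata": {"sea wind": 0.6, "surface wind": 0.8},
    "search_history": {"surface wind": 0.7, "wind stress": 0.4},
    "clickstream": {"sea wind": 0.8, "surface wind": 0.9},
}
weights = {source: 1.0 for source in per_source}
related = integrate_similarities(per_source, weights)
```

Dividing by the total weight of all sources, rather than only the sources that mention a term, rewards terms that several independent sources agree on.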
diff --git a/source/images/architecture.jpg b/source/images/architecture.jpg
new file mode 100644
index 0000000..6d14285
Binary files /dev/null and b/source/images/architecture.jpg differ
diff --git a/source/images/cover.jpg b/source/images/cover.jpg
new file mode 100644
index 0000000..aa39689
Binary files /dev/null and b/source/images/cover.jpg differ
diff --git a/source/images/ranking.png b/source/images/ranking.png
new file mode 100644
index 0000000..17dd504
Binary files /dev/null and b/source/images/ranking.png differ
diff --git a/source/images/recommendation.png b/source/images/recommendation.png
new file mode 100644
index 0000000..2c19d2e
Binary files /dev/null and b/source/images/recommendation.png differ
diff --git a/source/images/vocabulary.png b/source/images/vocabulary.png
new file mode 100644
index 0000000..9f7dda8
Binary files /dev/null and b/source/images/vocabulary.png differ
diff --git a/weekly/update/2018/04/23/ranking-algorithms.html 
b/weekly/update/2018/04/23/ranking-algorithms.html
new file mode 100644
index 0000000..8417575
--- /dev/null
+++ b/weekly/update/2018/04/23/ranking-algorithms.html
@@ -0,0 +1,169 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css"
 />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css"
 />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" 
href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" 
title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+               <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Apache <span 
class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a 
href="http://www.apache.org/foundation/how-it-works.html">Apache Software 
Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache 
License</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD ranking algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>When a user types some keywords into a search engine, there are typically 
hundreds, or even thousands, of datasets related to the given query. Although a 
high level of recall can be useful in some cases, the user is only interested 
in a much smaller subset. Current search engines in most geospatial data 
portals tend to induce end users to focus on one single data 
characteristic/feature dimension (e.g., spatial resolution), which often 
results in a less than optimal user experience (Ghose, Ipeirotis, and Li 
2012).</p>
+
+<p>To overcome this fundamental ranking problem, we 1) identify a number of 
ranking features of geospatial data to represent users’ multidimensional 
preferences by considering semantics, user behaviour, spatial similarity, and 
static dataset metadata attributes; 2) apply a machine learning method to 
automatically learn, from a training set, a function capable of ranking 
geospatial data according to the ranking features.</p>
+
+<p>Within the ranking process, each query is associated with a set of 
datasets, and each dataset is represented as a feature vector. The eleven 
features listed below are identified by considering user behaviour, query-text 
match, and common geospatial metadata attributes.</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-dependent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Lucene relevance score</td>
+    </tr>
+    <tr>
+      <td>Semantic popularity</td>
+    </tr>
+    <tr>
+      <td>Spatial Similarity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<table>
+  <thead>
+    <tr>
+      <th>Query-independent features</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Release date</td>
+    </tr>
+    <tr>
+      <td>Processing level</td>
+    </tr>
+    <tr>
+      <td>Version number</td>
+    </tr>
+    <tr>
+      <td>Spatial resolution</td>
+    </tr>
+    <tr>
+      <td>Temporal resolution</td>
+    </tr>
+    <tr>
+      <td>All-time popularity</td>
+    </tr>
+    <tr>
+      <td>Monthly popularity</td>
+    </tr>
+    <tr>
+      <td>User popularity</td>
+    </tr>
+    <tr>
+      <td> </td>
+    </tr>
+  </tbody>
+</table>
+
+<p>RankSVM, a well-recognized learning-to-rank approach, is selected to learn 
feature weights to rank search results. In RankSVM (Joachims 2002), ranking is 
transformed into a pairwise classification task in which a classifier is 
trained to predict the ranking order of data pairs.</p>
+
+<center>
+       <img src="/images/ranking.png" />
+       Figure 1. System workflow and architecture
+</center>
+
+<p>The proposed architecture consists of six components: a semantic knowledge 
base, a geocoding service, a search index, a feature extractor, a learning 
algorithm, and a ranking model (Figure 1). When a user submits a query, it is 
converted into a semantic query and a geographical bounding box by the 
semantic knowledge base and geocoding service. The search index then returns 
the top K results for the semantic query combined with the bounding box. After 
that, the feature extractor extracts the ranking features for each of the 
search results, including the semantic click score. Once all the features are 
prepared, the top K results are passed to a pre-trained ranking model, which 
re-ranks them. As the index in this architecture can be any Lucene-based 
software, the architecture keeps the data portal loosely coupled and avoids the 
cost of replacing the existing system.</p>
+
+<p>References:</p>
+<ul>
+  <li>
+    <p>Ghose, Anindya, Panagiotis G Ipeirotis, and Beibei Li. 2012. “Designing 
ranking systems for hotels on travel search engines by mining user-generated 
and crowdsourced content.”  Marketing Science 31 (3):493-520.</p>
+  </li>
+  <li>
+    <p>Joachims, Thorsten. 2002. Optimizing search engines using clickthrough 
data. Paper presented at the Proceedings of the eighth ACM SIGKDD international 
conference on Knowledge discovery and data mining.</p>
+  </li>
+</ul>
+
+
+<div>
+
+</div>
+
+<div>
+
+<b>Next:</b> <a 
href="/weekly/update/2018/04/23/recommendation-algorithms.html">An introduction 
to MUDROD recommendation algorithm</a>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software 
Foundation. Licensed under <a 
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache 
SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort 
undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache 
Software Foundation (ASF), sponsored by the Incubator. Incubation is required 
of all newly accepted projects until a further review indicates that the 
infrastructure, communications, and decision making process have stabilized in 
a manner consistent with other successful ASF projects. While incubation status 
is not necessarily a reflection of the completeness or stability of the code, 
it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+
diff --git a/weekly/update/2018/04/23/recommendation-algorithms.html 
b/weekly/update/2018/04/23/recommendation-algorithms.html
new file mode 100644
index 0000000..07181b5
--- /dev/null
+++ b/weekly/update/2018/04/23/recommendation-algorithms.html
@@ -0,0 +1,98 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css"
 />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css"
 />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" 
href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" 
title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+               <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Apache <span 
class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a 
href="http://www.apache.org/foundation/how-it-works.html">Apache Software 
Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache 
License</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD recommendation algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>With the recent advances in remote sensing satellites and other sensors, 
geographic datasets have been growing faster than ever. In response, a number 
of Spatial Data Infrastructure (SDI) components (e.g. catalogues and portals) 
have been developed to archive those datasets and make them available online. 
However, finding the right data for scientific research and application 
development is still a challenge due to the lack of data relevancy 
information.</p>
+
+<p>Recommendation has become extremely common in recent years and is used in 
a variety of areas to help users quickly find useful information. We propose a 
recommendation system to improve geographic data discovery by mining and 
utilizing metadata and usage logs. Metadata abstracts are processed with 
natural language processing methods to find semantic relationships between 
metadata. Metadata variables are used to calculate spatial and temporal 
similarity between metadata. In addition, portal logs are analysed to introduce 
user preference.</p>
+
+<center>
+       <img src="/images/recommendation.png" />
+       Figure 1. Recommendation workflow
+</center>
+
+<p>The system starts by pre-processing raw web logs and metadata (Figure 1). 
After the pre-processing step, sessions are reconstructed from the raw web logs 
and then used to calculate session-based metadata similarity. Metadata are 
harvested from the PO.DAAC web service APIs. Metadata variable values are then 
converted to a unified unit to calculate metadata content similarity. All of 
these similarities are calculated offline and stored in Elasticsearch. Once a 
user views a metadata record, the system finds the top-k related metadata 
records with hybrid recommendation. The hybrid recommendation module 
integrates results from the content-based and session-based recommendation 
methods and ranks the final recommendation list in descending order of 
similarity.</p>
+
+
+<div>
+
+<b>Previous:</b> <a 
href="/weekly/update/2018/04/23/ranking-algorithms.html">An introduction to 
MUDROD ranking algorithm</a>
+
+</div>
+
+<div>
+
+<b>Next:</b> <a 
href="/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html">An 
introduction to MUDROD vocabulary similarity calculation algorithm</a>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software 
Foundation. Licensed under <a 
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache 
SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort 
undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache 
Software Foundation (ASF), sponsored by the Incubator. Incubation is required 
of all newly accepted projects until a further review indicates that the 
infrastructure, communications, and decision making process have stabilized in 
a manner consistent with other successful ASF projects. While incubation status 
is not necessarily a reflection of the completeness or stability of the code, 
it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+
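
Note on the recommendation post above: its hybrid recommendation step (merge content-based and session-based results, then rank in descending order of similarity and keep the top k) can be illustrated with a minimal Python sketch. The post does not say how MUDROD combines the two result sets, so the weighted sum used here, the weights, and every name (`hybrid_recommend`, `w_content`, `w_session`) are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple


def hybrid_recommend(
    content_scores: Dict[str, float],
    session_scores: Dict[str, float],
    k: int = 5,
    w_content: float = 0.5,  # hypothetical weight, not taken from the post
    w_session: float = 0.5,  # hypothetical weight, not taken from the post
) -> List[Tuple[str, float]]:
    """Combine two similarity maps and return the top-k items, best first."""
    combined = {}
    # An item missing from one map contributes 0.0 from that map.
    for item in set(content_scores) | set(session_scores):
        combined[item] = (
            w_content * content_scores.get(item, 0.0)
            + w_session * session_scores.get(item, 0.0)
        )
    # Sort by score descending; break ties on the id for deterministic output.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

In this sketch the two input maps stand in for the similarities that the post says are computed offline and stored in Elasticsearch; only the merge and ranking happen when a user views a record.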
diff --git a/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html 
b/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html
new file mode 100644
index 0000000..45d0d2d
--- /dev/null
+++ b/weekly/update/2018/04/23/vocabulary-similarity-algorithm.html
@@ -0,0 +1,96 @@
+<!DOCTYPE html>
+
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Apache SDAP - Science Data Analytics Platform</title>
+    <link rel="shortcut icon" href="favicon.ico" />
+    <link rel="icon" type="image/png" href="favicon.png" />
+    <link rel="stylesheet" href="css/bootstrap.min.css" />
+    <link rel="stylesheet" href="css/style.css" />
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.min.css"
 />
+    <!--[if lt IE 9]>
+    <link rel="stylesheet" 
href="https://cdnjs.cloudflare.com/ajax/libs/github-fork-ribbon-css/0.2.0/gh-fork-ribbon.ie.min.css"
 />
+    <![endif]--> 
+  </head>
+  <body>
+    <a class="github-fork-ribbon" 
href="https://github.com/apache?utf8=✓&q=incubator-sdap&type=&language=" 
title="Fork me on GitHub" target="_blank" >Fork me on GitHub</a> 
+    <div class="container">
+
+      <div class="logos">
+        <a href="https://incubator.apache.org">
+          <img src="/images/egg-logo.png" class="pull-left" />
+        </a>
+      </div>
+
+      <!-- navigation bar -->
+      <nav class="navbar navbar-default">
+        <div class="container-fluid">
+          <div class="navbar-header">
+            <a class="navbar-brand" href="/">SDAP</a>
+          </div>
+          <div class="navbar-right">
+            <ul class="nav navbar-nav">
+              <li><a href="/docs">Docs</a></li>
+              <li><a href="/blog">Blog</a></li>
+              <li><a href="/team">Team &amp; Community</a></li>
+              <li><a href="/resources">Resources</a></li>
+              <li class="dropdown toggle">
+               <a href="#" class="dropdown-toggle" data-toggle="dropdown" 
role="button" aria-haspopup="true" aria-expanded="false">Apache <span 
class="caret"></span></a>
+                <ul class="dropdown-menu">
+                  <li><a 
href="http://www.apache.org/foundation/how-it-works.html">Apache Software 
Foundation</a></li>
+                  <li><a href="http://www.apache.org/licenses/">Apache 
License</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/sponsorship">Sponsorship</a></li>
+                  <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+                </ul>
+              </li>
+            </ul>
+          </div>
+        </div>
+      </nav>
+
+
+<h1>An introduction to MUDROD vocabulary similarity calculation algorithm</h1>
+
+<p>Posted <b>2018-04-23</b> by <b>Lewis John McGibbney</b></p>
+
+<p>Big geospatial data have been produced, archived and made available online, 
but finding the right data for scientific research and decision-support 
applications remains a significant challenge. A long-standing problem in data 
discovery is how to locate, assimilate and utilize the semantic context for a 
given query. Most past research in the geospatial domain attempts to solve this 
problem through two approaches: 1) building a domain-specific ontology 
manually; 2) discovering semantic relationships through dataset metadata 
automatically using machine learning techniques. The former contains rich 
expert knowledge, but it is static, costly, and labour intensive, while the 
latter is automatic but prone to noise.</p>
+
+<p>An emerging trend in information science is to take advantage of 
large-scale user search history, which is dynamic but contains user- and 
crawler-generated noise. Leveraging the benefits of all three approaches and 
avoiding their weaknesses, a novel approach is proposed in this article to 1) 
discover vocabulary semantic relationships from user clickstream; 2) refine the 
similarity calculation methods from existing ontology; 3) integrate the results 
of ontology, metadata, user search history and clickstream analysis to better 
determine the semantic relationships.</p>
+
+<center>
+       <img src="/images/vocabulary.png" />
+       Figure 1. System workflow and architecture
+</center>
+
+<p>The system starts by pre-processing raw web logs, metadata, and ontology 
(Figure 1). After the pre-processing step, search history and clickstream data 
are extracted from raw logs, selected properties are extracted from metadata, 
and ocean-related triples are extracted from the SWEET ontology. These four 
types of processed data are then passed to their corresponding processors, as 
discussed in the last section. Once all the processors finish their jobs, the 
results of the different methods are integrated to produce a final list of the 
most related terms.</p>
+
+
+<div>
+
+<b>Previous:</b> <a 
href="/weekly/update/2018/04/23/recommendation-algorithms.html">An introduction 
to MUDROD recommendation algorithm</a>
+
+</div>
+
+<div>
+
+</div>
+
+      <!-- footer -->
+      <nav class="navbar navbar-default">
+        <div class="navbar-header">
+          <a class="navbar-brand" href="">SDAP</a>
+        </div>
+        <div class="navbar-text pull-right">&copy; 2017 The Apache Software 
Foundation. Licensed under <a 
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>.<br/>
+        Apache SDAP, SDAP, Apache, the Apache feather logo, and the Apache 
SDAP project logo are trademarks of The Apache Software Foundation.</div>
+        <div class="navbar-text pull-right">Apache SDAP is an effort 
undergoing <a href="https://incubator.apache.org/">Incubation</a> at The Apache 
Software Foundation (ASF), sponsored by the Incubator. Incubation is required 
of all newly accepted projects until a further review indicates that the 
infrastructure, communications, and decision making process have stabilized in 
a manner consistent with other successful ASF projects. While incubation status 
is not necessarily a reflection of the completeness or stability of the code, 
it does indicate that the project has yet to be fully endorsed by the ASF.</div>
+      </nav>
+
+      <script src="js/jquery.min.js"></script>
+      <script src="js/bootstrap.min.js"></script>
+    </div>
+  </body>
+</html>
+
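
Note on the vocabulary similarity post above: its final step, in which the results of the search history, clickstream, metadata and ontology processors are integrated into one list of most related terms, can be illustrated with a minimal Python sketch. The post does not describe the integration rule, so the plain average over sources used here is an assumption, as are the function name `integrate_related_terms`, the source labels, and the sample scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def integrate_related_terms(
    per_source: Dict[str, Dict[str, float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Average each candidate term's similarity over all sources.

    per_source maps a source label to {candidate term: similarity to the
    query term}. A term missing from a source counts as 0.0 for that source.
    """
    n_sources = len(per_source)
    if n_sources == 0:
        return []
    totals = defaultdict(float)
    for scores in per_source.values():
        for term, sim in scores.items():
            totals[term] += sim
    averaged = {term: total / n_sources for term, total in totals.items()}
    # Highest average first; break ties alphabetically for stable output.
    ranked = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Hypothetical similarities to one query term, one map per processor.
example = {
    "search_history": {"sst": 1.0, "wind": 0.5},
    "clickstream": {"sst": 0.5},
    "metadata": {"sst": 1.0, "salinity": 1.0},
    "ontology": {"wind": 0.5, "sst": 0.5},
}
related = integrate_related_terms(example, top_n=2)
```

Averaging over all sources, rather than only those that mention a term, favours terms that several processors agree on; that is one possible design choice, not something stated in the post.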


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
