Shafaq-Siddiqi commented on a change in pull request #1119:
URL: https://github.com/apache/systemds/pull/1119#discussion_r534075365



##########
File path: scripts/staging/datawig/DesignDocument.md
##########
@@ -0,0 +1,52 @@
+# DataWig Design Document
+Julian Rakuschek, Noah Ruhmer
+### Basic Idea
+Let us assume the following table, as presented in the corresponding paper by Prof. Biessmann [1]:
+
+| Type          |      Description     |   Size         | Color         |
+| :-----------: | :------------------- | :------------: | :-----------: |
+|  Shoes        | ideal for running    | 12UK           | Black         |
+| SD Card       | for saving files     | 8GB            | Blue          |
+| Dress         | This yellow dress ...| M              | ???           |
+
+The goal is obviously to impute the Color `Yellow` for the Dress, but how do 
we get there?
+
+First, we select the feature columns `Type`, `Description`, and `Size`; the label column, i.e. the column to be imputed, shall be `Color`.
+
+#### Numerical Encoding x<sup>c</sup>
+The first stage transforms strings and categorical data into a numerical representation:
+
+| Type          |      Description     |   Size         | Color         |
+| :-----------: | :------------------- | :------------: | :-----------: |
+| OHE           | Sequential Encoding  | OHE            | OHE           |
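As an illustration of the one-hot encoding referenced in the table, here is a minimal sketch in plain Python (independent of the SystemDS `toOneHot` built-in, which operates on matrices):

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Each distinct category becomes one column; each row is a unit
    vector with a 1 in its category's column.
    """
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]

# Categories sorted alphabetically: Dress, SD Card, Shoes
print(one_hot(["Shoes", "SD Card", "Dress"]))
# [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```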
+
+Here we can use a one-hot encoder (OHE) for categorical data and a sequential encoding for strings. As SystemDS already has the built-in function `toOneHot`, we plan to use it; for the sequential data we would implement a new built-in function `sequentialEncode`, which would work like this:
+* Let's say Row 1 contains "Shoes" and Row 2 "SD Card"
+* First we assign each unique character a unique index, e.g.
+    * `{S: 1, h: 2, o: 3, e: 4, s: 5, D: 6, " ": 7, C: 8, a: 9, r: 10, d: 11}`
+* Then we replace each character in the string with the corresponding token index, which yields two arrays:
+     * `[1, 2, 3, 4, 5]`
+     * `[1, 6, 7, 8, 9, 10, 11]`
+* This is also how the SequentialEncoder in the Python implementation of DataWig works
+
+#### Feature Extraction
+This is the part we do not yet fully understand. In the paper [1], Prof. Biessmann uses one featurizer per column: for one-hot encoded data an embedding, and for sequential data either "an n-gram representation or a character-based embedding using a long short-term memory (LSTM) recurrent neural network" [1, chapter 3]

Review comment:
       Yes, you may write a built-in for sequential encoding in a separate PR. Please have a look at our `map()` built-in; it will help you with string processing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

