Baunsgaard commented on code in PR #2054:
URL: https://github.com/apache/systemds/pull/2054#discussion_r1696778420


##########
conf/log4j-compression.properties:
##########
@@ -0,0 +1,33 @@
+#-------------------------------------------------------------

Review Comment:
   this file needs to be moved or removed; it is specific to your experiments.
   Optionally, we can add it to the scripts/perftest/ohe_checks folder, and
   then you modify your scripts to use it from there.



##########
scripts/perftest/ohe_checks/README.md:
##########
@@ -0,0 +1,151 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% end comment %}
+-->
+
+# Checking One Hot Encodedness before Compression tests
+
+To run all tests for One Hot Encoding Checks:
+ * install systemds,
+ * make sure that the paths for SYSTEMDS_ROOT, JAVA_HOME, HADOOP_HOME, LOG4JPROP are correctly set

Review Comment:
   this comment is not enough to clearly explain the LOG4JPROP setting;
   could you add a bit more detail on that specific property once you
   move the log4j file.



##########
scripts/perftest/ohe_checks/README.md:
##########
@@ -0,0 +1,151 @@
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% end comment %}
+-->
+
+# Checking One Hot Encodedness before Compression tests
+
+To run all tests for One Hot Encoding Checks:
+ * install systemds,
+ * make sure that the paths for SYSTEMDS_ROOT, JAVA_HOME, HADOOP_HOME, LOG4JPROP are correctly set
+ * run experiments.sh
+
+Alternatively, to run the experiment.dml script directly with OHE checks enabled, use this command:
+ 
+`$SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml --config ohe.xml`
+ 
+Note: You can use -nvargs to set the variables rows, cols, dummy, distinct, repeats (how many times you want to generate a random matrix, transform-encode it and compress it)
+
+(Dummy is the array of column indexes that you would like to One Hot Encode, example: dummy="[1]" will One Hot Encode the first column)
+
+To collect the metrics from the logs for easier comparison, you can run `parse_logs.py` and an excel file called `combined_metrics.xlsx` will be created in this directory.
+---
+# Documentation of Changes to codebase for Implementing OHE Checks
+
+## Flag to enable/disable OHE checks (Disabled by default)
+- Added ``COMPRESSED_ONEHOTDETECT = "sysds.compressed.onehotdetect"`` to ``DMLConfig`` and adjusted the relevant methods
+- Added attribute to ``CompressionSettings`` ``public final boolean oneHotDetect`` and adjusted the methods
+- Adjusted ``CompressionSettingsBuilder`` to check if ``COMPRESSED_ONEHOTDETECT`` has been set to true to enable the checks
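
As a usage sketch, a minimal configuration file enabling the flag might look like the following. Only the property name `sysds.compressed.onehotdetect` is taken from this PR; the `<root>` wrapper follows the usual SystemDS configuration file layout and the file name `ohe.xml` is the one referenced by the experiment scripts:

```xml
<root>
   <!-- hypothetical minimal config: enables the OHE detection described above -->
   <sysds.compressed.onehotdetect>true</sysds.compressed.onehotdetect>
</root>
```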
+
+## Changes in  `CoCoderFactory` 
+
+### 1. Introduction of OHE Detection
+
+**Condition Addition:**
+- Added a condition to check for `cs.oneHotDetect` along with the existing condition `!containsEmptyConstOrIncompressable` in the `findCoCodesByPartitioning` method. This ensures that the process considers OHE detection only if it is enabled in the compression settings.
+- Original code only checked for `containsEmptyConstOrIncompressable` and proceeded to cocode all columns if false. The updated code includes an additional check for `cs.oneHotDetect`.
+
+### 2. New Data Structures for OHE Handling
+
+**New Lists:** Introduced two new lists to manage the OHE detection process:
+- `currentCandidates`: To store the current candidate columns that might form an OHE group.
+- `oheGroups`: To store lists of columns that have been validated as OHE groups.
+
+### 3. Filtering Logic Enhancements
+
+**Column Filtering:** Enhanced the loop that iterates over columns to identify OHE candidates:
+- Columns that are empty, constant, or incompressible are filtered into respective lists.
+- For other columns, they are added to `currentCandidates` if they are deemed candidates (via the `isCandidate` function).
+
+### 4. Addition of `isHotEncoded` Function
+
+**Function Creation:** Created a new `isHotEncoded` function to evaluate if the accumulated columns form a valid OHE group.
+- **Parameters:** Takes a list of column groups (`colGroups`), a boolean flag (`isSample`), an array of non-zero counts (`nnzCols`), and the number of rows (`numRows`).
+- **Return Type:** Returns a `String` indicating the status of the current candidates:
+  - `"POTENTIAL_OHE"`: When the current candidates could still form an OHE group.
+  - `"NOT_OHE"`: When the current candidates cannot form an OHE group.
+  - `"VALID_OHE"`: When the current candidates form a valid OHE group.
+- **Logic:** The function calculates the total number of distinct values and offsets, and checks if they meet the criteria for forming an OHE group.
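
The exact criteria are not spelled out above; as a rough illustration only, a simplified stand-in for `isHotEncoded` (assuming, purely for this sketch, that the check reduces to comparing the summed non-zeros of the candidate columns against the row count, and ignoring the `isSample` and distinct-count handling) could look like:

```python
def is_hot_encoded(candidate_nnz, num_rows):
    """Simplified illustration of the three-way status described above.

    candidate_nnz: non-zero counts of the current candidate columns.
    num_rows: number of rows in the matrix.

    In a complete one-hot group every row carries exactly one 1 across
    the group, so the summed non-zeros of a full group equal the row count.
    """
    total_nnz = sum(candidate_nnz)
    if total_nnz < num_rows:
        return "POTENTIAL_OHE"  # group may still grow into a full OHE set
    if total_nnz == num_rows:
        return "VALID_OHE"      # non-zeros cover every row exactly once
    return "NOT_OHE"            # too many non-zeros: cannot be one-hot
```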
+
+### 5. Enhanced Group Handling
+
+**Candidate Processing:** Within the loop, after adding a column to `currentCandidates`:
+- Calls `isHotEncoded` to check the status of the candidates.
+- If `isHotEncoded` returns `"NOT_OHE"`, moves the candidates to regular groups and clears the candidates list.
+- If `isHotEncoded` returns `"VALID_OHE"`, moves the candidates to `oheGroups` and clears the candidates list.
+- If `isHotEncoded` returns `"POTENTIAL_OHE"`, continues accumulating candidates.
+
+### 6. Final Candidate Check
+
+**Post-loop Check:** After the loop, checks any remaining `currentCandidates`:
+- If they form a valid OHE group, adds them to `oheGroups`.
+- Otherwise, adds them to regular groups.
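
Taken together, sections 5 and 6 amount to a single accumulate-and-flush pass. A Python sketch of that loop (hypothetical names throughout, with a simplified nnz-vs-rows completeness rule standing in for the real `isHotEncoded`):

```python
def partition_columns(columns, num_rows):
    """Sketch of the candidate-accumulation loop described above.

    columns: (col_index, nnz) pairs, already stripped of
    empty/const/incompressible columns. Returns (ohe_groups, groups).
    """
    def status(cands):
        # simplified completeness rule: summed non-zeros vs. row count
        t = sum(nnz for _, nnz in cands)
        if t < num_rows:
            return "POTENTIAL_OHE"
        return "VALID_OHE" if t == num_rows else "NOT_OHE"

    current_candidates, ohe_groups, groups = [], [], []
    for col in columns:
        current_candidates.append(col)
        s = status(current_candidates)
        if s == "NOT_OHE":
            groups.extend(current_candidates)  # fall back to regular groups
            current_candidates = []
        elif s == "VALID_OHE":
            ohe_groups.append([idx for idx, _ in current_candidates])
            current_candidates = []
        # "POTENTIAL_OHE": keep accumulating
    # post-loop check: flush whatever remains
    if current_candidates:
        if status(current_candidates) == "VALID_OHE":
            ohe_groups.append([idx for idx, _ in current_candidates])
        else:
            groups.extend(current_candidates)
    return ohe_groups, groups
```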
+
+### 7. Overwrite and CoCode Groups
+
+**Overwrite Groups:** Updates `colInfos.compressionInfo` with the processed `groups`.
+**OHE Group Integration:** Combines indexes for validated OHE groups and adds them to the final `groups`.
+
+## One Hot Encoded Columns Compression in `ColGroupFactory`
+
+### Description
+
+The `compressOHE` function is designed to compress columns that are one-hot encoded (OHE). It validates and processes the input data to ensure it meets the criteria for one-hot encoding, and if so, it compresses the data accordingly. If the data does not meet the OHE criteria, it falls back to a direct compression method (`directCompressDDC`).
+
+### Implementation Details
+
+1. **Validation of `numVals`**:
+   - Ensures the number of distinct values (`numVals`) in the column group is greater than 0.
+   - Throws a `DMLCompressionException` if `numVals` is less than or equal to 0.
+
+2. **Handling Transposed Matrix**:
+   - If the matrix is transposed (`cs.transposed` is `true`):
+     - Creates a `MapToFactory` data structure with an additional unique value.
+     - Iterates through the sparse block of the matrix, checking for non-one values or multiple ones in the same row.
+     - If a column index in the sparse block is empty, or if non-one values or multiple ones are found, it falls back to `directCompressDDC`.
+
+3. **Handling Non-Transposed Matrix**:
+   - If the matrix is not transposed (`cs.transposed` is `false`):
+     - Creates a `MapToFactory` data structure.
+     - Iterates through each row of the matrix:
+       - Checks for the presence of exactly one '1' in the columns specified by `colIndexes`.
+       - If multiple ones are found in the same row, or if no '1' is found in a sample row, it falls back to `directCompressDDC`.
+
+4. **Return Value**:
+   - If the data meets the OHE criteria, returns a `ColGroupDDC` created with the column indexes, an `IdentityDictionary`, and the data.
+   - If the data does not meet the OHE criteria, returns the result of `directCompressDDC`.
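
As an illustration of step 3, the per-row scan could be sketched as follows, in plain Python over a dense row-major list of lists; `scan_rows_for_ohe` and the `None`-as-fallback convention are hypothetical, not the actual `ColGroupFactory` API:

```python
def scan_rows_for_ohe(matrix, col_indexes):
    """Check that every row holds exactly one 1 (and otherwise only 0s)
    in the selected columns. Returns the per-row index of the hot column,
    or None to signal that the caller should fall back (the role
    directCompressDDC plays in the description above)."""
    mapping = []
    for row in matrix:
        hot = -1
        for j in col_indexes:
            v = row[j]
            if v == 1:
                if hot != -1:
                    return None  # multiple ones in the same row
                hot = j
            elif v != 0:
                return None      # non-binary value: not one-hot
        if hot == -1:
            return None          # no 1 found in this row
        mapping.append(hot)
    return mapping
```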
+
+## Add method in `ColGroupSizes`
+Added method ``estimateInMemorySizeOHE(int nrColumns, boolean contiguousColumns, int nrRows)``
+
+## Add method in `AComEst`
+Added a getter method `getNnzCols`
+
+## Edit `distinctCountScale` method in `ComEstSample`

Review Comment:
   all these edit details should not be here; we should just make the fixes
   in the system itself instead of documenting them in this README.



##########
src/main/java/org/apache/sysds/runtime/compress/estim/CompressedSizeInfo.java:
##########
@@ -95,6 +95,8 @@ public String getEstimatedDistinct() {
                StringBuilder sb = new StringBuilder();
                if(compressionInfo == null)
                        return "";
+               if(compressionInfo.size()<=0)

Review Comment:
   if this happens, it is a bug; we should probably throw an exception.



##########
scripts/perftest/ohe_checks/experiments.sh:
##########
@@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+mkdir BaselineLogs
+mkdir OHELogs
+run_base() {
+    $SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml \
+    --seed 42 --debug -nvargs rows=$1 cols=$2 dummy="$3" distinct=$4 > BaselineLogs/${5}_${1}_rows_${2}_cols_${3}_encoded_base.txt 2>&1
+}
+
+run_ohe() {
+    $SYSTEMDS_ROOT/bin/systemds $SYSTEMDS_ROOT/target/SystemDS.jar experiment.dml \
+    --seed 42 --debug --config ohe.xml -nvargs rows=$1 cols=$2 dummy="$3" distinct=$4 > OHELogs/${5}_${1}_rows_${2}_cols_${3}_encoded_ohe.txt 2>&1
+}
+
+# Run same experiments but checking One-Hot-Encoded columns first
+run_ohe 1000 1 "[1]" 10 1
+run_ohe 1000 5 "[2]" 10 2
+run_ohe 1000 5 "[1,2]" 10 3
+run_ohe 1000 5 "[1,2,3]" 10 4
+run_ohe 1000 5 "[1,2,3,4,5]" 10 5
+run_ohe 1000 10 "[1,3,5]" 10 6
+run_ohe 1000 10 "[1,2,5,6]" 10 7
+run_ohe 100000 1 "[1]" 100 8
+run_ohe 100000 5 "[1,2]" 100 9
+run_ohe 100000 5 "[1,2,3]" 100 10
+run_ohe 100000 100 "[1,3,50,60,70,80]" 100 11
+run_ohe 100000 100 "[1,2,24,25,50,51]" 100 12
+
+# Run baseline experiments
+run_base 1000 1 "[1]" 10 1
+run_base 1000 5 "[2]" 10 2
+run_base 1000 5 "[1,2]" 10 3
+run_base 1000 5 "[1,2,3]" 10 4
+run_base 1000 5 "[1,2,3,4,5]" 10 5
+run_base 1000 10 "[1,3,5]" 10 6
+run_base 1000 10 "[1,2,5,6]" 10 7
+run_base 100000 1 "[1]" 100 8
+run_base 100000 5 "[1,2]" 100 9
+run_base 100000 5 "[1,2,3]" 100 10
+run_base 100000 100 "[1,3,50,60,70,80]" 100 11
+run_base 100000 100 "[1,2,24,25,50,51]" 100 12

Review Comment:
   missing newline at the end of the file.



##########
conf/log4j-compression.properties:
##########
@@ -0,0 +1,33 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+log4j.rootLogger=ERROR, console
+
+log4j.logger.org.apache.sysds=INFO
+log4j.logger.org.apache.sysds.runtime.compress=DEBUG
+log4j.logger.org.apache.spark=ERROR
+log4j.logger.org.apache.spark.SparkContext=OFF
+log4j.logger.org.apache.hadoop=ERROR
+
+log4j.appender.console=org.apache.log4j.ConsoleAppender
+log4j.appender.console.target=System.err
+log4j.appender.console.layout=org.apache.log4j.PatternLayout
+log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n

Review Comment:
   add a newline at the end of the file, to make GitHub happy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
