Hello community,

Here is the log from the commit of package python-logreduce for
openSUSE:Factory checked in at 2019-04-01 12:35:03
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/python-logreduce (Old)
 and      /work/SRC/openSUSE:Factory/.python-logreduce.new.25356 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "python-logreduce"

Mon Apr  1 12:35:03 2019 rev:7 rq:686071 version:0.4.0

Changes:
--------
--- /work/SRC/openSUSE:Factory/python-logreduce/python-logreduce.changes        2018-12-24 11:39:25.645556701 +0100
+++ /work/SRC/openSUSE:Factory/.python-logreduce.new.25356/python-logreduce.changes    2019-04-01 12:35:10.017825784 +0200
@@ -1,0 +2,48 @@
+Mon Mar 18 09:23:25 UTC 2019 - Dirk Mueller <[email protected]>
+
+- update to 0.4.0:
+  * Bump model version and fix typo
+  * Add HashingAnnoy model
+  * Add hashing_nn benchmark in doc string
+  * Add HashingApproximateNeighbors model
+  * Implement iterator interface for file-like objects
+  * Refactor TokenizerTests
+  * Provide a bit more info about timings of the training
+  * Remove support for bag-of-words_lshf
+  * Don't store duplicate data in model
+  * Fix heat_uuid regexp formatting
+  * Relax digits_re again a bit
+  * Vectorizer optimisation: don't do word analysing
+  * debug_lineprocess: Handle more than one input file
+  * debug_lineprocess: Format output slightly nicer and remove duplicates
+  * Tighten heat_uuid regexp
+  * Tighten length-based regexp matches properly
+  * debug_lineprocess add some simple word / token statistics
+  * Blacklist .xml extension
+  * Use for loop instead of handcrafted while construct
+  * tests: use free tcp port for gearman server
+  * Add --model-type argument to top-level command
+  * tokenizer: remove sshd warnings
+  * Make debugging scripts callable again
+  * Reduce code duplication a bit
+  * Micro-optimize the tokenization
+  * ci: enable gate jobs
+  * Make systemd service file SCL independent
+  * Transition webui related files to the log-classify name
+  * Match uuid_re before heat_re
+  * Use SqlAlchemy intrinsics for ordering
+  * Fix overly greedy date tokenization
+  * Fix tokenization error on removing SSH fingerprints
+  * DRY: Remove implementation override that also exists in the base class
+  * Fix assertEquals() deprecation warning
+  * Use generator for reading files
+  * tokenizer regexp speedups
+  * cmd: add --json argument to report options
+  * Spelling typos
+  * logreduce: Fix inconsistency for model_file in model-run
+  * logreduce.spec: Fixes
+  * README: Add openSUSE instructions
+  * Add py36/py37 to the env list as well
+  * Run pep8 against pip installed flake8
+
+-------------------------------------------------------------------

Old:
----
  logreduce-0.3.0.tar.gz

New:
----
  logreduce-0.4.0.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ python-logreduce.spec ++++++
--- /var/tmp/diff_new_pack.o1KJfu/_old  2019-04-01 12:35:11.441826149 +0200
+++ /var/tmp/diff_new_pack.o1KJfu/_new  2019-04-01 12:35:11.441826149 +0200
@@ -1,7 +1,7 @@
 #
 # spec file for package python-logreduce
 #
-# Copyright (c) 2018 SUSE LINUX GmbH, Nuernberg, Germany.
+# Copyright (c) 2019 SUSE LINUX GmbH, Nuernberg, Germany.
 #
 # All modifications and additions to the file contributed by third parties
 # remain the property of their copyright owners, unless otherwise agreed
@@ -19,7 +19,7 @@
 %{?!python_module:%define python_module() python-%{**} python3-%{**}}
 %define skip_python2 1
 Name:           python-logreduce
-Version:        0.3.0
+Version:        0.4.0
 Release:        0
 Summary:        Log file anomaly extractor
 License:        Apache-2.0

++++++ logreduce-0.3.0.tar.gz -> logreduce-0.4.0.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/.zuul.yaml new/logreduce-0.4.0/.zuul.yaml
--- old/logreduce-0.3.0/.zuul.yaml      2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/.zuul.yaml      2018-11-08 08:39:56.000000000 +0100
@@ -10,7 +10,7 @@
     nodeset:
       nodes:
         - name: container
-          label: f27-oci
+          label: runc-fedora
 
 - project:
     name: logreduce
@@ -21,15 +21,25 @@
             nodeset:
               nodes:
                 - name: testrunner
-                  label: fedora-oci
+                  label: runc-fedora
         - tox-py35:
             nodeset:
               nodes:
                 - name: testrunner
-                  label: fedora-oci
+                  label: runc-fedora
     gate:
       jobs:
-        - noop
+        - logreduce-tests
+        - tox-pep8:
+            nodeset:
+              nodes:
+                - name: testrunner
+                  label: runc-fedora
+        - tox-py35:
+            nodeset:
+              nodes:
+                - name: testrunner
+                  label: runc-fedora
     release:
       jobs:
         - upload-pypi
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/ChangeLog new/logreduce-0.4.0/ChangeLog
--- old/logreduce-0.3.0/ChangeLog       2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/ChangeLog       2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,53 @@
 CHANGES
 =======
 
+0.4.0
+-----
+
+* Bump model version and fix typo
+* Add HashingAnnoy model
+* Add hashing_nn benchmark in doc string
+* Add HashingApproximateNeighbors model
+* Implement iterator interface for file-like objects
+* Refactor TokenizerTests
+* Provide a bit more info about timings of the training
+* Remove support for bag-of-words_lshf
+* Don't store duplicate data in model
+* Fix heat_uuid regexp formatting
+* Relax digits_re again a bit
+* Vectorizer optimisation: don't do word analysing
+* debug_lineprocess: Handle more than one input file
+* debug_lineprocess: Format output slightly nicer and remove duplicates
+* Tighten heat_uuid regexp
+* Tighten length-based regexp matches properly
+* debug_lineprocess add some simple word / token statistics
+* Blacklist .xml extension
+* Use for loop instead of handcrafted while construct
+* tests: use free tcp port for gearman server
+* Add --model-type argument to top-level command
+* tokenizer: remove sshd warnings
+* Make debugging scripts callable again
+* Reduce code duplication a bit
+* Micro-optimize the tokenization
+* ci: enable gate jobs
+* Make systemd service file SCL independent
+* Transition webui related files to the log-classify name
+* Match uuid_re before heat_re
+* Use SqlAlchemy intrinsics for ordering
+* Fix overly greedy date tokenization
+* Fix tokenization error on removing SSH fingerprints
+* DRY: Remove implementation override that also exists in the base class
+* Fix assertEquals() deprecation warning
+* Use generator for reading files
+* tokenizer regexp speedups
+* cmd: add --json argument to report options
+* Spelling typos
+* logreduce: Fix inconsistency for model_file in model-run
+* logreduce.spec: Fixes
+* README: Add openSUSE instructions
+* Add py36/py37 to the env list as well
+* Run pep8 against pip installed flake8
+
 0.3.0
 -----
 
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/PKG-INFO new/logreduce-0.4.0/PKG-INFO
--- old/logreduce-0.3.0/PKG-INFO        2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/PKG-INFO        2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: logreduce
-Version: 0.3.0
+Version: 0.4.0
 Summary: Extract anomalies from log files
 Home-page: https://logreduce.softwarefactory-project.io/
 Author: Tristan Cacqueray
@@ -52,6 +52,18 @@
           python3 setup.py develop --user
           popd
         
+        
+        * openSUSE:
+        
+        .. code-block:: console
+        
+          sudo zypper install python3-scikit-learn
+          git clone https://softwarefactory-project.io/r/logreduce
+          pushd logreduce
+          python3 setup.py develop --user
+          popd
+        
+        
         * Pip:
         
         .. code-block:: console
@@ -159,7 +171,7 @@
         * logreduce-server: the REST and Gearman server
         * logreduce-worker: job executor
         * logreduce-client: client cli
-        * logreduce-ui: web ui
+        * logreduce-webui: logreduce web interface
         
         API
         ...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/README.rst new/logreduce-0.4.0/README.rst
--- old/logreduce-0.3.0/README.rst      2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/README.rst      2018-11-08 08:39:56.000000000 +0100
@@ -44,6 +44,18 @@
   python3 setup.py develop --user
   popd
 
+
+* openSUSE:
+
+.. code-block:: console
+
+  sudo zypper install python3-scikit-learn
+  git clone https://softwarefactory-project.io/r/logreduce
+  pushd logreduce
+  python3 setup.py develop --user
+  popd
+
+
 * Pip:
 
 .. code-block:: console
@@ -151,7 +163,7 @@
 * logreduce-server: the REST and Gearman server
 * logreduce-worker: job executor
 * logreduce-client: client cli
-* logreduce-ui: web ui
+* logreduce-webui: logreduce web interface
 
 API
 ...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/doc/index.rst new/logreduce-0.4.0/doc/index.rst
--- old/logreduce-0.3.0/doc/index.rst   2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/doc/index.rst   2018-11-08 08:39:56.000000000 +0100
@@ -44,6 +44,18 @@
   python3 setup.py develop --user
   popd
 
+
+* openSUSE:
+
+.. code-block:: console
+
+  sudo zypper install python3-scikit-learn
+  git clone https://softwarefactory-project.io/r/logreduce
+  pushd logreduce
+  python3 setup.py develop --user
+  popd
+
+
 * Pip:
 
 .. code-block:: console
@@ -151,7 +163,7 @@
 * logreduce-server: the REST and Gearman server
 * logreduce-worker: job executor
 * logreduce-client: client cli
-* logreduce-ui: web ui
+* logreduce-webui: logreduce web interface
 
 API
 ...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/etc/httpd/log-classify.conf new/logreduce-0.4.0/etc/httpd/log-classify.conf
--- old/logreduce-0.3.0/etc/httpd/log-classify.conf     1970-01-01 01:00:00.000000000 +0100
+++ new/logreduce-0.4.0/etc/httpd/log-classify.conf     2018-11-08 08:39:56.000000000 +0100
@@ -0,0 +1,27 @@
+ProxyVia On
+ProxyRequests Off
+RewriteEngine on
+
+<Directory /var/www/log-classify>
+    Options Indexes SymLinksIfOwnerMatch
+    Require all granted
+    IndexOptions FancyIndexing HTMLTable NameWidth=* SuppressDescription
+</Directory>
+
+Alias /log-classify/datasets /var/www/log-classify/anomalies
+
+<Directory /usr/share/log-classify>
+    DirectoryIndex index.html
+    Require all granted
+    Order allow,deny
+    Allow from all
+</Directory>
+
+Alias /log-classify /usr/share/log-classify
+# Don't rewrite files or directories
+RewriteRule ^/log-classify/api/(.*)$ http://localhost:20004/api/$1 [L,P]
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-f
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-d
+RewriteCond /usr/share/%{REQUEST_FILENAME} !-l
+# Rewrite everything else to index.html to allow html5 state links
+RewriteRule ^/log-classify/.*$ /usr/share/log-classify/index.html [L]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/etc/httpd/logreduce.conf new/logreduce-0.4.0/etc/httpd/logreduce.conf
--- old/logreduce-0.3.0/etc/httpd/logreduce.conf        2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/httpd/logreduce.conf        1970-01-01 01:00:00.000000000 +0100
@@ -1,27 +0,0 @@
-ProxyVia On
-ProxyRequests Off
-RewriteEngine on
-
-<Directory /var/www/logreduce>
-    Options Indexes SymLinksIfOwnerMatch
-    Require all granted
-    IndexOptions FancyIndexing HTMLTable NameWidth=* SuppressDescription
-</Directory>
-
-Alias /log-classify/datasets /var/www/logreduce/anomalies
-
-<Directory /usr/share/log-classify>
-    DirectoryIndex index.html
-    Require all granted
-    Order allow,deny
-    Allow from all
-</Directory>
-
-Alias /log-classify /usr/share/log-classify
-# Don't rewrite files or directories
-RewriteRule ^/log-classify/api/(.*)$ http://localhost:20004/api/$1 [L,P]
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-f
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-d
-RewriteCond /usr/share/log-classify/%{REQUEST_FILENAME} !-l
-# Rewrite everything else to index.html to allow html5 state links
-RewriteRule ^/log-classify/.*$ /usr/share/log-classify/index.html [L]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/etc/logreduce/config.yaml new/logreduce-0.4.0/etc/logreduce/config.yaml
--- old/logreduce-0.3.0/etc/logreduce/config.yaml       2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/logreduce/config.yaml       2018-11-08 08:39:56.000000000 +0100
@@ -13,9 +13,9 @@
   # Where the models are saved locally
   models_folder: /var/lib/logreduce/models
   # Where the archived dataset are stored locally
-  dataset_folder: /var/www/logreduce/anomalies
+  dataset_folder: /var/www/log-classify/anomalies
   # Where the logs are expected or downloaded
-  logserver_folder: /var/www/logreduce/logs
+  logserver_folder: /var/www/log-classify/logs
 logging:
   loggers:
     logreduce:
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/etc/systemd/logreduce-server.service new/logreduce-0.4.0/etc/systemd/logreduce-server.service
--- old/logreduce-0.3.0/etc/systemd/logreduce-server.service    2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/systemd/logreduce-server.service    2018-11-08 08:39:56.000000000 +0100
@@ -7,8 +7,7 @@
 User=logreduce
 Group=logreduce
 SyslogIdentifier=logreduce-server
-EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3
-ExecStart=/opt/rh/rh-python35/root/usr/bin/logreduce-server -c /etc/opt/rh/rh-python35/logreduce/config.yaml
+ExecStart=/usr/bin/logreduce-server -c /etc/logreduce/config.yaml
 
 [Install]
 WantedBy=multi-user.target
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/etc/systemd/logreduce-worker.service new/logreduce-0.4.0/etc/systemd/logreduce-worker.service
--- old/logreduce-0.3.0/etc/systemd/logreduce-worker.service    2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/etc/systemd/logreduce-worker.service    2018-11-08 08:39:56.000000000 +0100
@@ -7,8 +7,7 @@
 User=logreduce
 Group=logreduce
 SyslogIdentifier=logreduce-worker
-EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3
-ExecStart=/opt/rh/rh-python35/root/usr/bin/logreduce-worker -c /etc/opt/rh/rh-python35/logreduce/config.yaml
+ExecStart=/usr/bin/logreduce-worker -c /etc/logreduce/config.yaml
 
 [Install]
 WantedBy=multi-user.target
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/cmd.py new/logreduce-0.4.0/logreduce/cmd.py
--- old/logreduce-0.3.0/logreduce/cmd.py        2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/cmd.py        2018-11-08 08:39:56.000000000 +0100
@@ -41,7 +41,6 @@
             parser.print_help()
             exit(4)
         logreduce.utils.setup_logging(args.debug)
-        self.model_type = "hashing_nn"
         self.job = None
         self.exclude_file = logreduce.utils.DEFAULT_IGNORE_FILES
         self.exclude_path = logreduce.utils.DEFAULT_IGNORE_PATHS
@@ -87,6 +86,9 @@
         parser.add_argument("--tmp-dir", default=os.getcwd())
         parser.add_argument("--cacheonly", action="store_true",
                             help="Do not download any logs")
+        parser.add_argument("--model-type", default="hashing_nn",
+                            choices=list(models.keys()),
+                            help="The model type")
 
         # Common arguments
         def path_filters(s):
@@ -117,6 +119,8 @@
 
         def report_filters(s):
             s.add_argument("--html", metavar="FILE", help="Render html result")
+            s.add_argument("--json", metavar="FILE",
+                           help="Optional json output")
             s.add_argument("--static-location",
                            help="The js/css static directory location")
             s.add_argument("--threshold", default=0.2, type=float,
@@ -137,9 +141,6 @@
         def model_filters(s):
             s.add_argument("--max-age", type=int, default=7,
                            help="Maximum age of a model")
-            s.add_argument("--model-type", default="hashing_nn",
-                           choices=list(models.keys()),
-                           help="The model type")
 
         def journal_filters(s):
             s.add_argument("--range", choices=("day", "week", "month"),
@@ -158,7 +159,7 @@
             s.set_defaults(func=self.model_run)
             path_filters(s)
             report_filters(s)
-            s.add_argument("model_file", metavar="FILE")
+            s.add_argument("model_file")
             s.add_argument("target", nargs='+')
 
         # Local directory
@@ -256,8 +257,6 @@
             s = sub.add_parser("diff", help="Compare directories/files")
             s.set_defaults(func=self.diff)
             report_filters(s)
-            s.add_argument("--json", metavar="FILE",
-                           help="Optional json output")
             s.add_argument("baseline", nargs='+')
             s.add_argument("target")
 
@@ -439,7 +438,7 @@
     def diff(self, baseline, target):
         clf = self._get_classifier()
         clf.train(baseline)
-        self._report(clf, target, json_file=self.json)
+        self._report(clf, target)
 
     def download_logs(self, logs_url, target_dir=None):
         if logs_url.endswith("/job-output.txt.gz"):
@@ -486,13 +485,13 @@
         clf.include_path = self.include_path
         return clf
 
-    def _report(self, clf, target_dirs, target_source=None, json_file=None):
+    def _report(self, clf, target_dirs, target_source=None):
         if self.context_length is not None:
             self.before_context = self.context_length
             self.after_context = self.context_length
 
         console_output = True
-        if json_file or self.html:
+        if self.json or self.html:
             console_output = False
         output = clf.process(path=target_dirs,
                              path_source=target_source,
@@ -508,8 +507,8 @@
                 render_html(output, self.static_location))
             open(self.html.replace(".html", ".json"), "w").write(
                 json.dumps(output))
-        if json_file is not None:
-            open(json_file, "w").write(json.dumps(output))
+        if self.json:
+            open(self.json, "w").write(json.dumps(output))
         else:
             print("%02.2f%% reduction (from %d lines to %d)" % (
                 output["reduction"],
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/models.py new/logreduce-0.4.0/logreduce/models.py
--- old/logreduce-0.3.0/logreduce/models.py     2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/models.py     2018-11-08 08:39:56.000000000 +0100
@@ -13,8 +13,9 @@
 import os
 import warnings
 
+import numpy as np
+
 from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.neighbors import LSHForest
 from sklearn.neighbors import NearestNeighbors
 from sklearn.feature_extraction.text import HashingVectorizer
 # from sklearn import svm
@@ -55,23 +56,20 @@
         return [0.5] * len(test_data)
 
 
-class LSHF(Model):
-    """Forest model, faster for on large index (>20000 samples)"""
+class SimpleNeighbors(Model):
+    """Simple NN model"""
     def __init__(self, name=""):
         super().__init__(name)
         self.vectorizer = TfidfVectorizer(
-            analyzer='word', lowercase=False, tokenizer=None,
+            analyzer=str.split, lowercase=False, tokenizer=None,
             preprocessor=None, stop_words=None)
-
-        self.lshf = LSHForest(
-            random_state=int(os.environ.get("LR_RANDOM_STATE", 42)),
-            n_estimators=int(os.environ.get("LR_N_ESTIMATORS", 23)))
+        self.nn = NearestNeighbors(
+            algorithm='brute',
+            metric='cosine')
 
     def train(self, train_data):
-        with warnings.catch_warnings():
-            warnings.simplefilter("ignore")
-            dat = self.vectorizer.fit_transform(train_data)
-            self.lshf.fit(dat)
+        dat = self.vectorizer.fit_transform(train_data)
+        self.nn.fit(dat)
         self.info = "%d samples, %d features" % dat.shape
         return dat
 
@@ -82,24 +80,31 @@
                 chunk = test_data[chunk_pos:min(len(test_data),
                                                 chunk_pos + CHUNK_SIZE)]
                 dat = self.vectorizer.transform(chunk)
-                distances, _ = self.lshf.kneighbors(dat, n_neighbors=1)
+                distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
                 all_distances.extend(distances)
         return all_distances
 
 
-class SimpleNeighbors(Model):
-    """Simple NN model"""
+class HashingNeighbors(Model):
+    """ HashingVectorized NN model.
+    Fastest implementation for low sample sizes (<1e5),
+    logreduce-tests benchmark: 12sec
+    """
     def __init__(self, name=""):
         super().__init__(name)
-        self.vectorizer = TfidfVectorizer(
-            analyzer='word', lowercase=False, tokenizer=None,
+        self.vectorizer = HashingVectorizer(
+            binary=True, n_features=2**18,
+            analyzer=str.split, lowercase=False, tokenizer=None,
             preprocessor=None, stop_words=None)
+        # HashingVectorizer produces sparse vectors, and
+        # sorted(sklearn.neighbors.VALID_METRICS_SPARSE['algorithm']) is
+        # empty for anything != brute
         self.nn = NearestNeighbors(
-            algorithm='brute',
-            metric='cosine')
+            algorithm='brute', metric='cosine',
+            n_jobs=1, n_neighbors=1)
 
     def train(self, train_data):
-        dat = self.vectorizer.fit_transform(train_data)
+        dat = self.vectorizer.transform(train_data)
         self.nn.fit(dat)
         self.info = "%d samples, %d features" % dat.shape
         return dat
@@ -111,46 +116,105 @@
                 chunk = test_data[chunk_pos:min(len(test_data),
                                                 chunk_pos + CHUNK_SIZE)]
                 dat = self.vectorizer.transform(chunk)
-                distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
+                distances, _ = self.nn.kneighbors(dat)
                 all_distances.extend(distances)
         return all_distances
 
 
-class HashingNeighbors(Model):
-    """Simple NN model"""
-    # True random words
+class HashingApproximateNeighbors(Model):
+    """ Approximate Nearest Neighbor Search.
+    This implementation is rather slow, logreduce-tests benchmark: 60sec.
+    The code may be optimized to not record training data since we don't care
+    what the actual neighbor is, and it should simply return distance as float
+    and not str.
+
+    TODO: benchmark with higher sample size.
+    """
     def __init__(self, name=""):
         super().__init__(name)
         self.vectorizer = HashingVectorizer(
             binary=True,
-            analyzer='word', lowercase=False, tokenizer=None,
+            analyzer=str.split, lowercase=False, tokenizer=None,
             preprocessor=None, stop_words=None)
-        self.nn = NearestNeighbors(algorithm='brute', metric='cosine')
 
     def train(self, train_data):
+        try:
+            import pysparnn.cluster_index as ci
+        except ImportError:
+            raise RuntimeError("Install this dependency to use this model: "
+                               "https://github.com/facebookresearch/pysparnn")
+        train_data = list(train_data)
         dat = self.vectorizer.transform(train_data)
-        self.nn.fit(dat)
-        self.info = "%d samples, %d features" % dat.shape
-        return dat
+        self.nn = ci.MultiClusterIndex(dat, train_data)
+        self.info = ''
 
     def test(self, test_data):
         all_distances = []
-        with warnings.catch_warnings():
-            for chunk_pos in range(0, len(test_data), CHUNK_SIZE):
-                chunk = test_data[chunk_pos:min(len(test_data),
-                                                chunk_pos + CHUNK_SIZE)]
-                dat = self.vectorizer.transform(chunk)
-                distances, _ = self.nn.kneighbors(dat, n_neighbors=1)
-                all_distances.extend(distances)
+        for chunk_pos in range(0, len(test_data), CHUNK_SIZE):
+            chunk = test_data[chunk_pos:min(len(test_data),
+                                            chunk_pos + CHUNK_SIZE)]
+            dat = self.vectorizer.transform(chunk)
+            distances = self.nn.search(
+                dat, k=1, k_clusters=2, return_distance=True)
+            # Work around str format of distance...
+            for distance in distances:
+                if distance[0][0].startswith('-'):
+                    all_distances.append([0.0])
+                    continue
+                all_distances.append([float(distance[0][0])])
         return all_distances
 
-    def process_line(self, line):
-        return Tokenizer.process(line)
+
+class HashingAnnoy(Model):
+    """HashingAnnoy NN model.
+    logreduce-tests FAILED: 85.66% accuracy, 21.84% false-positive,
+    logreduce-tests benchmark: 56sec
+
+    TODO: test and benchmark with higher sample size.
+    """
+    def __init__(self, name=""):
+        try:
+            from annoy import AnnoyIndex
+        except ImportError:
+            raise RuntimeError("Install annoy library first")
+        super().__init__(name)
+        features = 2**13
+        self.vectorizer = HashingVectorizer(
+            binary=True, n_features=features,
+            analyzer=str.split, lowercase=False, tokenizer=None,
+            preprocessor=None, stop_words=None)
+        self.nn = AnnoyIndex(features)
+
+    def train(self, train_data):
+        dat = self.vectorizer.transform(train_data)
+        for idx in range(len(train_data)):
+            self.nn.add_item(idx, dat[idx].toarray()[0])
+        self.nn.build(10)  # n trees
+        self.info = "%d samples, %d features" % dat.shape
+        return dat
+
+    def test(self, test_data):
+        all_distances = []
+        dat = self.vectorizer.transform(test_data)
+        for v in dat:
+            d = self.nn.get_nns_by_vector(
+                v.toarray()[0], 1, include_distances=True)
+            all_distances.append([d[1][0]])
+        # normalize
+        # l1
+        # norm = np.sum(all_distances)
+        # l2
+        norm = np.sqrt(np.sum(np.square(all_distances)))
+        normalized_distances = all_distances / norm
+        # Scores are much lower, increase artificially here for now
+        normalized_distances *= 2
+        return normalized_distances
 
 
 models = {
-    'bag-of-words_lshf': LSHF,
     'bag-of-words_nn': SimpleNeighbors,
     'hashing_nn': HashingNeighbors,
+    'hashing_ann': HashingApproximateNeighbors,
+    'hashing_annoy': HashingAnnoy,
     'noop': Noop,
 }
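
For context, the hashing_nn model that these changes build on works roughly as follows (a minimal sketch, assuming scikit-learn is available; the sample log lines are invented for illustration and this is not code from the patch):

  # Hash baseline lines into sparse binary vectors, index them with a
  # brute-force cosine NearestNeighbors, then score target lines by the
  # distance to their nearest baseline neighbor.
  from sklearn.feature_extraction.text import HashingVectorizer
  from sklearn.neighbors import NearestNeighbors

  baseline = ["Instance created", "Accepted publickey RNGH"]
  target = ["Instance created", "Traceback most recent call last"]

  vectorizer = HashingVectorizer(binary=True, n_features=2**18,
                                 analyzer=str.split, lowercase=False)
  nn = NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=1)

  # HashingVectorizer is stateless, so transform() needs no prior fit
  nn.fit(vectorizer.transform(baseline))
  distances, _ = nn.kneighbors(vectorizer.transform(target))
  for line, dist in zip(target, distances):
      # distances near 0.0 match the baseline; values near 1.0 are
      # likely anomalies
      print("%0.2f %s" % (dist[0], line))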
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/process.py new/logreduce-0.4.0/logreduce/process.py
--- old/logreduce-0.3.0/logreduce/process.py    2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/process.py    2018-11-08 08:39:56.000000000 +0100
@@ -34,7 +34,9 @@
 
 class Classifier:
     log = logging.getLogger("logreduce.Classifier")
-    version = 4
+    # Bump this version when models created with earlier versions
+    # should be rejected
+    version = 5
 
     def __init__(self,
                  model='bag-of-words_nn', exclude_paths=[], exclude_files=[]):
@@ -112,6 +114,13 @@
         # Remove numbers and symbols
         return re.subn(r'[^a-zA-Z\/\._-]*', '', shortfilename)[0]
 
+    @staticmethod
+    def _is_log_classify_invocation(model_name, line):
+        """ Returns True if the line is related to log-classify"""
+        return model_name == "job-output.txt" and (
+            "TASK [log-classify " in line or
+            "TASK [Generate ara report]" in line)
+
     def train(self, baselines, command=sys.argv):
         """Train the model, baselines can be path(s) or build dict(s)"""
         start_time = time.monotonic()
@@ -147,36 +156,29 @@
             model.size = 0
             model.count = 0
             model.uuid = str(uuid.uuid4())
-            # Tokenize and store all lines in train_data
-            train_data = []
+            # Tokenize and store all de-duplicated lines in train_data
+            train_data = set()
             for filename in filenames:
                 self.log.debug("%s: Loading %s" % (model_name, filename))
                 fobj = None
                 try:
                     fobj = open_file(filename)
-                    idx = 0
-                    while True:
-                        line = fobj.readline()
-                        if line == b'':
-                            break
+                    for line in fobj:
                         line = line.decode('ascii', errors='ignore')
                         # Special case to not train ourself
-                        if model_name == "job-output.txt" and (
-                                "TASK [log-classify " in line or
-                                "TASK [Generate ara report]" in line):
+                        if self._is_log_classify_invocation(model_name, line):
                             break
                         # Remove ansible std_lines list now
                         line = remove_ansible_std_lines_lists(line)
                         for sub_line in line.split(r'\r'):
                             sub_line = model.process_line(sub_line)
                             if sub_line:
-                                train_data.append(sub_line)
-                        idx += 1
+                                train_data.add(sub_line)
+                        model.count += 1
                     try:
                         model.size += os.stat(filename).st_size
                     except TypeError:
                         pass
-                    model.count += idx
                 except KeyboardInterrupt:
                     exit(1)
                 except Exception:
@@ -203,14 +205,19 @@
 
             self.training_lines_count += model.count
             self.training_size += model.size
+            train_data_time = time.monotonic() - model_start_time
+            self.log.debug(
+                "%s: Parsing took %s", model_name,
+                format_speed(model.count, model.size, train_data_time))
             try:
                 # Transform and fit the model data
+                train_start_time = time.monotonic()
                 model = self.get(model_name)
                 model.train(train_data)
-                model.train_time = time.monotonic() - model_start_time
+                model.train_time = time.monotonic() - train_start_time
 
-                self.log.debug("%s: %s %s" % (
-                    model_name, model.info,
+                self.log.debug("%s: Fitting took %s" % (
+                    model_name,
                     format_speed(model.count, model.size, model.train_time)))
             except ValueError:
                 self.log.exception("%s: couldn't train with %s" % (model_name,
@@ -291,15 +298,10 @@
             try:
                 fobj = open_file(filename)
                 idx = 0
-                while True:
-                    line = fobj.readline()
-                    if line == b'':
-                        break
+                for line in fobj:
                     line = line.decode('ascii', errors='ignore')
                     # Special case to not test ourself
-                    if model_name == "job-output.txt" and (
-                            "TASK [log-classify " in line or
-                            "TASK [Generate ara report]" in line):
+                    if self._is_log_classify_invocation(model_name, line):
                         break
                     # Remove ansible std_lines list now
                     line = remove_ansible_std_lines_lists(line)
@@ -362,8 +364,7 @@
             outliers = []
             last_outlier = 0
             remaining_after_context = 0
-            line_pos = 0
-            while line_pos < len(data):
+            for line_pos in range(len(data)):
                 distance, line = get_line_info(line_pos)
                 if distance >= self.threshold:
                     if line_pos - last_outlier >= self.merge_distance:
@@ -383,7 +384,6 @@
                     outliers.append((line_pos, distance, line))
                     remaining_after_context -= 1
                     last_outlier = line_pos
-                line_pos += 1
 
             # Yield result
             yield (filename_rel, filename_orig, model, outliers,
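
The net effect of the training-loop hunks above can be summarized in a short sketch (names simplified for illustration; the real code also splits on \r and skips log-classify's own task lines):

  # Consume the file object through the iterator protocol and collect
  # tokenized lines into a set, so duplicates no longer inflate the model.
  def load_train_data(fobj, process_line):
      train_data = set()      # was a list before this change
      count = 0
      for line in fobj:       # was a readline() loop before this change
          line = line.decode('ascii', errors='ignore')
          sub_line = process_line(line)
          if sub_line:
              train_data.add(sub_line)
          count += 1
      return train_data, count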
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/server/api.py new/logreduce-0.4.0/logreduce/server/api.py
--- old/logreduce-0.3.0/logreduce/server/api.py 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/server/api.py 2018-11-08 08:39:56.000000000 +0100
@@ -65,7 +65,8 @@
         """Return the anomalies list"""
         results = []
         with self.db.session() as session:
-            for anomaly in session.query(model.Anomaly):
+            for anomaly in (session.query(model.Anomaly)
+                            .order_by(model.Anomaly.report_date.desc())):
                 results.append({
                     'uuid': anomaly.uuid,
                     'name': anomaly.name,
@@ -75,7 +76,6 @@
                     'build': anomaly.build.toDict()
                 })
         cherrypy.response.headers['Access-Control-Allow-Origin'] = '*'
-        results.reverse()
         return results
 
     def _getAnomaly(self, session, anomaly_id):
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_api.py new/logreduce-0.4.0/logreduce/tests/test_api.py
--- old/logreduce-0.3.0/logreduce/tests/test_api.py     2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_api.py     2018-11-08 08:39:56.000000000 +0100
@@ -22,7 +22,7 @@
 import logreduce.server.client
 import logreduce.server.rpc as rpc
 
-from . utils import fake_build_result
+from . utils import fake_build_result, find_free_port
 
 logging.basicConfig(level=logging.DEBUG)
 
@@ -31,7 +31,7 @@
     @classmethod
     def setup_class(cls):
         cls.tmpfile = tempfile.mkstemp()[1]
-        cls.gearman = {'addr': '0.0.0.0', 'port': 4742}
+        cls.gearman = {'addr': '0.0.0.0', 'port': find_free_port()}
         cls.gear = rpc.Server(**cls.gearman)
         cls.gear.start()
         cls.downloadLog = []
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_download.py new/logreduce-0.4.0/logreduce/tests/test_download.py
--- old/logreduce-0.3.0/logreduce/tests/test_download.py        2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_download.py        2018-11-08 08:39:56.000000000 +0100
@@ -44,4 +44,4 @@
             })
         mock_request.return_value = MockResponse(json.dumps(fake_builds))
        zb = logreduce.download.ZuulBuilds("http://zuul.example.com/api")
-        self.assertEquals(3, len(zb.get(result="SUCCESS")))
+        self.assertEqual(3, len(zb.get(result="SUCCESS")))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_model.py new/logreduce-0.4.0/logreduce/tests/test_model.py
--- old/logreduce-0.3.0/logreduce/tests/test_model.py   2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_model.py   2018-11-08 08:39:56.000000000 +0100
@@ -79,4 +79,4 @@
             anomaly_uuid = self.db.import_report(session, report)
 
             anomaly = session.query(model.Anomaly).get(anomaly_uuid)
-            self.assertEquals("check", anomaly.build.pipeline)
+            self.assertEqual("check", anomaly.build.pipeline)
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/test_units.py new/logreduce-0.4.0/logreduce/tests/test_units.py
--- old/logreduce-0.3.0/logreduce/tests/test_units.py   2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/test_units.py   2018-11-08 08:39:56.000000000 +0100
@@ -17,11 +17,74 @@
 
 
 class TokenizerTests(unittest.TestCase):
+    def check_expected(self, tests):
+        for raw_line, tokens_out in tests.items():
+            self.assertEqual(
+                tokens_out, Tokenizer.process(raw_line))
+
     def test_random_words(self):
         tokens = Tokenizer.process("Created interface: br-42")
         self.assertNotIn("br-42", tokens)
         tokens = Tokenizer.process("Instance 0xdeadbeef42 created")
-        self.assertEquals("Instance created", tokens)
+        self.assertEqual("Instance created", tokens)
+
+    def test_hash_tokenizing(self):
+        self.check_expected({
+            'Accepted publickey: RSA '
+            'SHA256:UkrwIX8QHA4B2Bny0XHyqgSXM7wFMQTEDtT+PpY9Ep4':
+            'Accepted publickey RNGH',
+            # This used to match 'jan' -> DATE
+            'SHA256:FePTgARR5A3kxb2GJa0QAWjanaI2q+TvneBxzHNqbTA zuul@ze03':
+            'RNGH zuul'
+        })
+
+    def test_ipv6_tokenizing(self):
+        self.check_expected({
+            'mysql+pymysql://root:secretdatabase@[::1]/cinder?"':
+            'mysql pymysql //root secretdatabase RNGI /cinder',
+            'listen_port fe80::f816:3eff:fe47:5142':
+            'listen_port RNGI',
+            'listen_port FE80::F816:3eff:fe47:5142':
+            'listen_port RNGI',
+            'listen_port ::8888':
+            'listen_port RNGI'
+        })
+
+    def test_date_non_tokenizing(self):
+        """Tests that should not match the DATE verb"""
+        self.check_expected({
+            'keys randomart image':
+            'keys randomart image',
+            'Start zuul_console daemon':
+            'Start zuul_console daemon',
+        })
+
+    def test_uuid_words(self):
+        self.check_expected({
+            '| 0473427f-f505-4b50-bc70-72fb6d74568a | vmname | SHUTOFF | -   '
+            '       | Shutdown    | fixed=192.168.123.3 |':
+            'RNGU vmname SHUTOFF Shutdown fixed RNGI',
+            '"UndercloudServiceChain-2kbhkd45kcs3-ServiceChain-54rklv3rnxhe" ':
+            'UndercloudServiceChain HEATID ServiceChain HEATID'
+        })
+
+    def test_non_uuid_words(self):
+        self.check_expected({
+            'dnsmasq-dhcp[31216]: DHCPRELEASE':
+            'dnsmasq dhcp DHCPRELEASE',
+        })
+
+    def test_digits_tokenizing(self):
+        self.check_expected({
+            'Started Session 2677 of user root':
+            'Started Session user root',
+            'Instance 0xdeadbeef42 created':
+            'Instance created',
+            'systemd[4552]: Startup finished in 28ms.':
+            'systemd Startup finished',
+            '764928K 33%  469M 3.05s':
+            ''
+        })
 
     def test_filename2modelname(self):
         for fname, modelname in (
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tests/utils.py new/logreduce-0.4.0/logreduce/tests/utils.py
--- old/logreduce-0.3.0/logreduce/tests/utils.py        2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tests/utils.py        2018-11-08 08:39:56.000000000 +0100
@@ -10,6 +10,16 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 
+import socket
+from contextlib import closing
+
+
+def find_free_port():
+    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
+        s.bind(('', 0))
+        return s.getsockname()[1]
+
+
 fake_result = {
     'anomalies_count': 18,
     'baselines': ['test_process.py'],
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/tokenizer.py new/logreduce-0.4.0/logreduce/tokenizer.py
--- old/logreduce-0.3.0/logreduce/tokenizer.py  2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/tokenizer.py  2018-11-08 08:39:56.000000000 +0100
@@ -1,3 +1,5 @@
+# Copyright 2018 Red Hat, Inc.
+# Copyright 2018 SUSE Linux GmbH.
 # Licensed under the Apache License, Version 2.0 (the "License"); you may
 # not use this file except in compliance with the License. You may obtain
 # a copy of the License at
@@ -21,45 +23,14 @@
 
 UUID_RE = r'[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-' \
           '?[0-9a-f]{12}'
-
 IPV4_RE = r'(([01]?[0-9]?[0-9]|2[0-4][0-9]|2[5][0-5])\.){3}' \
           r'([01]?[0-9]?[0-9]|2[0-4][0-9]|2[5][0-5])'
-# TODO: simplify this if possible...
-IPV6_RE = (r'(?:(?:[0-9A-Fa-f]{1,4}:){6}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|'
-           r'1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}(?:[0-9]|[1-9][0-9]|'
-           r'1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'::(?:[0-9A-Fa-f]{1,4}:){5}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}'
-           r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}'
-           r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:(?:[0-9A-Fa-f]{1,4}:){,2}[0-9A-Fa-f]{1,4})?::'
-           r'(?:[0-9A-Fa-f]{1,4}:){2}(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:(?:[0-9A-Fa-f]{1,4}:){,3}[0-9A-Fa-f]{1,4})?::'
-           r'[0-9A-Fa-f]{1,4}:(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:(?:[0-9A-Fa-f]{1,4}:){,4}[0-9A-Fa-f]{1,4})?::'
-           r'(?:[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}'
-           r'(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|'
-           r'(?:(?:[0-9A-Fa-f]{1,4}:){,5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|'
-           r'(?:(?:[0-9A-Fa-f]{1,4}:){,6}[0-9A-Fa-f]{1,4})?::)')
-MAC_RE = r'([0-9A-F]{2}[:-]){5}([0-9A-F]{2})'
+IPV6_RE = r'([0-9A-Fa-f]{0,4}:){2,6}(\d{1,3}\.){0,3}\d{1,3}'
+MAC_RE = r'([0-9a-fA-F]{2}[:-]){5}([0-9a-fA-F]{2})'
 
 
 class Tokenizer:
     rawline_re = re.compile(
-        r'('
         # useless http GET
         r'"GET / HTTP/1.1"'
         r'|"OPTIONS * HTTP/1.0" 200'
@@ -81,35 +52,30 @@
         r'|unix_chkpwd.*: password check failed for user'
         r'|sshd.*: authentication failure'
         r'|sshd.*: Failed password for'
+        r'|sshd.*- POSSIBLE BREAK-IN ATTEMPT'
         # zuul random test
         r'|zuul.*echo BECOME-SUCCESS-'
         r'|^[^ ]{64}$'
         # useless debug statement
         r'|ovs-ofctl .* (dump-ports|dump-flows|show)\b'
         r'|(ip|eb)tables .* -L\b'
-        r')')
-    ip_re = re.compile(r'(%s|%s|%s)' % (IPV4_RE, IPV6_RE, MAC_RE), re.I)
-    power2_re = re.compile(r'([0-9a-f]{128}|[0-9a-f+/]{64}|'
-                           '[0-9a-f]{40}|[0-9a-f]{32})', re.I)
-    uuid_re = re.compile(r'(%s|tx[^ ]{32})' % UUID_RE, re.I)
-    date_re = re.compile('(%s|%s|%s|%s)' % (DAYS, SHORT_DAYS,
-                                            SHORT_MONTHS, MONTHS), re.I)
-    heat_re = re.compile("-[^ -]{12}[- $]", re.I)
-    comments = re.compile(r'([\s]*# |^%% |^#|^[\s]*id = ").*')
-    alpha_re = re.compile(r'[^a-zA-Z_\/\s]')
-    gitver_re = re.compile(r'git[a-z0-9]+', re.I)
-    digits_re = re.compile(r'(0x[0-9a-f]+|[0-9])', re.I)
-    randpath_re = re.compile(r'('
-                             r'/tmp/ansible\.[a-z0-9_]{8}'
-                             r'|/tmp/tmp[a-z0-9_]{6}'
-                             r'|/tmp/tmp.[a-z0-9]{10}'
-                             r')', re.I)
-    gitsha_re = re.compile(r'('
-                           r'[a-z0-9]{7}\.\.[a-z0-9]{7}'
-                           r')', re.I)
-    hash_re = re.compile(r'('
-                         r'SHA256:[a-z0-9+/]{43} '
-                         r')', re.I)
+    )
+    ip_re = re.compile(r'%s|%s|%s' % (IPV4_RE, IPV6_RE, MAC_RE))
+    power2_re = re.compile(r'\b(?:[\w+/]{128}|[\w+/]{64}|'
+                           r'[0-9a-fA-F]{40}|[0-9a-fA-F]{32})\b')
+    uuid_re = re.compile(r'\b(?:%s|tx[^ ]{32})\b' % UUID_RE, re.I)
+    date_re = re.compile(r'\b(?:%s|%s|%s|%s)\b' % (DAYS, SHORT_DAYS,
+                                                   SHORT_MONTHS, MONTHS), re.I)
+    heat_re = re.compile(r'-\w{12}[- \"$]')
+    comments = re.compile(r'(?:[\s]*# |^%% |^#|^[\s]*id = ").*')
+    alpha_re = re.compile(r'[^a-zA-Z_\/\s]+')
+    gitver_re = re.compile(r'git\w+')
+    digits_re = re.compile(r'0x[0-9a-fA-F]{2,}|[0-9]+(?:\.\d+)?')
+    randpath_re = re.compile(r'(?:/tmp/ansible\.\w{8}'
+                             r'|/tmp/tmp\w{6}'
+                             r'|/tmp/tmp\.\w{10})\b')
+    gitsha_re = re.compile(r'\b\w{7}\.\.\w{7}\b')
+    hash_re = re.compile(r'SHA256:[\w+/]{43}\b')
 
     @staticmethod
     def process(line):
@@ -118,24 +84,26 @@
             return ''
         strip = line
         # Remove words that are exactly 32, 64 or 128 character longs
-        strip = Tokenizer.power2_re.subn("RNGN", strip)[0]
+        strip = Tokenizer.power2_re.sub("RNGN", strip)
         # Remove uuid
-        strip = Tokenizer.heat_re.subn(" HEAT ", strip)[0]
-        strip = Tokenizer.uuid_re.subn("RNGU", strip)[0]
-        # Remove date
-        strip = Tokenizer.date_re.subn("DATE", strip)[0]
+        strip = Tokenizer.uuid_re.sub("RNGU", strip)
+        # Remove heat short uuid but keep spacing
+        #  ObjectName-2kbhkd45kcs3-ServiceName -> ObjectName-HEATID-ServiceName
+        strip = Tokenizer.heat_re.sub(" HEATID ", strip)
         # Remove git sha
-        strip = Tokenizer.gitsha_re.subn("RNGG", strip)[0]
+        strip = Tokenizer.gitsha_re.sub("RNGG", strip)
         # Remove hashes
-        strip = Tokenizer.hash_re.subn("RNGH", strip)[0]
+        strip = Tokenizer.hash_re.sub("RNGH", strip)
         # Remove random path
-        strip = Tokenizer.randpath_re.subn("RNGP", strip)[0]
+        strip = Tokenizer.randpath_re.sub("RNGP", strip)
+        # Remove date
+        strip = Tokenizer.date_re.sub("DATE", strip)
         # Remove ip/addr
-        strip = Tokenizer.ip_re.subn("RNGI", strip)[0]
+        strip = Tokenizer.ip_re.sub("RNGI", strip)
         # Remove numbers
-        strip = Tokenizer.digits_re.subn("", strip)[0]
+        strip = Tokenizer.digits_re.sub("", strip)
         # Only keep characters
-        strip = Tokenizer.alpha_re.subn(" ", strip)[0]
+        strip = Tokenizer.alpha_re.sub(" ", strip)
         # Remove tiny words
         strip = " ".join(filter(lambda x: len(x) > 3, strip.split()))
         # Weight failure token
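
The intended behaviour of the reworked tokenizer is easiest to see in the TokenizerTests added earlier; a quick interactive check (assuming logreduce 0.4.0 is installed, with expected outputs taken from those tests) would look like:

  from logreduce.tokenizer import Tokenizer

  Tokenizer.process("Instance 0xdeadbeef42 created")
  # -> 'Instance created'
  Tokenizer.process("listen_port fe80::f816:3eff:fe47:5142")
  # -> 'listen_port RNGI'
  Tokenizer.process("Started Session 2677 of user root")
  # -> 'Started Session user root'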
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce/utils.py new/logreduce-0.4.0/logreduce/utils.py
--- old/logreduce-0.3.0/logreduce/utils.py      2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce/utils.py      2018-11-08 08:39:56.000000000 +0100
@@ -18,10 +18,11 @@
 import sqlite3
 import zlib
 import json
+import datetime
+import time
+
 try:
     from systemd import journal
-    import datetime
-    import time
     journal_installed = True
 except ImportError:
     journal_installed = False
@@ -42,6 +43,7 @@
     "etc/systemd/",
     "etc/polkit-1/",
     "etc/pki/",
+    "etc/swift/.*\.builder",
     "group_vars/all.yaml",
     "keystone/credential-keys",
     "keystone/fernet-keys",
@@ -100,35 +102,36 @@
 ]
 
 BLACKLIST_EXTENSIONS = (
-    ".sqlite",
-    ".svg",
-    ".woff",
-    ".ttf",
+    ".conf",
+    ".conf.txt",
+    ".crt",
+    ".csr",
     ".css",
-    ".js",
     ".db",
     ".ico",
+    ".journal",
+    ".js",
+    ".json",
+    ".json.txt",
+    "_key",
+    ".key",
+    ".pem",
     ".png",
-    ".tgz",
     ".pyc",
     ".pyo",
-    ".so",
-    ".key",
-    "_key",
-    ".crt",
-    ".csr",
-    ".pem",
+    "ring.gz",
     ".rpm",
+    ".so",
+    ".sqlite",
     ".subunit",
-    ".journal",
-    ".json",
-    ".json.txt",
-    ".yaml.txt",
-    ".conf",
-    ".conf.txt",
+    ".svg",
+    ".tgz",
+    ".ttf",
+    ".woff",
+    ".xml",
     ".yaml",
+    ".yaml.txt",
     ".yml",
-    "ring.gz",
 )
 
 FACILITY2NAME = {
@@ -190,11 +193,14 @@
         self.journal.close()
         del self.journal
 
-    def readline(self):
+    def __iter__(self):
+        return self
+
+    def __next__(self):
         entry = self.journal.get_next()
         ts = entry.get('__REALTIME_TIMESTAMP', datetime.datetime(1970, 1, 1))
         if not entry or (self.until and ts.timestamp() > self.until):
-            return b''
+            raise StopIteration
         facility = entry.get('SYSLOG_FACILITY')
         if isinstance(facility, int):
             entry['LEVEL'] = FACILITY2NAME.get(facility, 'NOTI').upper()
@@ -216,9 +222,12 @@
         self.lines = []
         self.idx = 0
 
-    def readline(self):
+    def __iter__(self):
+        return self
+
+    def __next__(self):
         if self.idx >= len(self.lines):
-            return b''
+            raise StopIteration
         self.idx += 1
         return self.lines[self.idx - 1].encode('utf-8')
 
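This readline()-to-iterator change is what enables the plain "for line in fobj:" loops used in process.py and the debug scripts: a reader only needs __iter__() returning itself and __next__() raising StopIteration at end of input. A toy reader following the same protocol (not part of logreduce itself):

  class Lines:
      def __init__(self, lines):
          self.lines = lines
          self.idx = 0

      def __iter__(self):
          return self

      def __next__(self):
          if self.idx >= len(self.lines):
              raise StopIteration
          self.idx += 1
          return self.lines[self.idx - 1].encode('utf-8')

  for line in Lines(["a", "b"]):
      print(line)  # prints b'a', then b'b'
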
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/PKG-INFO new/logreduce-0.4.0/logreduce.egg-info/PKG-INFO
--- old/logreduce-0.3.0/logreduce.egg-info/PKG-INFO     2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/PKG-INFO     2018-11-08 08:40:11.000000000 +0100
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: logreduce
-Version: 0.3.0
+Version: 0.4.0
 Summary: Extract anomalies from log files
 Home-page: https://logreduce.softwarefactory-project.io/
 Author: Tristan Cacqueray
@@ -52,6 +52,18 @@
           python3 setup.py develop --user
           popd
         
+        
+        * openSUSE:
+        
+        .. code-block:: console
+        
+          sudo zypper install python3-scikit-learn
+          git clone https://softwarefactory-project.io/r/logreduce
+          pushd logreduce
+          python3 setup.py develop --user
+          popd
+        
+        
         * Pip:
         
         .. code-block:: console
@@ -159,7 +171,7 @@
         * logreduce-server: the REST and Gearman server
         * logreduce-worker: job executor
         * logreduce-client: client cli
-        * logreduce-ui: web ui
+        * logreduce-webui: logreduce web interface
         
         API
         ...
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/SOURCES.txt new/logreduce-0.4.0/logreduce.egg-info/SOURCES.txt
--- old/logreduce-0.3.0/logreduce.egg-info/SOURCES.txt  2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/SOURCES.txt  2018-11-08 08:40:11.000000000 +0100
@@ -11,7 +11,7 @@
 tox.ini
 doc/conf.py
 doc/index.rst
-etc/httpd/logreduce.conf
+etc/httpd/log-classify.conf
 etc/logreduce/config.yaml
 etc/systemd/logreduce-server.service
 etc/systemd/logreduce-worker.service
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce.egg-info/pbr.json new/logreduce-0.4.0/logreduce.egg-info/pbr.json
--- old/logreduce-0.3.0/logreduce.egg-info/pbr.json     2018-10-25 11:23:59.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.egg-info/pbr.json     2018-11-08 08:40:11.000000000 +0100
@@ -1 +1 @@
-{"git_version": "a7a1da5", "is_release": true}
\ No newline at end of file
+{"git_version": "aa49628", "is_release": true}
\ No newline at end of file
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/logreduce.spec new/logreduce-0.4.0/logreduce.spec
--- old/logreduce-0.3.0/logreduce.spec  2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/logreduce.spec  2018-11-08 08:39:56.000000000 +0100
@@ -2,7 +2,7 @@
 %{!?scl:%global pkg_name %{name}}
 
 Name:           %{?scl_prefix}logreduce
-Version:        0.1.0
+Version:        0.3.0
 Release:        2%{?dist}
 Summary:        Extract anomalies from log files
 
@@ -32,7 +32,7 @@
 
 %package server
 Summary:        The logreduce server
-Requires:       %{?scl_prefix}logreduce
+Requires:       %{?scl_prefix}logreduce = %version
 Requires:       %{?scl_prefix}python-alembic
 Requires:       %{?scl_prefix}python-sqlalchemy
 Requires:       %{?scl_prefix}python-cherrypy
@@ -46,7 +46,7 @@
 
 %package worker
 Summary:        The logreduce worker
-Requires:       %{?scl_prefix}logreduce
+Requires:       %{?scl_prefix}logreduce = %version
 Requires:       %{?scl_prefix}python-gear
 
 %description worker
@@ -70,11 +70,17 @@
 %{?scl:scl enable %{scl} - << \EOF}
 PBR_VERSION=%{version} %{__python3} setup.py build
 %{?scl:EOF}
+# TODO: make this replace conditional only when SCL is enabled
 sed -e 's#/var/lib/logreduce#/var/opt/rh/rh-python35/lib/logreduce#' \
     -e 's#/var/log/logreduce#/var/opt/rh/rh-python35/log/logreduce#' \
     -i etc/logreduce/config.yaml
 sed -e 's#/usr/share/#/opt/rh/rh-python35/root/usr/share/#' \
-    -i etc/httpd/logreduce.conf
+    -i etc/httpd/log-classify.conf
+sed -e 's#/usr/bin/#/opt/rh/rh-python35/root/usr/bin/#'        \
+    -e 's#/etc/logreduce/#/etc/opt/rh/rh-python35/logreduce/#' \
+    -e 's#^ExecStart#EnvironmentFile=-/etc/opt/rh/rh-python35/sysconfig/enable-py3\nExecStart#' \
+    -i etc/systemd/logreduce-server.service etc/systemd/logreduce-worker.service
+
 pushd web
 ln -s /opt/patternfly-react-ui-deps/node_modules/ node_modules
 PUBLIC_URL="/log-classify/" ./node_modules/.bin/yarn build
@@ -90,27 +96,17 @@
 install -p -D -m 0644 etc/systemd/logreduce-server.service %{buildroot}%{_unitdir}/%{?scl_prefix}logreduce-server.service
 install -p -D -m 0644 etc/systemd/logreduce-worker.service %{buildroot}%{_unitdir}/%{?scl_prefix}logreduce-worker.service
 install -p -D -m 0644 etc/logreduce/config.yaml %{buildroot}%{_sysconfdir}/logreduce/config.yaml
-install -p -D -m 0644 etc/httpd/logreduce.conf %{buildroot}/etc/httpd/conf.d/logreduce.conf
+install -p -D -m 0644 etc/httpd/log-classify.conf %{buildroot}/etc/httpd/conf.d/log-classify.conf
 install -p -d -m 0700 %{buildroot}%{_sharedstatedir}/logreduce
 install -p -d -m 0700 %{buildroot}%{_localstatedir}/log/logreduce
-install -p -d -m 0755 %{buildroot}/var/www/logreduce/anomalies
-install -p -d -m 0755 %{buildroot}/var/www/logreduce/logs
-
+install -p -d -m 0755 %{buildroot}/var/www/log-classify/anomalies
+install -p -d -m 0755 %{buildroot}/var/www/log-classify/logs
 
-%pre server
-getent group logreduce >/dev/null || groupadd -r logreduce
-if ! getent passwd logreduce >/dev/null; then
-  useradd -r -g logreduce -G logreduce -d %{_sharedstatedir}/logreduce -s /sbin/nologin -c "Logreduce Daemon" logreduce
-fi
-exit 0
 
-%pre worker
+%pre
 getent group logreduce >/dev/null || groupadd -r logreduce
-if ! getent passwd logreduce >/dev/null; then
+getent passwd logreduce >/dev/null || \
   useradd -r -g logreduce -G logreduce -d %{_sharedstatedir}/logreduce -s /sbin/nologin -c "Logreduce Daemon" logreduce
-fi
-exit 0
-
 
 %post server
 %systemd_post %{?scl_prefix}logreduce-server.service
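
The two copies of the user-creation scriptlet (%pre server and %pre worker) are folded into a single %pre using the usual getent-or-create idiom, so both subpackages share one idempotent path. As a minimal sketch of the same pattern outside RPM (the function and names are illustrative, not from the package):

    import grp
    import pwd
    import subprocess

    def ensure_system_user(name, home):
        """Create a system group and user only if missing; safe to re-run."""
        try:
            grp.getgrnam(name)
        except KeyError:
            subprocess.run(["groupadd", "-r", name], check=True)
        try:
            pwd.getpwnam(name)
        except KeyError:
            subprocess.run(
                ["useradd", "-r", "-g", name, "-d", home,
                 "-s", "/sbin/nologin", "-c", "Logreduce Daemon", name],
                check=True)

    # ensure_system_user("logreduce", "/var/lib/logreduce")  # needs root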
@@ -127,7 +123,6 @@
 %postun worker
 %systemd_postun %{?scl_prefix}logreduce-worker.service
 
-
 %files
 %license LICENSE
 %doc README.rst
@@ -140,10 +135,10 @@
 
 %files server
 %{_bindir}/logreduce-server
-%config(noreplace) /etc/httpd/conf.d/logreduce.conf
+%config(noreplace) /etc/httpd/conf.d/log-classify.conf
 %{_unitdir}/%{?scl_prefix}logreduce-server.service
-%dir %attr(0755, logreduce, logreduce) /var/www/logreduce/logs
-%dir %attr(0755, logreduce, logreduce) /var/www/logreduce/anomalies
+%dir %attr(0755, logreduce, logreduce) /var/www/log-classify/logs
+%dir %attr(0755, logreduce, logreduce) /var/www/log-classify/anomalies
 
 %files worker
 %{_bindir}/logreduce-worker
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/roles/log-classify/defaults/main.yaml new/logreduce-0.4.0/roles/log-classify/defaults/main.yaml
--- old/logreduce-0.3.0/roles/log-classify/defaults/main.yaml   2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/roles/log-classify/defaults/main.yaml   2018-11-08 08:39:56.000000000 +0100
@@ -28,7 +28,7 @@
 # Process console-log
 logclassify_console: true
 # Process ara ansible.sqlite
-logclassify_ara_databae: false
+logclassify_ara_database: false
 
 # Include paths from baseline logs
 logclassify_logserver_dir: ""
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_binsize.py new/logreduce-0.4.0/scripts/debug_binsize.py
--- old/logreduce-0.3.0/scripts/debug_binsize.py        2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_binsize.py        2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may
 # not use this file except in compliance with the License. You may obtain
@@ -16,8 +16,9 @@
 
 import sys
 from logreduce.utils import files_iterator, open_file
-from logreduce.models import Classifier, Model
-from logreduce.models import remove_ansible_std_lines_lists
+from logreduce.process import Classifier
+from logreduce.models import Model
+from logreduce.tokenizer import remove_ansible_std_lines_lists
 
 try:
     path = sys.argv[1]
@@ -32,17 +33,14 @@
     bag_name = Classifier.filename2modelname(filename_rel)
     groups.setdefault(bag_name, []).append(filename)
 
-model = Model()
+model = Model(bag_name)
 for group_name, files in sorted(groups.items()):
     for filename in files:
         fobj = None
         try:
             fobj = open_file(filename)
             idx = 0
-            while True:
-                line = fobj.readline()
-                if line == b'':
-                    break
+            for line in fobj:
                 line = line.decode('ascii', errors='ignore')
                 # Remove ansible std_lines list now
                 line = remove_ansible_std_lines_lists(line)
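
Besides the module reshuffle (Classifier now imports from logreduce.process, the ansible helper from logreduce.tokenizer), the script replaces the handcrafted readline() loop with direct iteration, which presumes logreduce's file-like objects now implement the iterator protocol. A sketch of what that minimally requires, assuming a wrapper class not shown in this diff:

    class FileLikeWrapper:
        """Minimal iterator protocol over a wrapped binary stream."""

        def __init__(self, fileobj):
            self.fileobj = fileobj

        def readline(self):
            return self.fileobj.readline()

        def __iter__(self):
            return self

        def __next__(self):
            line = self.readline()
            if line == b'':      # readline() returns b'' at EOF
                raise StopIteration
            return line

    # With this in place, `for line in fobj:` replaces the old
    # `while True: ... readline() ... break` construct above.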
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_filename2modelname.py new/logreduce-0.4.0/scripts/debug_filename2modelname.py
--- old/logreduce-0.3.0/scripts/debug_filename2modelname.py     2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_filename2modelname.py     2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may
 # not use this file except in compliance with the License. You may obtain
@@ -16,7 +16,7 @@
 
 import sys
 from logreduce.utils import files_iterator
-from logreduce.models import Classifier
+from logreduce.process import Classifier
 
 try:
     path = sys.argv[1]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/scripts/debug_lineprocess.py new/logreduce-0.4.0/scripts/debug_lineprocess.py
--- old/logreduce-0.3.0/scripts/debug_lineprocess.py    2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/scripts/debug_lineprocess.py    2018-11-08 08:39:56.000000000 +0100
@@ -1,4 +1,4 @@
-#!/bin/env python3
+#!/usr/bin/env python3
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may
 # not use this file except in compliance with the License. You may obtain
@@ -14,15 +14,33 @@
 
 """Script to debug line tokenization"""
 
+from collections import Counter
 import sys
+
 from logreduce.tokenizer import Tokenizer
 
 try:
     path = sys.argv[1]
 except IndexError:
-    print("usage: %s file" % sys.argv[0])
+    print("usage: %s [file]..." % sys.argv[0])
     exit(1)
 
-for line in open(path).readlines():
-    print(line[:-1])
-    print("-> %s" % Tokenizer.process(line))
+tokens_c = Counter()
+word_c = Counter()
+line_set = set()
+for path in sys.argv[1:]:
+    for line in open(path):
+        word_c.update(line.split())
+        tokens = Tokenizer.process(line)
+        tokens_c.update(tokens.split())
+        line = line.rstrip()
+        if line not in line_set and (line != tokens):
+            line_set.add(line)
+            print("  ", line)
+            print("-> %s" % tokens)
+
+print("Total words: %d Total Tokens: %d" % (
+        len(word_c), len(tokens_c)))
+
+print("Top 10 words: %s", word_c.most_common(10))
+print("Top 10 Tokens: %s", tokens_c.most_common(10))
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/test-requirements.txt new/logreduce-0.4.0/test-requirements.txt
--- old/logreduce-0.3.0/test-requirements.txt   2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/test-requirements.txt   2018-11-08 08:39:56.000000000 +0100
@@ -1,2 +1,3 @@
 pytest
 mock
+systemd-python
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/tox.ini new/logreduce-0.4.0/tox.ini
--- old/logreduce-0.3.0/tox.ini 2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/tox.ini 2018-11-08 08:39:56.000000000 +0100
@@ -1,15 +1,16 @@
 [tox]
-envlist = py35,pep8
+envlist = py35,py36,py37,pep8
 minversion = 1.6
 skipsdist = True
 sitepackages = True
 
 [testenv]
-sitepackages = True
 usedevelop = True
 deps = -rtest-requirements.txt
 commands = py.test -v
 
 [testenv:pep8]
-deps = flake8
-commands = flake8-3 --ignore=E26,E501,E251,E225,E722 logreduce
+basepython = python3
+sitepackages = False
+deps = flake8<3.6.0
+commands = flake8 --ignore=E26,E501,E251,E225,E722 logreduce
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/logreduce-0.3.0/web/src/pages/UserReport.jsx new/logreduce-0.4.0/web/src/pages/UserReport.jsx
--- old/logreduce-0.3.0/web/src/pages/UserReport.jsx    2018-10-25 11:23:44.000000000 +0200
+++ new/logreduce-0.4.0/web/src/pages/UserReport.jsx    2018-11-08 08:39:56.000000000 +0100
@@ -97,7 +97,7 @@
       <Grid>
         <h2>Report a new build to be analyzed</h2>
         <p>Use the form bellow to report a Zuul build and trigger an automated
-        analyzis</p>
+        analyzes</p>
         <hr />
         <Form horizontal>
           <FormGroup controlId='name'>
@@ -118,7 +118,7 @@
             <Col sm={9}>
               <FormControl type='text' inputRef={i => this.reporter = i}/>
               <HelpBlock>
-                {'Enter your name like "irc-name" or "email address"'}
+                {'Enter your name like "IRC nick" or "Email address"'}
               </HelpBlock>
             </Col>
           </FormGroup>
@@ -157,7 +157,7 @@
                 ))}
               </FormControl>
               <HelpBlock>
-                Those are known Zuul API endpoints to query build informations.
+                Those are known Zuul API endpoints to query build information.
               </HelpBlock>
             </Col>
           </FormGroup>

