[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-19 Thread Aiden Grossman via cfe-commits

https://github.com/boomanaiden154 closed 
https://github.com/llvm/llvm-project/pull/72319
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-17 Thread Aiden Grossman via cfe-commits


@@ -0,0 +1,6 @@
+# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

boomanaiden154 wrote:

It looks like nearly all the other Python tests in the monorepo are lit tests. 
I'll work on converting these tests to lit tests later today. That should be 
feasible, since we're essentially just using Python tooling to validate where 
files end up, which is easy to do in lit.

https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-17 Thread Aiden Grossman via cfe-commits


@@ -0,0 +1,6 @@
+# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

boomanaiden154 wrote:

Or, probably better for structuring, we can use `mlgo/mlgo/corpus`, so that the 
package is accessed as `mlgo.corpus` while everything stays together if we want 
to add more in the future. I'll switch to that for now.
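
A minimal sketch of how that layout would import, assuming the outer `mlgo` 
directory is the project root placed on the import path and the inner `mlgo` 
is the package. The `DESCRIPTION` attribute and the helper names are 
hypothetical and only there to make the example self-contained:

```python
import importlib
import pathlib
import sys
import tempfile


def make_layout(root: pathlib.Path) -> pathlib.Path:
    """Create <root>/mlgo/mlgo/corpus/ as packages and return the project dir."""
    project_dir = root / "mlgo"  # outer directory: project root
    corpus_dir = project_dir / "mlgo" / "corpus"  # inner mlgo: the package
    corpus_dir.mkdir(parents=True)
    (project_dir / "mlgo" / "__init__.py").write_text("")
    # Hypothetical attribute so the imported subpackage has visible content.
    (corpus_dir / "__init__.py").write_text("DESCRIPTION = 'corpus tooling'\n")
    return project_dir


def import_corpus_from_layout() -> str:
    """Build the layout in a temp dir, import mlgo.corpus, return its name."""
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = make_layout(pathlib.Path(tmp))
        sys.path.insert(0, str(project_dir))
        try:
            importlib.invalidate_caches()
            corpus = importlib.import_module("mlgo.corpus")
            return corpus.__name__
        finally:
            sys.path.remove(str(project_dir))


print(import_corpus_from_layout())
```

Anything added later would sit next to `corpus` under the inner `mlgo` 
directory and be imported the same way.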

https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-17 Thread Aiden Grossman via cfe-commits


@@ -0,0 +1,12 @@
+# MLGO Python Library
+
+This folder contains the MLGO python library. This library consists of telling

boomanaiden154 wrote:

Updated it to call this the folder for MLGO Python Utilities. Good catch on 
the first line. Not sure exactly what happened there.

https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-17 Thread Aiden Grossman via cfe-commits

boomanaiden154 wrote:

> Would it be also possible to remove the dependency on 
> [Abseil](https://github.com/abseil/abseil-py)? None of the existing scripts 
> in LLVM use it and I don't think we should be introducing this dependency. It 
> looks like Abseil is only used for flag parsing, logging and testing; those 
> should be straightforward to replace with standard libraries like `argparse`, 
> `logging` or `unittest`.

Yes, I also plan to remove the dependency on abseil. The idea was to land this 
with all the infrastructure set up and the code copied over essentially 
verbatim, and then remove the abseil dependency in a follow-up patch so that 
the different pieces get reviewed appropriately.
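
As a rough sketch of what that follow-up could look like for the 
`combine_training_corpus.py` entry point, using only `argparse` and `logging` 
from the standard library. The `--root_dir` flag and its help text come from 
the current script; `parse_args` and the stand-in `combine_corpus` body are 
assumptions for illustration, not the actual follow-up patch:

```python
import argparse
import logging
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the single --root_dir flag that absl.flags defines today."""
    parser = argparse.ArgumentParser(
        description="Combine multiple training corpora into a single corpus."
    )
    parser.add_argument(
        "--root_dir",
        type=str,
        default="",
        help="root dir of module paths to combine.",
    )
    # Unlike absl, no explicit check for extra arguments is needed:
    # argparse exits with a usage error on unrecognized arguments.
    return parser.parse_args(argv)


def combine_corpus(root_dir: str) -> None:
    """Stand-in (assumption) for combine_training_corpus_lib.combine_corpus."""
    logging.info("combining corpora under %s", root_dir)


def main(argv: Optional[Sequence[str]] = None) -> None:
    args = parse_args(argv)
    logging.basicConfig(level=logging.INFO)
    combine_corpus(args.root_dir)
```

A real script would call `main()` under the usual `if __name__ == '__main__':` 
guard; passing `argv` explicitly, as in `main(['--root_dir=/path/to/corpus'])`, 
keeps the parsing easy to exercise from `unittest`.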

https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-16 Thread Mircea Trofin via cfe-commits

https://github.com/mtrofin edited 
https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-14 Thread Aiden Grossman via cfe-commits

boomanaiden154 wrote:

After this lands, my plan is to work on getting CI up and running, both to run 
testing and also to publish the package.

https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-14 Thread Aiden Grossman via cfe-commits

https://github.com/boomanaiden154 ready_for_review 
https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-14 Thread Aiden Grossman via cfe-commits

https://github.com/boomanaiden154 edited 
https://github.com/llvm/llvm-project/pull/72319


[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-14 Thread Aiden Grossman via cfe-commits

https://github.com/boomanaiden154 updated 
https://github.com/llvm/llvm-project/pull/72319

From c3f723c8a975cc5e075d56350645b0be486f3cda Mon Sep 17 00:00:00 2001
From: Aiden Grossman 
Date: Tue, 14 Nov 2023 14:20:24 -0800
Subject: [PATCH 1/2] [MLGO] Upstream the corpus extraction tooling

---
 llvm/py/Pyproject.toml                        |   1 +
 llvm/py/src/mlgo/combine_training_corpus.py   |  55 +++
 .../src/mlgo/combine_training_corpus_lib.py   |  50 +++
 .../src/mlgo/combine_training_corpus_test.py  | 104 +
 llvm/py/src/mlgo/extract_ir.py                | 142 +++
 llvm/py/src/mlgo/extract_ir_lib.py            | 373 ++
 llvm/py/src/mlgo/extract_ir_test.py           | 231 +++
 llvm/py/src/mlgo/make_corpus.py               |  58 +++
 llvm/py/src/mlgo/make_corpus_lib.py           |  90 +
 llvm/py/src/mlgo/make_corpus_test.py          |  66 
 10 files changed, 1170 insertions(+)
 create mode 100644 llvm/py/Pyproject.toml
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus.py
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus_lib.py
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus_test.py
 create mode 100644 llvm/py/src/mlgo/extract_ir.py
 create mode 100644 llvm/py/src/mlgo/extract_ir_lib.py
 create mode 100644 llvm/py/src/mlgo/extract_ir_test.py
 create mode 100644 llvm/py/src/mlgo/make_corpus.py
 create mode 100644 llvm/py/src/mlgo/make_corpus_lib.py
 create mode 100644 llvm/py/src/mlgo/make_corpus_test.py

diff --git a/llvm/py/Pyproject.toml b/llvm/py/Pyproject.toml
new file mode 100644
index 00..dcf2c804da5e19
--- /dev/null
+++ b/llvm/py/Pyproject.toml
@@ -0,0 +1 @@
+# Placeholder
diff --git a/llvm/py/src/mlgo/combine_training_corpus.py b/llvm/py/src/mlgo/combine_training_corpus.py
new file mode 100644
index 00..94ee1cbac9cea4
--- /dev/null
+++ b/llvm/py/src/mlgo/combine_training_corpus.py
@@ -0,0 +1,55 @@
+# coding=utf-8
+# Copyright 2020 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+r"""Combine multiple training corpus into a single training corpus.
+
+Currently only support the case that multiple corpus share the same
+configurables except the "modules" field.
+
+Usage: we'd like to combine training corpus corpus1 and corpus2 into
+combinedcorpus; we first structure the files as follows:
+
+combinedcorpus
+combinedcorpus/corpus1
+combinedcorpus/corpus2
+
+Running this script with
+
+python3 \
+compiler_opt/tools/combine_training_corpus.py \
+  --root_dir=$PATH_TO_combinedcorpus
+
+generates combinedcorpus/corpus_description.json file. In this way corpus1
+and corpus2 are combined into combinedcorpus.
+"""
+
+from absl import app
+from absl import flags
+
+from compiler_opt.tools import combine_training_corpus_lib
+
+flags.DEFINE_string('root_dir', '', 'root dir of module paths to combine.')
+
+FLAGS = flags.FLAGS
+
+
+def main(argv):
+  if len(argv) > 1:
+    raise app.UsageError('Too many command-line arguments.')
+
+  combine_training_corpus_lib.combine_corpus(FLAGS.root_dir)
+
+
+if __name__ == '__main__':
+  app.run(main)
diff --git a/llvm/py/src/mlgo/combine_training_corpus_lib.py b/llvm/py/src/mlgo/combine_training_corpus_lib.py
new file mode 100644
index 00..0359961266a240
--- /dev/null
+++ b/llvm/py/src/mlgo/combine_training_corpus_lib.py
@@ -0,0 +1,50 @@
+# coding=utf-8
+# Copyright 2020 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Library for combining training corpora."""
+
+import os
+import json
+
+from absl import logging
+
+import tensorflow as tf
+
+_FILE_NAME = 'corpus_description.json'
+
+
+def combine_corpus(root_dir: str) -> None:
+  module_names = []
+  output_corpus_description = {}
+
+  corpus_description_glob = os.path.join(root_dir, '*/' + _FILE_NAME)
+  for corpus_description_path in tf.io.gfile.glob(corpus_description_glob):
+    logging.info('processing %s', corpus_description_path)
+
+    with 

[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)

2024-01-14 Thread Aiden Grossman via cfe-commits

https://github.com/boomanaiden154 updated 
https://github.com/llvm/llvm-project/pull/72319
