[clang-tools-extra] [llvm] [MLGO] Upstream the corpus extraction tooling (PR #72319)
https://github.com/boomanaiden154 closed https://github.com/llvm/llvm-project/pull/72319
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
@@ -0,0 +1,6 @@
+# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

boomanaiden154 wrote:

Looks like all the other Python tests within the monorepo are pretty much lit tests. I'll work on converting these tests to lit tests later today. That should be feasible, since we're essentially just using Python tooling to validate where files are, which we can easily do in lit.

https://github.com/llvm/llvm-project/pull/72319
@@ -0,0 +1,6 @@
+# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

boomanaiden154 wrote:

Or, probably better for structuring, we can use `mlgo/mlgo/corpus`, so that the package is accessed as `mlgo.corpus` while still keeping everything together if we want to add more in the future. I'll switch to that for now.

https://github.com/llvm/llvm-project/pull/72319
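To make the proposed layout concrete, here is a minimal sketch of how a `mlgo/mlgo/corpus` directory could end up importable as `mlgo.corpus`. It assumes the outer `mlgo/` is the project directory and the inner `mlgo/` is the importable package; the empty `__init__.py` files and the temporary directory are illustrative assumptions, not part of the patch.

```python
import importlib
import os
import sys
import tempfile


def make_layout(root):
  """Creates <root>/mlgo/mlgo/corpus and returns the outer project dir."""
  corpus_dir = os.path.join(root, 'mlgo', 'mlgo', 'corpus')
  os.makedirs(corpus_dir)
  # Assumption: both the inner mlgo/ package and its corpus/ subpackage
  # carry an (empty) __init__.py.
  for package_dir in (os.path.dirname(corpus_dir), corpus_dir):
    with open(os.path.join(package_dir, '__init__.py'), 'w') as f:
      f.write('')
  return os.path.join(root, 'mlgo')


project_dir = make_layout(tempfile.mkdtemp())
# Putting the outer directory on sys.path makes the inner one a top-level
# package, so the subpackage resolves as mlgo.corpus.
sys.path.insert(0, project_dir)
importlib.invalidate_caches()
corpus = importlib.import_module('mlgo.corpus')
print(corpus.__name__)  # prints "mlgo.corpus"
```

With this shape, further subpackages could sit next to `corpus/` under the inner `mlgo/` without changing how existing imports look.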
@@ -0,0 +1,12 @@
+# MLGO Python Library
+
+This folder contains the MLGO python library. This library consists of telling

boomanaiden154 wrote:

Updated it to call this the folder for MLGO Python Utilities. Good catch on the first line; I'm not sure exactly what happened there.

https://github.com/llvm/llvm-project/pull/72319
boomanaiden154 wrote:

> Would it be also possible to remove the dependency on
> [Abseil](https://github.com/abseil/abseil-py)? None of the existing scripts
> in LLVM use it and I don't think we should be introducing this dependency. It
> looks like Abseil is only used for flag parsing, logging and testing; those
> should be straightforward to replace with standard libraries like `argparse`,
> `logging` or `unittest`.

Yes, my plan is to remove the dependency on Abseil as well. I intend to land this patch with all the infrastructure set up and the code copied over essentially unchanged, and then remove the Abseil dependency in a follow-up patch so that the different pieces get reviewed appropriately.

https://github.com/llvm/llvm-project/pull/72319
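For illustration only, the replacement suggested in the quoted review might look roughly like the sketch below for the `combine_training_corpus.py` entry point. This is not the follow-up patch: `parse_args` is a hypothetical helper, and `combine_corpus` here is a stand-in stub rather than the library function from the PR. Only the `--root_dir` flag and its help text are taken from the patch.

```python
import argparse
import logging


def parse_args(argv=None):
  """Parses the single --root_dir flag with argparse instead of absl.flags."""
  parser = argparse.ArgumentParser(
      description='Combine multiple training corpora into one.')
  parser.add_argument(
      '--root_dir',
      type=str,
      default='',
      help='root dir of module paths to combine.')
  # argparse rejects unexpected positional arguments by itself, which covers
  # the explicit "Too many command-line arguments" check in the absl version.
  return parser.parse_args(argv)


def combine_corpus(root_dir):
  # Hypothetical stand-in for combine_training_corpus_lib.combine_corpus.
  logging.info('would combine corpora under %s', root_dir)
  return root_dir


def main(argv=None):
  # The standard logging module takes the place of absl.logging.
  logging.basicConfig(level=logging.INFO)
  args = parse_args(argv)
  return combine_corpus(args.root_dir)
```

A script using this shape would call `main()` under the usual `if __name__ == '__main__':` guard, and passing an explicit list such as `main(['--root_dir=/tmp/combinedcorpus'])` keeps the function easy to exercise from `unittest`.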
https://github.com/mtrofin edited https://github.com/llvm/llvm-project/pull/72319
boomanaiden154 wrote:

After this lands, my plan is to work on getting CI up and running, both to run testing and also to publish the package.

https://github.com/llvm/llvm-project/pull/72319
https://github.com/boomanaiden154 ready_for_review https://github.com/llvm/llvm-project/pull/72319
https://github.com/boomanaiden154 edited https://github.com/llvm/llvm-project/pull/72319
https://github.com/boomanaiden154 updated https://github.com/llvm/llvm-project/pull/72319

>From c3f723c8a975cc5e075d56350645b0be486f3cda Mon Sep 17 00:00:00 2001
From: Aiden Grossman
Date: Tue, 14 Nov 2023 14:20:24 -0800
Subject: [PATCH 1/2] [MLGO] Upstream the corpus extraction tooling

---
 llvm/py/Pyproject.toml                        |   1 +
 llvm/py/src/mlgo/combine_training_corpus.py   |  55 +++
 .../src/mlgo/combine_training_corpus_lib.py   |  50 +++
 .../src/mlgo/combine_training_corpus_test.py  | 104 +
 llvm/py/src/mlgo/extract_ir.py                | 142 +++
 llvm/py/src/mlgo/extract_ir_lib.py            | 373 ++
 llvm/py/src/mlgo/extract_ir_test.py           | 231 +++
 llvm/py/src/mlgo/make_corpus.py               |  58 +++
 llvm/py/src/mlgo/make_corpus_lib.py           |  90 +
 llvm/py/src/mlgo/make_corpus_test.py          |  66
 10 files changed, 1170 insertions(+)
 create mode 100644 llvm/py/Pyproject.toml
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus.py
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus_lib.py
 create mode 100644 llvm/py/src/mlgo/combine_training_corpus_test.py
 create mode 100644 llvm/py/src/mlgo/extract_ir.py
 create mode 100644 llvm/py/src/mlgo/extract_ir_lib.py
 create mode 100644 llvm/py/src/mlgo/extract_ir_test.py
 create mode 100644 llvm/py/src/mlgo/make_corpus.py
 create mode 100644 llvm/py/src/mlgo/make_corpus_lib.py
 create mode 100644 llvm/py/src/mlgo/make_corpus_test.py

diff --git a/llvm/py/Pyproject.toml b/llvm/py/Pyproject.toml
new file mode 100644
index 00..dcf2c804da5e19
--- /dev/null
+++ b/llvm/py/Pyproject.toml
@@ -0,0 +1 @@
+# Placeholder
diff --git a/llvm/py/src/mlgo/combine_training_corpus.py b/llvm/py/src/mlgo/combine_training_corpus.py
new file mode 100644
index 00..94ee1cbac9cea4
--- /dev/null
+++ b/llvm/py/src/mlgo/combine_training_corpus.py
@@ -0,0 +1,55 @@
+# coding=utf-8
+# Copyright 2020 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+r"""Combine multiple training corpus into a single training corpus.
+
+Currently only support the case that multiple corpus share the same
+configurables except the "modules" field.
+
+Usage: we'd like to combine training corpus corpus1 and corpus2 into
+combinedcorpus; we first structure the files as follows:
+
+combinedcorpus
+combinedcorpus/corpus1
+combinedcorpus/corpus2
+
+Running this script with
+
+python3 \
+compiler_opt/tools/combine_training_corpus.py \
+  --root_dir=$PATH_TO_combinedcorpus
+
+generates combinedcorpus/corpus_description.json file. In this way corpus1
+and corpus2 are combined into combinedcorpus.
+"""
+
+from absl import app
+from absl import flags
+
+from compiler_opt.tools import combine_training_corpus_lib
+
+flags.DEFINE_string('root_dir', '', 'root dir of module paths to combine.')
+
+FLAGS = flags.FLAGS
+
+
+def main(argv):
+  if len(argv) > 1:
+    raise app.UsageError('Too many command-line arguments.')
+
+  combine_training_corpus_lib.combine_corpus(FLAGS.root_dir)
+
+
+if __name__ == '__main__':
+  app.run(main)
diff --git a/llvm/py/src/mlgo/combine_training_corpus_lib.py b/llvm/py/src/mlgo/combine_training_corpus_lib.py
new file mode 100644
index 00..0359961266a240
--- /dev/null
+++ b/llvm/py/src/mlgo/combine_training_corpus_lib.py
@@ -0,0 +1,50 @@
+# coding=utf-8
+# Copyright 2020 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Library for combining training corpora."""
+
+import os
+import json
+
+from absl import logging
+
+import tensorflow as tf
+
+_FILE_NAME = 'corpus_description.json'
+
+
+def combine_corpus(root_dir: str) -> None:
+  module_names = []
+  output_corpus_description = {}
+
+  corpus_description_glob = os.path.join(root_dir, '*/' + _FILE_NAME)
+  for corpus_description_path in tf.io.gfile.glob(corpus_description_glob):
+    logging.info('processing %s', corpus_description_path)
+
+    with