[issue43668] Segfault with for fresh ubuntu 20.04 install

axel Tue, 30 Mar 2021 04:05:55 -0700


New submission from axel <axel.gote...@mindified.com>:


The Python interpreter segfaults when running in a Miniconda environment on a 
fresh install of Ubuntu 20.04.2. This happens intermittently, both while 
running "pip" during the conda setup of an environment and during the 
execution of code like that below. The issue has mostly been reproduced with 
conda, but it seems to happen regardless of conda, which is why I suspect it is 
a Python bug. It is very odd that I can't seem to find anyone else with the 
same issue.

The segfault always occurs when running the code at the end of this report, 
which reads texts from files and tokenizes the result. The segfault location 
changes from run to run. The exact same code runs fine on another computer 
with the same conda environment on Ubuntu 18.04.

The core dumps always point to some function in the unicodeobject.c file in 
Python, but the exact function changes from crash to crash. At least one crash 
clearly dereferences a null pointer (0x0) where the "unicode object" should be.

My guess is that something causes the Python interpreter to throw away the 
unicode object being pointed to while it is still in use, causing a segfault. 
But any bug in the interpreter or NLTK should have been noticed by more users, 
and I cannot find anyone with similar issues.

Things tried that didn't fix the issue:
1. Reformatting and reinstalling Ubuntu
2. Switching to Ubuntu 18.04 on this computer (another computer with 18.04 can 
run the code just fine)
3. Replacing hardware, to ensure that the RAM or SSD isn't broken
4. Changing to Python versions 3.8.6, 3.8.8, 3.9.2
5. Cloning the conda environment from a working computer to the broken one

Attached is one stack trace from the fault handler along with its corresponding 
core dump stack trace from gdb.

```
(eo) axel@minimind:~/test$ python tokenizer_mini.py 
2021-03-30 11:10:15.588399: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-30 11:10:15.588426: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Fatal Python error: Segmentation fault

Current thread 0x00007faa73bbe740 (most recent call first):
  File "tokenizer_mini.py", line 36 in preprocess_string
  File "tokenizer_mini.py", line 51 in <module>
Segmentation fault (core dumped)
```

```
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  find_maxchar_surrogates (num_surrogates=<synthetic pointer>, maxchar=<synthetic pointer>, end=0x4 <error: Cannot access memory at address 0x4>, begin=0x0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:1703
#3  _PyUnicode_Ready (unicode=0x7f7e4e04d7f0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:1742
#4  0x000055cd65f6df6a in PyUnicode_RichCompare (left=0x7f7e4cf43fb0, right=<optimized out>, op=2)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:11205
#5  0x000055cd6601712a in do_richcompare (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:726
#6  PyObject_RichCompare (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:774
#7  PyObject_RichCompareBool (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:796
#8  list_contains (a=0x7f7e4e04b4c0, el=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/listobject.c:455
#9  0x000055cd660be41b in PySequence_Contains (ob=0x7f7e4cf43fb0, seq=0x7f7e4e04b4c0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/abstract.c:2083
#10 cmp_outcome (w=0x7f7e4e04b4c0, v=0x7f7e4cf43fb0, op=<optimized out>, tstate=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:5082
#11 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:2977
#12 0x000055cd6609f706 in PyEval_EvalFrameEx (throwflag=0, f=0x7f7e4f4d3c40)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:738
#13 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/call.c:284
#14 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/call.c:411
#15 0x000055cd660be54f in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f7f391985b8, callable=0x7f7f39084160)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Include/cpython/abstract.h:115
#16 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55cd66c2e880)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4963
#17 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:3500
#18 0x000055cd6609e503 in PyEval_EvalFrameEx (throwflag=0, f=0x7f7f39198440)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4298
#19 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>, kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>, name=<optimized out>, qualname=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4298
#20 0x000055cd6609f559 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4327
#21 0x000055cd661429ab in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:718
#22 0x000055cd66142a43 in run_eval_code_obj (co=0x7f7f3910f240, globals=0x7f7f391fad80, locals=0x7f7f391fad80)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1165
#23 0x000055cd6615c6b3 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7f7f391fad80, locals=0x7f7f391fad80, flags=<optimized out>, arena=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1187
#24 0x000055cd661615b2 in pyrun_file (fp=0x55cd66c2cdf0, filename=0x7f7f391bbee0, start=<optimized out>, globals=0x7f7f391fad80, locals=0x7f7f391fad80, closeit=1, flags=0x7ffe3ee6f8e8)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1084
#25 0x000055cd66161792 in pyrun_simple_file (flags=0x7ffe3ee6f8e8, closeit=1, filename=0x7f7f391bbee0, fp=0x55cd66c2cdf0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:439
#26 PyRun_SimpleFileExFlags (fp=0x55cd66c2cdf0, filename=<optimized out>, closeit=1, flags=0x7ffe3ee6f8e8)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:472
#27 0x000055cd66161d0d in pymain_run_file (cf=0x7ffe3ee6f8e8, config=0x55cd66c2da70)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:391
#28 pymain_run_python (exitcode=0x7ffe3ee6f8e0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:616
#29 Py_RunMain () at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:695
#30 0x000055cd66161ec9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:1127
#31 0x00007f7f3a3620b3 in __libc_start_main (main=0x55cd65fe3490 <main>, argc=2, argv=0x7ffe3ee6fae8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe3ee6fad8) at ../csu/libc-start.c:308
#32 0x000055cd660d7369 in _start () at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ast.c:937
```
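
Frames #2 to #9 of the gdb trace go from PySequence_Contains through 
list_contains and PyUnicode_RichCompare (op=2, i.e. equality) into 
_PyUnicode_Ready. That appears to be the path taken by a membership test of a 
str against a list of str. A minimal sketch, assuming that reading of the trace 
is right, that exercises only this path with the standard library, so NLTK and 
TensorFlow can be ruled in or out. The function names and the sample data are 
hypothetical stand-ins and are not part of the original script:

```python
def count_non_members(tokens, stop_list):
    """Count tokens that are not in stop_list.

    stop_list is deliberately a list (not a set) so that each test goes
    through list containment and str == str comparison.
    """
    return sum(1 for token in tokens if token not in stop_list)


# Hypothetical stand-in data; the real script reads patent texts from disk.
STOP_LIST = ["the", "a", "an", "of", "and"]
TOKENS = ["the", "patent", "describes", "a", "método", "of", "naïve", "stemming"]


def run_membership_check(iterations=100_000):
    """Repeat the membership test and fail loudly if the count ever changes."""
    expected = count_non_members(TOKENS, STOP_LIST)
    for i in range(iterations):
        # Rebuild every token so that new str objects are compared each time.
        fresh = ["".join(list(token)) for token in TOKENS]
        got = count_non_members(fresh, STOP_LIST)
        if got != expected:
            raise RuntimeError(f"iteration {i}: got {got}, expected {expected}")
    return expected


if __name__ == "__main__":
    print(run_membership_check())
```

Running this with `python -X faulthandler` gives the same kind of Python-level 
traceback as above if it crashes. A crash or a RuntimeError here would point 
away from the third-party packages; a clean run would not prove anything 
either way.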

The conda environment used is below, created with 
Miniconda3-py38_4.9.2-Linux-x86_64.sh (note that the segfault sometimes occurs 
during the setup of a conda environment, so it is probably not related to the 
env):
```
name: eo
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8.8
  - pip=20.3.1
  - pip:
    - transformers==4.3.2
    - tensorflow_gpu==2.4.0
    - scikit-learn==0.23.2
    - nltk==3.5
    - matplotlib==3.2.1
    - seaborn==0.11.0
    - tensorflow-addons==0.11.2
    - tf-models-official==2.4.0
    - gspread==3.6.0
    - oauth2client==4.1.3
    - ipykernel==5.4.2
    - autopep8==1.5.4
    - torch==1.7.1
```
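
Since the same environment works on another computer, it may help to record 
what the interpreter itself reports on each machine and compare the two. A 
minimal sketch using only the standard library; the function name and the 
choice of fields are assumptions, not something taken from the original setup:

```python
import platform
import sys


def environment_report():
    """Collect interpreter and platform details for side-by-side comparison."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "compiler": platform.python_compiler(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Saving the output on the working and the broken computer and diffing the two 
files shows whether the interpreter build or the kernel differs between them.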


The code below consistently reproduces the problem. The files read are plain 
text files containing Unicode text:

```python
from nltk.tokenize import wordpunct_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import pickle
from pathlib import Path
import faulthandler
faulthandler.enable()


def load_data(root_path, feature, index):
    feature_root = root_path / feature
    dir1 = str(index // 10_000)
    base_path = feature_root / dir1 / str(index)
    full_path = base_path.with_suffix('.txt')
    data = None
    with open(full_path, 'r', encoding='utf-8') as f:
        data = f.read()
    return data


def preprocess_string(text, stemmer, stop_words):
    word_tokens = wordpunct_tokenize(text.lower())
    alpha_tokens = []
    for w in word_tokens:
        try:
            if (w.isalpha() and w not in stop_words):
                alpha_tokens.append(w)
        except:
            print("Something went wrong when handling the word: ", w)

    clean_tokens = []
    for w in alpha_tokens:
        try:
            word = stemmer.stem(w)
            clean_tokens.append(word)
        except:
            print("Something went wrong when stemming the word: ", w)
            clean_tokens.append(w)
    return clean_tokens


stop_words = stopwords.words('english')
stemmer = SnowballStemmer(language='english')
tokenizer = Tokenizer()

root_path = '/srv/patent/EbbaOtto/E'
for idx in range(0, 57454):
    print(f'Processed {idx}/57454', end='\r')
    desc = str(load_data(Path(root_path), 'clean_description', idx))
    desc = preprocess_string(desc, stemmer, stop_words)
    tokenizer.fit_on_texts([desc])

```

For more readable formatting, see the Stack Overflow post about the same 
issue:
https://stackoverflow.com/questions/66868753/segfault-with-for-fresh-ubuntu-20-04-install-using-conda

----------
components: Interpreter Core
messages: 389816
nosy: axel_1234
priority: normal
severity: normal
status: open
title: Segfault with for fresh ubuntu 20.04 install
type: crash
versions: Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43668>
_______________________________________