[GitHub] [incubator-tvm] trevor-m opened a new issue #6673: [CI] Segfault in ci-gpu image related to xgboost version

GitBox Mon, 12 Oct 2020 15:58:27 -0700


trevor-m opened a new issue #6673:
URL: https://github.com/apache/incubator-tvm/issues/6673



   This issue came up for this PR to add TRT BYOC support: 
https://github.com/apache/incubator-tvm/pull/6395#issuecomment-707363920
   Example failed CI run: 
https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-6395/30/pipeline
   
   It seems that enabling TRT BYOC codegen (`set(USE_TENSORRT_CODEGEN ON)`) 
exposed an unrelated bug found by `apps/bundle_deploy/bundle_deploy.py` during 
`tests/scripts/task_cpp_unittest.sh`. The Python program segfaults when run. We 
believe the issue is not with this test itself, but it just happens to be the 
first thing that runs a TVM Python session and quits after building TVM.
   
   I reproduced the error inside of GDB to get the backtrace.
   ```
   Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
   __GI___libc_free (mem=0x6) at malloc.c:2958
   2958    malloc.c: No such file or directory.
   (gdb) bt
   #0  __GI___libc_free (mem=0x6) at malloc.c:2958
   #1  0x00007fffde4937f4 in 
dmlc::parameter::FieldAccessEntry::~FieldAccessEntry() () from 
/workspace/build/libtvm.so
   #2  0x00007fff9702a4af in 
dmlc::parameter::FieldEntry<std::string>::~FieldEntry() () from 
/usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
   #3  0x00007fff97037267 in dmlc::parameter::ParamManager::~ParamManager() () 
from /usr/local/lib/python3.6/dist-packages/xgboost/./lib/libxgboost.so
   #4  0x00007ffff6cd7008 in __run_exit_handlers (status=0, 
listp=0x7ffff70615f8 <__exit_funcs>, 
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
   #5  0x00007ffff6cd7055 in __GI_exit (status=<optimized out>) at exit.c:104
   #6  0x00007ffff6cbd847 in __libc_start_main (main=0x4d1cb0 <main>, argc=5, 
argv=0x7fffffffe858, init=<optimized out>, fini=<optimized out>, 
rtld_fini=<optimized out>, stack_end=0x7fffffffe848) at ../csu/libc-start.c:325
   #7  0x00000000005e8569 in _start ()
   ```
   
   [TVM's setup.py requires a minimum XGBoost version of 
1.1.0](https://github.com/apache/incubator-tvm/blob/main/python/setup.py#L172), 
but I noticed the CI docker image only has 1.0.2. I tried 1.1.0 and 1.2.0 and 
found that both fixed the issue. 
   ```
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o 
build --test
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   Segmentation fault (core dumped)
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3
   Python 3.6.10 (default, Dec 19 2019, 23:04:32) 
   [GCC 5.4.0 20160609] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import xgboost
   >>> xgboost.__version__
   '1.0.2'
   >>> exit()
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user 
xgboost==1.1.0
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o 
build --test
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ pip3 install --user 
xgboost==1.2.0
   ubuntu@ip-172-31-83-183:~/apps/bundle_deploy$ python3 build_model.py -o 
build --test
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   INFO:compile_engine:Using injective.cpu for add based on highest priority 
(10)
   ```
   
   The issue looks similar to this one: 
https://discuss.xgboost.ai/t/segfault-during-code-cleanup/1365/6
   I have encountered this exact error when using TVM with xgboost or 
Treelite in the same program when the dmlc-core commits do not match up. 
Perhaps dmlc-core should check for nullptr before deleting the entries?
   
   So it looks like we can fix this by upgrading the xgboost version in the CI 
containers. It would be good to make the xgboost version consistent with the 
minimum version 1.1.0 in the `setup.py`. However, dmlc-core still seems to have 
some buggy behavior that the upgrade won't completely fix.
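
A quick way to check whether an environment meets that minimum is sketched below. This is only an illustration, not something in the TVM tree: `MIN_XGBOOST`, `parse_version` and `meets_minimum` are hypothetical names, and the 1.1.0 floor is taken from the `setup.py` link above.

```python
import re

# Assumption: minimum version taken from the setup.py requirement cited in this issue.
MIN_XGBOOST = (1, 1, 0)


def parse_version(text):
    """Turn a string such as '1.0.2' into (1, 0, 2).

    Only the leading digits of the first three dot-separated pieces are
    used, so a suffix such as 'rc1' is ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions such as '1.1' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_XGBOOST):
    """Return True if the installed version string is at least the minimum."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    try:
        import xgboost
    except ImportError:
        print("xgboost is not installed")
    else:
        status = "OK" if meets_minimum(xgboost.__version__) else "too old"
        print("xgboost", xgboost.__version__, status)
```

Run inside the CI container, this would flag the 1.0.2 install shown above as too old. It only compares version numbers and does not detect mismatched dmlc-core commits.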
   
   @areusch @comaniac @zhiics @tqchen 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
