Thanks for your response, Rob. I will do my best to answer your questions. 
Please let me know if anything is unclear and more info would help. I 
appreciate your attention to this!

This is a rather powerful Dell workstation running Ubuntu 22.04 LTS, with a 
12-core Intel processor and 503GB RAM.

I'm running as a user with admin privileges, but am not using sudo, so, as I 
understand it, these should not be root processes.

In short, we're running some custom Python code to analyze ~1.3GB hyperspectral 
images, do some linear algebra and output some plots and arrays describing the 
biochemical composition in these images. This is benchmarked to take 2-4GB of 
RAM per image. There is one image per job. By default, parallel is running 24 
jobs, dual-threading on each of 12 cores... There should be plenty of RAM to 
run 24 4GB jobs at once. Since this is an embarrassingly parallel computation 
and we already use bash scripting in this workflow, I prefer to keep it simple 
and use GNU Parallel rather than Python parallel frameworks... it always worked 
great in the past.
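
As a quick check of the memory budget described above, here is a minimal sketch of the arithmetic, assuming the worst-case 4GB figure for every job. The variable names are illustrative and not part of the workflow.

```shell
jobs_at_once=24     # default reported above: 2 jobs on each of 12 cores
per_job_gb=4        # upper end of the benchmarked 2-4GB per image
total_ram_gb=503    # RAM reported for the workstation

# Worst-case memory use if every job peaks at the same time.
peak_gb=$((jobs_at_once * per_job_gb))
headroom_gb=$((total_ram_gb - peak_gb))

echo "Expected peak: ${peak_gb}GB; headroom: ${headroom_gb}GB"
# prints: Expected peak: 96GB; headroom: 407GB
```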

Here is the script I'm calling from the command line, inside the jobs file 
described further below: gmodetector_py/wrappers/analyze_sample.py 
<https://github.com/naglemi/gmodetector_py/blob/master/wrappers/analyze_sample.py>

# This is what we run to execute the .jobs file
parallel -a $job_list_name
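
For reference, a hedged variant of the invocation above that also writes a per-job log using GNU Parallel's --joblog option. The file names and the two placeholder echo jobs are assumptions made for illustration; a real run would point -a at $job_list_name instead. The sketch skips the run if the parallel found on PATH is not GNU Parallel.

```shell
# Placeholder job list with two trivial jobs (stand-ins for the real
# "python wrappers/analyze_sample.py ..." lines).
work_dir=$(mktemp -d)
job_list="$work_dir/demo.jobs"
joblog="$work_dir/demo.joblog"
printf '%s\n' 'echo job-one' 'echo job-two' > "$job_list"
n_jobs=$(wc -l < "$job_list" | tr -d ' ')

# Only run if the installed "parallel" is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # --joblog writes one line per finished job to the given file.
  parallel --joblog "$joblog" --jobs 2 -a "$job_list" \
    || echo "parallel exited with status $?"
  # Show what was logged for each job, if a log was written.
  if [ -f "$joblog" ]; then
    cat "$joblog"
  fi
else
  echo "GNU Parallel not found; skipping the demo run of $n_jobs jobs"
fi

rm -rf "$work_dir"
```

The Exitval and Signal columns of the job log show whether an individual job failed or was killed, which may help narrow down where a run stops.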

# I have also tried limiting the number of jobs to 20, which also leads to the 
# same crashing problem after a few runs.
parallel --jobs 20 -a $job_list_name

# Here is how we prepare the .jobs file. We produce one job per image, each 
# given its own line in a text file, with options set by a bunch of variables in 
# a Jupyter notebook. Note, I have also confirmed it still crashes if we run 
# outside of Jupyter.
for file in $data/*.hdr
do
 if [[ "$file" != *'hroma'* ]] && [[ "$file" != *'roadband'* ]]; then
  echo "python wrappers/analyze_sample.py \
--file_path $file \
--fluorophores ${fluorophores[*]} \
--min_desired_wavelength ${desired_wavelength_range[0]} \
--max_desired_wavelength ${desired_wavelength_range[1]} \
--red_channel ${FalseColor_channels[0]} \
--green_channel ${FalseColor_channels[1]} \
--blue_channel ${FalseColor_channels[2]} \
--red_cap ${FalseColor_caps[0]} \
--green_cap ${FalseColor_caps[1]} \
--blue_cap ${FalseColor_caps[2]} \
--plot 1 \
--spectral_library_path "$spectral_library_path" \
--output_dir $output_directory_full \
--threshold 38" >> $job_list_name
 fi
done
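
To illustrate which files the loop above turns into jobs, here is a minimal sketch of the same filter run against a few dummy files. The file names are hypothetical, and the filter is written as a POSIX case statement, which should behave the same as the [[ ... ]] tests above. The guess that 'hroma' and 'roadband' are meant to match "Chroma"/"chroma" and "Broadband"/"broadband" regardless of capitalization is an assumption.

```shell
# Hypothetical file names in a temporary directory, for illustration only.
data=$(mktemp -d)
touch "$data/sample_A.hdr" "$data/sample_B.hdr" \
      "$data/Chroma_standard.hdr" "$data/sample_A_Broadband.hdr"

kept=0
skipped=0
for file in "$data"/*.hdr
do
  case "$file" in
    # Same exclusions as the loop above: skip paths containing either string.
    *hroma*|*roadband*) skipped=$((skipped + 1)) ;;
    # Everything else would get one line in the jobs file.
    *) kept=$((kept + 1)) ;;
  esac
done

echo "kept=$kept skipped=$skipped"
rm -rf "$data"
```

Comparing the kept count with `wc -l < $job_list_name` is one way to confirm the jobs file holds one line per image. Because the loop appends with >>, an old jobs file left over from an earlier run would make that count larger.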

Thanks again!
________________________________
From: Rob Sargent <robjsarg...@gmail.com>
Sent: Saturday, July 9, 2022 2:59 PM
To: Nagle, Michael F <michael.na...@oregonstate.edu>
Cc: parallel <parallel@gnu.org>
Subject: Re: How to debug `parallel` crash?




On Jul 9, 2022, at 3:34 PM, Nagle, Michael F <michael.na...@oregonstate.edu> 
wrote:


Hello,

First, I’d like to thank the developers and community for producing GNU 
Parallel and supporting it.

I use GNU parallel for a particular part of a scientific workflow, and it 
worked great on a previous machine. On a new machine (with many more cores), 
I’m now having it crash sometimes and am having trouble debugging this.

When it crashes, the terminal it is being run from crashes, so I’m left with no 
error message or clues I can find as to why the crash occurred. How can I 
figure this out?

What I’ve tried and outcomes:
1. Restarting the machine and trying again… GNU parallel never crashes the 
first time it is run after a restart. After several runs, it crashes every 
time, and the machine needs to be restarted again before it will work. This 
leads me to suspect some kind of zombie processes may be left behind, but I 
don’t see anything suspicious with `top`.
2. Looking for log files… These could be very helpful and informative if 
they’re out there. I looked in /var/logs/ and in the directory from which 
`parallel` is being run, but haven’t found logs. I haven’t been able to find 
info about logs in documentation. Are there logs I should be able to find, and 
where?

Any advice for diagnosing and troubleshooting the problem would be greatly 
appreciated. Thanks for your time and help.


Michael Nagle

PhD Candidate, Molecular and Cellular Biology

Forest Biotechnology Laboratory 
<http://people.forestry.oregonstate.edu/steve-strauss/home-page>

Oregon State University

301-974-7221 (cell)


Are you crashing a Linux machine?  That would be impressive. Are you running as 
root? That would be dangerous.

Show the command line which causes the crash. Show all of it. In plain text. 
Describe the data files. Maybe a hint at what the processing does. Describe 
your machine.