Hi Michal,
    Thank you for suggesting POCL_CACHE_DIR.   Setting it to a tmpfs
directory unique to each compute node immediately worked around the issue.
    I can now reliably run my application.
    On most Cray systems, $HOME is a DFS mount when mounted on compute
nodes.  I'm sure DFS and NFS have many similarities.

    I would think a better default location for the pocl cache on Linux
would be derived from $TMPDIR rather than $HOME.
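
    A minimal sketch of that idea from the application side, assuming a
POSIX system: derive a node-local directory from $TMPDIR (falling back to
/tmp) and export it as POCL_CACHE_DIR before the first OpenCL call.  The
helper name and the pocl_cache_<uid> naming are my own illustrations, not
anything pocl provides.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of pocl): build a cache path under $TMPDIR,
 * falling back to /tmp, create it if needed, and export it as
 * POCL_CACHE_DIR.  Returns 0 on success, -1 on error. */
static int use_node_local_pocl_cache(char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);

    const char *tmp = getenv("TMPDIR");
    if (tmp == NULL || tmp[0] == '\0')
        tmp = "/tmp";

    /* Suffix with the uid so users sharing a node's /tmp do not collide. */
    int n = snprintf(out, out_len, "%s/pocl_cache_%ld", tmp, (long)getuid());
    if (n < 0 || (size_t)n >= out_len)
        return -1;                      /* path did not fit in the buffer */

    /* Create the directory ourselves; an existing one is fine. */
    if (mkdir(out, 0700) != 0 && errno != EEXIST)
        return -1;

    /* Overwrite any inherited value so every rank uses the local path. */
    return setenv("POCL_CACHE_DIR", out, 1);
}
```

    Call it once at startup, before clGetPlatformIDs() or any other OpenCL
entry point, so the variable is already set when pocl looks it up.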

    I wonder if sys::fs::createUniqueFile() is not so unique after all at
this scale?  Could this lead to a sort of race between the create and the
open(exclusive)...?
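
    For reference, the pattern Michal describes can be sketched roughly as
below.  This is an illustration of the general technique (randomized name
plus open() with O_CREAT | O_EXCL), not LLVM's actual implementation; the
function name, file-name pattern, and retry count are made up.  Per open(2),
O_EXCL on NFS is only supported with NFSv3 or later on kernel 2.6 or later,
so the atomicity ultimately depends on the filesystem.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustration only, not LLVM's code: create a file that is guaranteed to be
 * new by pairing a randomized name with open(O_CREAT | O_EXCL).
 * On success returns an open fd and leaves the chosen path in `path`. */
static int create_unique_file(const char *dir, char *path, size_t path_len)
{
    for (int attempt = 0; attempt < 128; ++attempt) {
        /* rand() is unseeded here, so every process draws the same sequence;
         * colliding names are then resolved by the EEXIST retry below. */
        int n = snprintf(path, path_len, "%s/tmp-%08lx.cl", dir,
                         (unsigned long)rand() & 0xffffffffUL);
        if (n < 0 || (size_t)n >= path_len) {
            errno = ENAMETOOLONG;
            return -1;
        }

        /* O_EXCL turns "check that it does not exist, then create" into one
         * atomic step, provided the filesystem implements it correctly. */
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd >= 0)
            return fd;
        if (errno != EEXIST)
            return -1;                  /* a real error, give up */
        /* EEXIST: someone else owns this name; try another one. */
    }
    errno = EEXIST;
    return -1;
}
```

    If the filesystem does not honor O_EXCL atomically, two processes could
both believe they created the same file, which is the kind of race I have in
mind.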

Cheers,

Noah





On Thu, Dec 27, 2018 at 11:14 AM Michal Babej <[email protected]> wrote:

> Hello,
>
>
> > Is pocl or clang trying to write anything to the working directory?  In
> > my restricted case, /tmp is private to each compute node and thus each
> > process.
>
>
> Not to the working directory (AFAIK; I haven't inspected the entire Clang
> codebase), but pocl writes to its own cache directory, which by default is
> $HOME/.cache/pocl/kcache; you can change it to a different directory by
> setting the POCL_CACHE_DIR env variable.
>
>
> IIRC there have been some issues before when people had the cache dir
> located on NFS shares; is that your case (is your $HOME shared)? You could
> try pointing POCL_CACHE_DIR to /tmp/pocl_cache and see if it makes the
> problem go away. It's possible pocl / Clang makes some assumption about
> the filesystem which does not hold for NFS.
>
>
> In the backtrace you pasted, it seems it's crashing in the preprocessing
> phase. Here pocl writes to a temporary file created by LLVM's
> sys::fs::createUniqueFile(), which in turn uses open() with the exclusive
> flag on a randomized path.
>
>
> Regards,
>
> -- mb
> ------------------------------
> *From:* Noah Reddell <[email protected]>
> *Sent:* Saturday, December 22, 2018 12:09:55 AM
> *To:* [email protected]
> *Subject:* [pocl-devel] intermittent clang ComputeLineNumbers SegFault
>
> Hi,
>
>       I figured it is about time I give pocl a try with my physics
> simulation code.   I've been using Intel's OpenCL library for computing on
> Cray systems with Xeon CPU.
>        Today I built pocl (today's git master ) on a Cray XC40
> using clang+llvm-7.0.0-x86_64-linux-sles12.3
>        I was able to run a simple Hello World kernel as well as clinfo.
> When running my physics application at necessary scale, I'm seeing about
> 0.2% of clBuildProgram fail by SEGFAULT, all with a common stack signature.
> (pasted below)
>        I'm not sure why this would be so intermittent.  I've tried
> reducing to one process per compute node, so only one clBuildProgram would
> be executing on that node at a time.  In this testing, that leaves 90
> processes doing the same program compile simultaneously in the same working
> directory.   Is pocl or clang trying to write anything to the working
> directory?  In my restricted case, /tmp is private to each compute node and
> thus each process.
>      Googling for similar stack language, I find one mention that may
> well be the same bug:
> https://www.mail-archive.com/[email protected]/msg28677.html
> https://bugs.llvm.org/show_bug.cgi?id=39833
>
>     "poclcc" is successful with the same OpenCL kernel source.  I assume
> I'd need to run it hundreds of times, perhaps in parallel to potentially
> trigger the same bug.
>
>       Any advice would be appreciated.  Now that I've thought through the
> situation, I think I should probably create an account and contribute to
> the LLVM bug 39833 discussion with a me-too.
>
> Cheers,
>
> Noah Reddell
>
>
>   WmResidentPatchProcessor::WmResidentPatchProcessor(WmComputeProgram*,
> boost::shared_ptr<WmComputeAssignment const>,
> std::vector<boost::shared_ptr<WmSubDomain const>,
> std::allocator<boost::shared_ptr<WmSubDomain const> > > const&,
> WmComputeMachine&)@wmresidentpatchprocessor.cc:358
>   [email protected]:37
>   compile_and_link_program@pocl_build.c:624
>   pocl_llvm_build_program@pocl_llvm_build.cc:489
>   clang::CompilerInstance::ExecuteAction(clang::FrontendAction&)@0x2aaaabebfd07
>   clang::FrontendAction::Execute()@0x2aaaabf1c106
>   clang::PrintPreprocessedAction::ExecuteAction()@0x2aaaabf22328
>   clang::DoPrintPreprocessedInput(clang::Preprocessor&,
> llvm::raw_ostream*, clang::PreprocessorOutputOptions const&)@0x2aaaabf51226
>   clang::Preprocessor::EnterMainSourceFile()@0x2aaaacc1cabc
>   clang::Preprocessor::EnterSourceFile(clang::FileID,
> clang::DirectoryLookup const*, clang::SourceLocation)@0x2aaaacbf7407
>   (anonymous
> namespace)::PrintPPOutputPPCallbacks::FileChanged(clang::SourceLocation,
> clang::PPCallbacks::FileChangeReason, clang::SrcMgr::CharacteristicKind,
> clang::FileID)@0x2aaaabf5212d
>   clang::SourceManager::getPresumedLoc(clang::SourceLocation, bool)
> const@0x2aaaacc4e00e
>   clang::SourceManager::getLineNumber(clang::FileID, unsigned int, bool*)
> const@0x2aaaacc4e43a
>   *ComputeLineNumbers*(clang::DiagnosticsEngine&,
> clang::SrcMgr::ContentCache*,
> llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>&,
> clang::SourceManager const&, bool&)@0x2aaaacc4e683
>
>
>
> _______________________________________________
> pocl-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/pocl-devel
>
