Hi Even,
Your changes have fixed the issue with multiple threads; they can now write
simultaneously without deadlocking. I've run some tests with the GTiff format
and everything went OK. However, with a custom-made driver for a new format
(called Wizard), a new problem has arisen :( .
In that case, although the threads finish their tasks and generate their
corresponding output rasters, 2 weird things happen randomly and (again) only
when the block cache gets full during writing:
- Some blocks (lines) are missing, so they appear in the file as the no-data
value.
- The resulting file format is corrupted.
It seems that 2 or more threads writing simultaneously (although to different
files) mess up each other's blocks in the cache when trying to flush dirty
blocks. I have tried to flush dirty blocks manually (calling either
block->FlushDirtyBlocks() or block->Write()) every time a block is completely
written to the cache, but it doesn't help.
My driver is simple, nothing too complex; it was actually mostly taken from
other drivers already implemented. Check the IWriteBlock() method below, just
in case I made a mistake.
Do you have any clue what is happening here?
// *********************************************************************
CPLErr WIZARDRasterBand::IWriteBlock(CPL_UNUSED int nBlockXOff, int nBlockYOff,
                                     void *pImage)
{
    CPLAssert(nBlockXOff == 0);
    CPLAssert(eAccess == GA_Update);

    // By default, the count of pixels we want to write will be the natural
    // block size in pixels, except for the case of a padded block.
    int nPixelCount = this->nBlockSizeInPixels;
    if (this->bBlockPadding && (nBlockYOff == this->nPaddedBlockYIndex))
    {
        nPixelCount = this->nPaddedBlockSizeInPixels;
    }

    // Let's move to the first position of the desired block. Cast to
    // vsi_l_offset so the offset arithmetic cannot overflow an int.
    if (VSIFSeekL(this->poWDS->fp,
                  this->nHeaderSizeInBytes +
                      static_cast<vsi_l_offset>(this->nBlockSizeInBytes) *
                          nBlockYOff,
                  SEEK_SET) != 0)
    {
        CPLError(CE_Failure, CPLE_FileIO,
                 "Seek failed when writing block to Wizard driver: %s",
                 VSIStrerror(errno));
        return CE_Failure;
    }

    // Now write.
    if (VSIFWriteL(pImage, this->nPixelSizeInBytes, nPixelCount,
                   this->poWDS->fp) != static_cast<size_t>(nPixelCount))
    {
        CPLError(CE_Failure, CPLE_FileIO,
                 "Write failed when dumping block to Wizard driver: %s",
                 VSIStrerror(errno));
        return CE_Failure;
    }

    return CE_None;
}
// *********************************************************************
Thanks,
BR. Javier.
-----Original Message-----
From: gdal-dev [mailto:[email protected]] On Behalf Of Francisco
Javier Calzado
Sent: 27 September, 2016 09:50
To: Even Rouault <[email protected]>; [email protected]
Subject: Re: [gdal-dev] Multithread deadlock
Hi Even,
Thanks for such a quick fix! I'm gonna apply the patch and recompile GDAL and
will let you know :)
Keep in touch.
Best Regards,
Javier Calzado
-----Original Message-----
From: Even Rouault [mailto:[email protected]]
Sent: 26 September, 2016 17:53
To: [email protected]
Cc: Francisco Javier Calzado <[email protected]>; Andrew
Bell <[email protected]>
Subject: Re: [gdal-dev] Multithread deadlock
Hi,
I admire Andrew's enthusiasm and would happily let him tackle the next bug
reported in this area ;-)
I could reproduce the deadlock with the same stack trace and have pushed a fix
per https://trac.osgeo.org/gdal/ticket/6661. This was actually not a typical
deadlock situation, but undefined behaviour caused by trying to acquire a
recursive mutex that had previously been released more times than it had been
acquired.
Without this patch, a "workaround" would be to set the
GDAL_ENABLE_READ_WRITE_MUTEX config option to NO to disable the per-dataset
mutex. I had added this option since I wasn't really sure that the per-dataset
mutex wouldn't introduce deadlock situations. But when setting it, you'll get
undefined behaviour (= potentially crashing or causing corruption) due to 2
threads potentially calling the IWriteBlock() method of the same dataset, which
was the GDAL 1.X behaviour.
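A sketch of applying that workaround from code (a config fragment only; CPLSetConfigOption() must run before the datasets involved are opened, and the trade-off is exactly the one described above):

```cpp
#include "cpl_conv.h"  // CPLSetConfigOption(), from GDAL's CPL library

int main()
{
    // Workaround sketch only: disables the per-dataset R/W mutex, trading
    // the deadlock for possible corruption caused by two threads calling
    // IWriteBlock() on the same dataset. The equivalent environment
    // variable works as well.
    CPLSetConfigOption("GDAL_ENABLE_READ_WRITE_MUTEX", "NO");
    // ... open the datasets and start the worker threads here ...
    return 0;
}
```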
Clearly, multi-threaded scenarios involving writing are the point where the
global block cache mechanism plus the band-aid of the per-dataset R/W mutex
show their limits in terms of design & maintenance complexity, and scalability.
A per-dataset block cache would avoid such headaches (the drawback would be
having to define a per-dataset block cache size).
Even
> Sure Andrew,
>
> Here is the call stack from Visual Studio for both threads (I just
> copied the top calls where GDAL is involved, for easy reading. If you
> need the whole stack, just let me know):
>
> THREAD 1:
>
> ntdll.dll!_NtWaitForSingleObject@12() Unknown
> ntdll.dll!_RtlpWaitOnCriticalSection@8() Unknown
> ntdll.dll!_RtlEnterCriticalSection@4() Unknown
> gdal201.dll!CPLAcquireMutex(_CPLMutex * hMutexIn, double dfWaitInSeconds) Line 806 C++
> gdal201.dll!GDALDataset::EnterReadWrite(GDALRWFlag eRWFlag) Line 6102 C++
> gdal201.dll!GDALRasterBand::EnterReadWrite(GDALRWFlag eRWFlag) Line 5290 C++
> gdal201.dll!GDALRasterBlock::Write() Line 742 C++
> gdal201.dll!GDALRasterBlock::Internalize() Line 917 C++
> gdal201.dll!GDALRasterBand::GetLockedBlockRef(int nXBlockOff, int nYBlockOff, int bJustInitialize) Line 1126 C++
> Test.exe!RasterBandPixelAccess::SetValueAtPixel<short>(const int & pX, const int & pY, const short & value) Line 180 C++
>
> THREAD 2:
>
> ntdll.dll!_NtWaitForSingleObject@12() Unknown
> KernelBase.dll!_WaitForSingleObjectEx@12() Unknown
> kernel32.dll!_WaitForSingleObjectExImplementation@12() Unknown
> kernel32.dll!_WaitForSingleObject@8() Unknown
> gdal201.dll!CPLCondWait(_CPLCond * hCond, _CPLMutex * hClientMutex) Line 937 C++
> gdal201.dll!GDALAbstractBandBlockCache::WaitKeepAliveCounter() Line 134 C++
> gdal201.dll!GDALArrayBandBlockCache::FlushCache() Line 312 C++
> gdal201.dll!GDALRasterBand::FlushCache() Line 865 C++
> gdal201.dll!GDALDataset::FlushCache() Line 386 C++
> gdal201.dll!GDALPamDataset::FlushCache() Line 159 C++
> gdal201.dll!GTiffDataset::Finalize() Line 6180 C++
> gdal201.dll!GTiffDataset::~GTiffDataset() Line 6135 C++
> gdal201.dll!GTiffDataset::`scalar deleting destructor'(unsigned int) C++
> gdal201.dll!GDALClose(void * hDS) Line 2998 C++
> Test.exe!main::__l2::<lambda>(std::basic_string<char,std::char_traits<char>,std::allocator<char> > sourcefilePath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > targetFilePath, int threadID) Line 66 C++
>
> From: Andrew Bell [mailto:[email protected]]
> Sent: 26 September, 2016 16:06
> To: Francisco Javier Calzado <[email protected]>
> Cc: [email protected]
> Subject: Re: [gdal-dev] Multithread deadlock
>
> Deadlocks are usually easy to debug if you can get a traceback when
> deadlocked. If you can attach with gdb (or run in the debugger) and
> reproduce and post the stack at the time ('where' from gdb), it should
> be no problem to fix. Trying to reproduce on different hardware can
> be difficult.
>
> On Mon, Sep 26, 2016 at 9:33 AM, Francisco Javier Calzado
> <[email protected]> wrote:
>
> Hi guys,
>
> I am experiencing a deadlock with just 2 threads in a single-reader,
> multiple-writer scenario. That is, the threads read from the same input
> file (using different handles) and then write different output files.
> The deadlock comes when the block cache gets filled. The situation is
> the following:
>
>
> - T1 and T2 read datasets D1 and D2, both pointing to the same
> input raster (GTiff).
>
> - Block cache gets filled.
>
> - T1 tries to lock one block in the cache to write data. But the
> cache is full, so it tries to free dirty blocks from T2 (as seen in
> the Internalize() method). For that, it apparently requires a mutex
> from D2.
>
> - However, T2 is in a state where it must wait for thread T1 to finish
> working with T2's blocks. In this state, T2 holds a mutex acquired from D2.
>
> At least, that is what seems to be happening based on the source code.
> Maybe I'm wrong; I don't have a full picture of how GDAL works
> internally. The thing is that I can reproduce this issue with the
> following test code and dataset:
> https://drive.google.com/file/d/0B-OCl1FjBi0YSkU3RUozZjc5SnM/view?usp=sharing
>
> Oddly enough, ticket #6163 is supposed to fix this, but it's failing
> in my case. I am working with GDAL 2.1.0 under a VS2015 (x32, Debug)
> build.
>
> Even, what do you think?
>
> Thanks!
> Javier C.
>
>
> _______________________________________________
> gdal-dev mailing list
> [email protected]<mailto:[email protected]>
> http://lists.osgeo.org/mailman/listinfo/gdal-dev
>
>
>
> --
> Andrew Bell
> [email protected]<mailto:[email protected]>
--
Spatialys - Geospatial professional services http://www.spatialys.com
_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev