Hi Even,

Your changes have fixed the issue with multiple threads: they can now write 
simultaneously without getting locked. I have run some tests with the GTiff 
format and everything went OK. However, with a custom-made driver for a new 
format (called Wizard), a new problem has arisen :( .

In that case, although the threads finish their tasks and generate their 
corresponding output rasters, 2 weird things happen randomly and (again) only 
when the block cache gets full during writing:
- Some blocks (lines) are missing, so they appear in the file as no-data 
values.
- The resulting file format is corrupted.

It seems that 2 or more threads writing simultaneously (albeit to different 
files) mess up each other's blocks in the cache when trying to flush dirty 
blocks. I have tried to manually flush dirty blocks (calling either 
block->FlushDirtyBlocks() or block->Write()) every time a block is completely 
written into the cache, but it doesn't help.

My driver is simple, nothing too complex; most of it was actually adapted from 
other drivers already implemented. The IWriteBlock method is below, just in 
case I made a mistake.

Do you have any clue what is happening here?


// *********************************************************************

CPLErr WIZARDRasterBand::IWriteBlock(CPL_UNUSED int nBlockXOff, int nBlockYOff,
                                     void *pImage)
{
    CPLAssert(nBlockXOff == 0);
    CPLAssert(eAccess == GA_Update);

    // By default, the count of pixels we want to write is the natural block
    // size in pixels, except for the case of a padded block.
    int nPixelCount = this->nBlockSizeInPixels;
    if (this->bBlockPadding && (nBlockYOff == this->nPaddedBlockYIndex))
    {
        nPixelCount = this->nPaddedBlockSizeInPixels;
    }

    // Seek to the first byte of the desired block. Cast to vsi_l_offset
    // before multiplying so the offset cannot overflow int.
    if (VSIFSeekL(this->poWDS->fp,
                  static_cast<vsi_l_offset>(this->nHeaderSizeInBytes) +
                      static_cast<vsi_l_offset>(this->nBlockSizeInBytes) * nBlockYOff,
                  SEEK_SET) != 0)
    {
        CPLError(CE_Failure, CPLE_FileIO,
                 "Seek failed when writing block to Wizard driver: %s",
                 VSIStrerror(errno));
        return CE_Failure;
    }

    // Now write. VSIFWriteL() returns a size_t, so compare against an
    // unsigned count.
    if (VSIFWriteL(pImage, this->nPixelSizeInBytes, nPixelCount,
                   this->poWDS->fp) != static_cast<size_t>(nPixelCount))
    {
        CPLError(CE_Failure, CPLE_FileIO,
                 "Write failed when dumping block to Wizard driver: %s",
                 VSIStrerror(errno));
        return CE_Failure;
    }

    return CE_None;
}
// *********************************************************************


Thanks,
BR. Javier.



-----Original Message-----
From: gdal-dev [mailto:[email protected]] On Behalf Of Francisco 
Javier Calzado
Sent: 27 September, 2016 09:50
To: Even Rouault <[email protected]>; [email protected]
Subject: Re: [gdal-dev] Multithread deadlock

Hi Even,

Thanks for such a quick fix! I'm gonna apply the patch and recompile GDAL and 
will let you know :)

Keep in touch.
Best Regards,
Javier Calzado



-----Original Message-----
From: Even Rouault [mailto:[email protected]]
Sent: 26 September, 2016 17:53
To: [email protected]
Cc: Francisco Javier Calzado <[email protected]>; Andrew 
Bell <[email protected]>
Subject: Re: [gdal-dev] Multithread deadlock

Hi,

I admire Andrew's enthusiasm and would happily let him tackle the next bug 
reported in this area ;-)

I could reproduce the deadlock with the same stack trace and have pushed a fix 
per https://trac.osgeo.org/gdal/ticket/6661. This was actually not a typical 
deadlock situation, but undefined behaviour caused by trying to acquire a 
recursive mutex that had previously been released more times than it had been 
acquired.

Without this patch, a "workaround" would be to set the 
GDAL_ENABLE_READ_WRITE_MUTEX config option to NO to disable the per-dataset 
mutex. I had added this option since I wasn't really sure that the per-dataset 
mutex wouldn't introduce deadlock situations. But when defining it, you'll get 
undefined behaviour (= potentially crashing or causing corruptions) due to 2 
threads potentially calling the IWriteBlock() method of the same dataset, which 
was the GDAL 1.X behaviour.
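For reference, config options like this one can be set as an environment variable, passed per invocation with the --config switch common to the GDAL command-line utilities, or set programmatically (file names below are placeholders):

```shell
# Environment variable, seen by anything using GDAL in this shell:
export GDAL_ENABLE_READ_WRITE_MUTEX=NO

# Or per invocation of a GDAL utility:
gdal_translate --config GDAL_ENABLE_READ_WRITE_MUTEX NO input.tif output.tif

# Or programmatically, before opening datasets:
#   CPLSetConfigOption("GDAL_ENABLE_READ_WRITE_MUTEX", "NO");
```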

Clearly, multi-threaded scenarios involving writing are the point where the 
global block cache mechanism plus the band-aid of the per-dataset R/W mutex 
show their limits in terms of design & maintenance complexity, and scalability. 
A per-dataset block cache would avoid such headaches (the drawback would be 
having to define a per-dataset block cache size).

Even

> Sure Andrew,
> 
> Here is the call stack from Visual Studio for both threads (I just 
> copied the top frames where GDAL is involved, for easier reading; if 
> you need the whole stack, just let me know):
> 
> THREAD 1:
> 
>                 ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>                 ntdll.dll!_RtlpWaitOnCriticalSection@8()    Unknown
>                 ntdll.dll!_RtlEnterCriticalSection@4()    Unknown
>                 gdal201.dll!CPLAcquireMutex(_CPLMutex * hMutexIn, double dfWaitInSeconds) Line 806    C++
>                 gdal201.dll!GDALDataset::EnterReadWrite(GDALRWFlag eRWFlag) Line 6102    C++
>                 gdal201.dll!GDALRasterBand::EnterReadWrite(GDALRWFlag eRWFlag) Line 5290    C++
>                 gdal201.dll!GDALRasterBlock::Write() Line 742    C++
>                 gdal201.dll!GDALRasterBlock::Internalize() Line 917    C++
>                 gdal201.dll!GDALRasterBand::GetLockedBlockRef(int nXBlockOff, int nYBlockOff, int bJustInitialize) Line 1126    C++
>                 Test.exe!RasterBandPixelAccess::SetValueAtPixel<short>(const int & pX, const int & pY, const short & value) Line 180    C++
> 
> THREAD 2:
> 
>                 ntdll.dll!_NtWaitForSingleObject@12()    Unknown
>                 KernelBase.dll!_WaitForSingleObjectEx@12()    Unknown
>                 kernel32.dll!_WaitForSingleObjectExImplementation@12()    Unknown
>                 kernel32.dll!_WaitForSingleObject@8()    Unknown
>                 gdal201.dll!CPLCondWait(_CPLCond * hCond, _CPLMutex * hClientMutex) Line 937    C++
>                 gdal201.dll!GDALAbstractBandBlockCache::WaitKeepAliveCounter() Line 134    C++
>                 gdal201.dll!GDALArrayBandBlockCache::FlushCache() Line 312    C++
>                 gdal201.dll!GDALRasterBand::FlushCache() Line 865    C++
>                 gdal201.dll!GDALDataset::FlushCache() Line 386    C++
>                 gdal201.dll!GDALPamDataset::FlushCache() Line 159    C++
>                 gdal201.dll!GTiffDataset::Finalize() Line 6180    C++
>                 gdal201.dll!GTiffDataset::~GTiffDataset() Line 6135    C++
>                 gdal201.dll!GTiffDataset::`scalar deleting destructor'(unsigned int)    C++
>                 gdal201.dll!GDALClose(void * hDS) Line 2998    C++
>                 Test.exe!main::__l2::<lambda>(std::basic_string<char,std::char_traits<char>,std::allocator<char> > sourceFilePath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > targetFilePath, int threadID) Line 66    C++
> 
> From: Andrew Bell [mailto:[email protected]]
> Sent: 26 September, 2016 16:06
> To: Francisco Javier Calzado <[email protected]>
> Cc: [email protected]
> Subject: Re: [gdal-dev] Multithread deadlock
> 
> Deadlocks are usually easy to debug if you can get a traceback when 
> deadlocked.  If you can attach with gdb (or run in the debugger) and 
> reproduce and post the stack at the time ('where' from gdb), it should 
> be no problem to fix.  Trying to reproduce on different hardware can 
> be difficult.
> 
> On Mon, Sep 26, 2016 at 9:33 AM, Francisco Javier Calzado 
> <[email protected]> wrote:
> 
> Hi guys,
> 
> I am experiencing a deadlock with just 2 threads in a single-reader & 
> multiple-writer scenario. That is, threads read from the same input 
> file (using different handles) and then write different output files.
> The deadlock comes when the block cache gets filled. The situation is 
> the following:
> 
> 
> - T1 and T2 read datasets D1 and D2, both pointing to the same input 
>   raster (GTiff).
> 
> - The block cache gets filled.
> 
> - T1 tries to lock one block in the cache to write data. But the cache 
>   is full, so it tries to free dirty blocks from T2 (as seen in the 
>   Internalize() method). For that purpose, it apparently requires a 
>   mutex from D2.
> 
> - However, T2 is in a state where it must wait for thread T1 to finish 
>   working with T2's blocks. In this state, T2 holds a mutex acquired 
>   from D2.
> 
> At least, that is what seems to be happening based on the source code.
> Maybe I'm wrong; I don't have a full picture of how GDAL works 
> internally. The thing is that I can reproduce this issue with the 
> following test code and dataset:
> https://drive.google.com/file/d/0B-OCl1FjBi0YSkU3RUozZjc5SnM/view?usp=sharing
> 
> Oddly enough, ticket #6163 is supposed to fix this, but it's failing 
> in my case. I am working with GDAL 2.1.0 under a VS2015 (x32, Debug) 
> build.
> 
> Even, what do you think?
> 
> Thanks!
> Javier C.
> 
> 
> _______________________________________________
> gdal-dev mailing list
> [email protected]<mailto:[email protected]>
> http://lists.osgeo.org/mailman/listinfo/gdal-dev
> 
> 
> 
> --
> Andrew Bell
> [email protected]<mailto:[email protected]>

--
Spatialys - Geospatial professional services http://www.spatialys.com 
_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev
